title: impact of lockdown on the epidemic dynamics of covid-19 in france
authors: roques, lionel; klein, etienne k.; papaïx, julien; sar, antoine; soubeyrand, samuel
journal: front med (lausanne)

abstract: the covid-19 epidemic was reported in the hubei province in china in december 2019 and then spread around the world, reaching the pandemic stage at the beginning of march 2020. since then, several countries went into lockdown. using a mechanistic-statistical formalism, we estimate the effect of the lockdown in france on the contact rate and the effective reproduction number r e of covid-19. we obtain a reduction by a factor (r e = . , %-ci: . – . ), compared to the estimates carried out in france at the early stage of the epidemic. we also estimate the fraction of the population that would be infected by the beginning of may, the official date at which the lockdown should be relaxed. we find a fraction of . % ( %-ci: . – . %) of the total french population, without taking into account the number of recovered individuals before april 1st, which is not known. this proportion is seemingly too low to reach herd immunity. thus, even if the lockdown strongly mitigated the first epidemic wave, keeping a low value of r e is crucial to avoid an uncontrolled second wave (initiated with many more infectious cases than the first wave) and hence to avoid the saturation of hospital facilities.

the covid-19 epidemic was reported in the hubei province in china in december 2019 and then spread around the world, reaching the pandemic stage at the beginning of march 2020 ( ). to slow down the epidemic, several countries went into lockdown with different levels of restrictions.
in the hubei province, where the lockdown was set long before the other countries (on january 23), the epidemic has reached a plateau, with only sporadic new cases by april [from the data of the johns hopkins university center for systems science and engineering ( )]. in france, the first cases of covid-19 were detected on january 24, and the lockdown was set on march 17. this national lockdown entails important restrictions on movement, with a mandatory home confinement except for essential journeys including food shopping, care, 1-h individual sporting activity, and work when teleworking is not possible, and the closing of the borders of the schengen area. it also includes closures of schools and universities as well as all non-essential public places, including shops (except for food shopping), restaurants, cafés, cinemas, and nightclubs. the basic reproduction number r0 corresponds to the expected number of new cases generated by a single infectious case in a fully susceptible population ( ). several studies, mostly based on chinese data, aimed at estimating the r0 associated with the covid-19 epidemic, leading to values from . to . , with an average of . ( ). as the value of r0 can be interpreted as the product of the contact rate and the duration of the infectious period, and since the objective of the lockdown and associated restriction strategies is precisely to decrease the contact rate, an important effect on the number r e of secondary cases generated by an infectious individual is to be expected. this value r e is often referred to as the "effective reproduction number," and corresponds to the counterpart of r0 in a population that is not fully susceptible ( ). if r e > 1, the number of infectious cases in the population follows an increasing trend, and the larger r e, the faster this trend. on the contrary, if r e < 1, the epidemic will gradually die out.
the control measures in china have been shown to have a significant effect on the covid-19 epidemic, with growth rates that shifted from positive to negative values (corresponding to r e < 1) within weeks ( ). the study ( ) showed that containment policies in hubei province also led to a subexponential growth in the number of cases, consistent with a decrease in the effective reproduction number r e. fitting a seir epidemic model to time series of reported cases from provinces in china, tian et al. ( ) found a basic reproduction number r0 = . before the implementation of the emergency response in china, a value that was divided severalfold once the control measures were fully effective. using contact survey data for wuhan and shanghai, it was estimated in zhang et al. ( ) that the effective reproduction number was divided by a factor in wuhan and in shanghai. standard epidemiological models generally rely on sir (susceptible-infected-removed) systems of ordinary differential equations and their extensions [for examples of application to the covid-19 epidemic, see ( , )]. with these models, and more generally for most deterministic models based on differential equations, when the loss of information due to the observation process is heavy, specific approaches have to be used to bridge the gap between the models and the data. one of these approaches is based on the mechanistic-statistical formalism, which uses a probabilistic model to connect the data collection process and the latent variable described by the ode model. milestone articles and textbooks have been written about this approach or related approaches ( ), which is becoming standard in ecology ( , ). the application of this approach to human epidemiological data is still rare. in a previous study ( ), we applied this framework to the data corresponding to the beginning of the epidemic in france (from february to march ), with an sir model.
our primary objective was to assess the infection fatality ratio (ifr), defined as the number of deaths divided by the number of infected cases. as the number of people that have been infected is not known, this quantity cannot be directly measured, even now (as of april ). the mechanistic-statistical framework allowed us to compute an ifr of . % ( %-ci: . – . %), which was consistent with previous findings in china ( . %) and in the uk ( . %) ( ), and lower than the value previously computed from the diamond princess cruise ship data ( . %) ( ). in this previous study, we also computed the r0 in france, and we found a value of . ( %-ci: . – . ). although the number of tests at that stage was low, an advantage of working with the data from the beginning of the epidemic was that the initial state of the epidemic was known. here, we develop a new mechanistic-statistical approach, based on a sird model (d being the dead cases compartment), with the aim of:
• estimating the effect of the lockdown in france on the contact rate and the effective reproduction number r e;
• estimating the number of infectious individuals and the fraction of the population that has been infected by the beginning of may (the official date at which the lockdown should be relaxed).
we obtained the number of positive cases and deaths in france, day by day, from santé publique france ( ), from march to april . we obtained weekly data on the number of individuals tested (in private laboratories and hospitals) from the same source. we assumed that during each of these weeks the number of tests per day was constant. this assumption is consistent with the small variation between the number of tests during the first week ( , ) and the second week of observation ( , ). the mechanistic-statistical framework consists in the combination of a mechanistic model that describes the epidemiological process, a probabilistic observation model, and an inference procedure.
the dynamics of the epidemic are described by the following sird compartmental model:

s′(t) = −α s(t) i(t)/n,
i′(t) = α s(t) i(t)/n − (β + γ) i(t),
r′(t) = β i(t),
d′(t) = γ i(t),

with s the susceptible population, i the infectious population, r the recovered population, d the number of deaths due to the epidemic, and n the total population. for simplicity, we assume that n is constant, equal to the current french population, thereby neglecting the effect of the small variations of the population on the coefficient α/n. the parameter α is the contact rate (to be estimated) and 1/β is the mean time until an infectious individual recovers. based on the results in zhou et al. ( ), the median period of viral shedding is days, but the infectiousness tends to decay before the end of this period: the results in he et al. ( ) indicate that infectiousness starts - days before symptom onset and declines significantly days after symptom onset. based on these observations, we assume here that the mean duration of the infectiousness period is 1/β = days. in li et al. ( ), the duration of the incubation period was estimated to have a mean of . days. thus, the mean duration of the non-infectious exposed period is relatively short (about - days), and can be neglected without much difference in the results, as shown in liu et al. ( ). inclusion of an exposed compartment (as in seir models) is particularly relevant when exposed individuals can indirectly transmit the disease, e.g., through insect vectors [e.g., ( )], which is seemingly not the case for coronaviruses. the parameter γ corresponds to the death rate of the infectious (to be estimated). the model is started at a date t0 corresponding to april 1st. the initial number of infectious individuals i(t0) = i0 is not known and will be estimated. the total number of recovered at time t0 is also not known. however, as the compartment r has no feedback on the other compartments, we may assume without loss of generality that r(t0) = 0, thereby considering only the new recovered individuals, starting from the date t0.
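the sird dynamics above can be sketched with a minimal numerical integration. since the paper's estimated parameter values are not recoverable from this copy, the rates and initial conditions below (α, β, γ, i0) are illustrative placeholders only, and a simple forward-euler scheme stands in for the matlab ode solver used by the authors:

```python
# minimal forward-euler integration of the sird system; parameter values
# are illustrative placeholders, not the paper's estimates
N = 67e6            # approximate french population
alpha = 0.05        # contact rate (per day) -- placeholder
beta = 1.0 / 10.0   # 1/beta = mean infectious period, assumed 10 days here
gamma = 5e-4        # death rate of the infectious -- placeholder

def simulate_sird(S, I, R, D, days, dt=0.01):
    """integrate s' = -a*s*i/n, i' = a*s*i/n - (b+g)*i, r' = b*i, d' = g*i."""
    for _ in range(int(days / dt)):
        dS = -alpha * S * I / N
        dI = alpha * S * I / N - (beta + gamma) * I
        dR = beta * I
        dD = gamma * I
        # simultaneous euler update (derivatives sum to zero, so s+i+r+d is conserved)
        S, I, R, D = S + dt * dS, I + dt * dI, R + dt * dR, D + dt * dD
    return S, I, R, D

I0 = 1e5
S, I, R, D = simulate_sird(N - I0, I0, 0.0, 0.0, days=30)
```

with these placeholder rates, r e = α/(β + γ) is below 1, so i(t) decays over the simulated month, mirroring the post-lockdown regime described in the text.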
we fixed d(t0) = , the number of deaths at hospital by march . the initial s population, at the beginning of the period, should still be close to the total french population: by march only , cases had been observed in france, corresponding to . % of the total population. a factor had been estimated in roques et al. ( ) between the cumulated number of observed cases and the actual number of cases at the beginning of the epidemic. even though this factor may have changed, this means that the proportion of the total population that had been infected by march was still small. we can get an upper bound for the cumulated number of cases by march by dividing the number of hospital deaths at the end of the observation period ( , by april ) by the hospital ifr [ . %, as estimated in ( )], leading to about million cases. this means that the value of s(t0) is between and million cases. for our computation, we assumed that s(t0) = · , corresponding to about . % of the french population. as shown in figure s , our results are not very sensitive to the value of s(t0) (at least when s/n remains close to 1). the ode system ( ) was solved with a standard numerical algorithm, using the matlab ode solver. the number of cases tested positive on day t, denoted by δ̂ t, is modeled by independent binomial laws, conditionally on the number of tests n t carried out on day t, and on p t, the probability of being tested positive in this sample. the tested population consists of a fraction of the infectious cases and a fraction of the susceptibles: n t = τ1(t) i(t) + τ2(t) s(t). thus, p t = σ i(t)/(i(t) + κ s(t)), with κ := τ2(t)/τ1(t) the relative probability of undergoing a screening test for an individual of type s vs an individual of type i. we assumed that the ratio κ was independent of t over the observation period. the coefficient σ corresponds to the sensitivity of the test.
in most cases, rt-pcr tests have been used, and existing data indicate that the sensitivity of this test using pharyngeal and nasal swabs is about - % ( ). we assumed here σ = . ( % sensitivity). each day, the number of new observed deaths (excluding nursing homes), denoted by μ̂ t, is modeled by independent poisson distributions conditionally on the process d(t), with mean value d(t) − d(t−1) (which measures the daily increment in the number of deaths). note that the time t in ( ) is a continuous variable, while the observations δ̂ t and μ̂ t are reported at discrete times. for the sake of simplicity, we used the same notation t for the days in both the discrete and continuous cases. in the formulas ( ) and ( ), i(t), s(t), and d(t) are computed at the end of day t. the unknown parameters are α, γ, κ, and i0. we used a bayesian method ( ) to estimate the posterior distribution of these parameters. the likelihood l is defined as the probability of the observations (here, the increments {δ̂ t, μ̂ t}) conditionally on the parameters. using the observation models ( ) and ( ), and using the assumption that the increments δ̂ t and μ̂ t are independent conditionally on the underlying sird process and that the number of tests n t is known, we get:

l(α, γ, κ, i0) = ∏ t=ti..tf binomial(δ̂ t; n t, p t) × poisson(μ̂ t; d(t) − d(t−1)),

with ti the date of the first observation and tf the date of the last observation. in this expression, l(α, γ, κ, i0) depends on α, γ, κ, i0 through p t and d(t). the posterior distribution corresponds to the distribution of the parameters conditionally on the observations: p(α, γ, κ, i0 | {δ̂ t, μ̂ t}) = l(α, γ, κ, i0) π(α, γ, κ, i0) / c, where π(α, γ, κ, i0) corresponds to the prior distribution of the parameters (detailed below) and c is a normalization constant independent of the parameters. regarding the contact rate α, the initial number of infectious cases i0, and the probability κ, we used independent noninformative uniform prior distributions in the intervals α ∈ ( , ), i0 ∈ ( , ), and κ ∈ ( , ).
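the likelihood described above (a binomial term for daily positives and a poisson term for daily death increments) can be written down directly. the function below is a hedged sketch, not the authors' code: σ and κ are assumed inputs, and the sird trajectories are supplied as arrays of end-of-day model values.

```python
import math

def log_likelihood(obs_pos, n_tests, obs_deaths, I, S, D, kappa, sigma=0.8):
    """log-likelihood of daily observations given sird trajectories.

    obs_pos[t]    -- cases tested positive on day t (delta_t)
    n_tests[t]    -- number of tests performed on day t
    obs_deaths[t] -- newly observed deaths on day t (mu_t)
    I, S, D       -- model values at the end of each day
    sigma         -- assumed test sensitivity (placeholder default)
    """
    ll = 0.0
    for t in range(len(obs_pos)):
        # binomial term with p_t = sigma * I / (I + kappa * S)
        p = sigma * I[t] / (I[t] + kappa * S[t])
        ll += (math.lgamma(n_tests[t] + 1) - math.lgamma(obs_pos[t] + 1)
               - math.lgamma(n_tests[t] - obs_pos[t] + 1)
               + obs_pos[t] * math.log(p)
               + (n_tests[t] - obs_pos[t]) * math.log(1 - p))
        # poisson term with mean equal to the daily death increment d(t) - d(t-1)
        lam = D[t] - (D[t - 1] if t > 0 else 0.0)
        ll += obs_deaths[t] * math.log(lam) - lam - math.lgamma(obs_deaths[t] + 1)
    return ll

# example with toy numbers: 8 positives out of 100 tests, 2 new deaths
ll = log_likelihood([8], [100], [2], I=[1000.0], S=[9000.0], D=[2.0], kappa=1.0)
```

as expected for a likelihood, observations close to the model's expectation (here 8 positives against an expected 8) score higher than discordant ones.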
to overcome identifiability issues, we used an informative prior distribution for γ. this distribution, say f g, was obtained in roques et al. ( ) during the early stage of the epidemic (f g is depicted in figure s ). in roques et al. ( ), the number of infectious cases i0 at the beginning of the epidemic was known (equal to ), and did not need to be estimated. thus, we estimated in roques et al. ( ) the distribution of the parameter γ by computing the distribution of the infectious class and using the formula d′(t) = γ i(t) together with mortality data (which were not used for the estimation of the other parameters, unlike in the present study). finally, the prior distribution is defined as follows: π(α, γ, κ, i0) = 1_{(α,κ,i0) ∈ ( , )×( , )×( , )} f g(γ), i.e., the indicator of the uniform prior support times the prior density of γ. the numerical computation of the posterior distribution is performed with a metropolis-hastings (mcmc) algorithm, using independent chains, each with iterations, starting from the posterior mode. to find the posterior mode, we used the bfgs constrained minimization algorithm, applied to −ln(l) − ln(π), via the matlab function fmincon. in order to find a global minimum, we applied this method starting from , random initial values. the matlab codes are available as supplementary material. denote by (α*, γ*, κ*, i0*) the posterior mode, and by s*(t), i*(t), r*(t), d*(t) the solutions of the system ( ) associated with these parameter values. the observation model ( ) implies that the associated expected number of cases tested positive on day t is n t p t* (the expectation of a binomial), and the observation model ( ) implies that the expected cumulated number of deaths on day t is d*(t).
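a random-walk metropolis-hastings sampler of the kind mentioned above can be sketched in a few lines. this generic single-parameter version, run here on a toy standard-normal log-posterior, illustrates only the accept/reject mechanics, not the authors' multi-chain matlab implementation:

```python
import math
import random

def metropolis_hastings(log_post, x0, n_iter=5000, step=0.1, seed=0):
    """generic random-walk metropolis sampler (a sketch, not the authors' code)."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, step)       # gaussian proposal
        lp_prop = log_post(prop)
        # accept with probability min(1, exp(lp_prop - lp))
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

# toy target: standard normal log-posterior for a single parameter
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0,
                              n_iter=20000, step=1.0)
mean = sum(samples) / len(samples)
```

on this toy target the chain's sample mean and variance should approach 0 and 1; in the paper's setting, log_post would be the sum of the sird log-likelihood and the log-prior over (α, γ, κ, i0).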
to assess model fit, we compared these expectations with the observations, i.e., the cumulated number of cases tested positive, ĉ t := c0 + Σ s=t0..t δ̂ s, with c0 the number of cases tested positive by march (c0 = , ), and the cumulated number of deaths, m t := m0 + Σ s=t0..t μ̂ s, with m0 the number of reported deaths (at hospital) by march (m0 = ). the results are presented in figure . we observe a good match with the data. the pairwise posterior distributions of the parameters (α, i0), (α, γ), (α, κ), (γ, i0), (γ, κ), (κ, i0) are depicted in figure s . with the exception of the parameter γ (figure s ), for which we chose an informative prior, the posterior distribution is clearly different from the prior distribution, showing that new information was indeed contained in the data. the effective reproduction number can be simply derived from the relation r e = α/(β + γ) when s is close to n ( ). the distribution of r e is therefore easily derived from the marginal posterior distribution of the contact rate α (since we fixed the value of β; see above). it is depicted in figure . we observe a mean value of r e of . ( %-ci: . - . ). the marginal posterior distribution of i0 indicates that the number of infectious individuals at the beginning of the considered period (i.e., april 1st) is . · ( %-ci: . · - . · ). the computation of the solution of ( ) with the posterior distribution of the parameters leads to a number of infectious cases i(tf) = . · and a total number of infected cases (including recovered) (i + r)(tf) = . · at the end of the observation period (april ). by may , if the restriction policies remain unchanged, we get a forecast of i(t) = . · infectious cases ( %-ci: . · - . · ) and (i + r)(t) = . · infected cases including recovered ( %-ci: . · - . · ). the dynamics of the distributions of i and i + r are depicted in figure .
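translating posterior draws of the contact rate into a distribution of r e = α/(β + γ), and summarizing it with an equal-tailed credible interval, is mechanical. the snippet below is a sketch with invented toy draws, since the paper's actual posterior samples are not available here:

```python
def effective_R(alpha_samples, beta, gamma_samples):
    """posterior samples of re = alpha / (beta + gamma); valid while s(t) ~ n."""
    return [a / (beta + g) for a, g in zip(alpha_samples, gamma_samples)]

def credible_interval(samples, level=0.95):
    """equal-tailed credible interval taken from sorted posterior samples."""
    s = sorted(samples)
    lo = s[int((1 - level) / 2 * len(s))]
    hi = s[int((1 + level) / 2 * len(s)) - 1]
    return lo, hi

# toy posterior draws (placeholders, not the paper's mcmc output)
alpha_draws = [0.045, 0.050, 0.055]
gamma_draws = [4e-4, 5e-4, 6e-4]
re_draws = effective_R(alpha_draws, beta=0.1, gamma_samples=gamma_draws)
```

with these placeholder draws every r e value falls below 1, the regime the paper reports for the lockdown period.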
by may , the total number of infected cases (including recovered) therefore corresponds to a fraction of . % of the total french population. this value does not include the recovered cases before april 1st. many studies have focused on the estimation of the basic reproduction number r0 of the covid-19 epidemic, based on data-driven methods and mathematical models [e.g., ( , )] describing the epidemic from its beginning. on average, the estimated value of r0 was about . . we focused here on an observation period that began after the lockdown was set in france. we obtained an effective reproduction number that was divided by a factor , compared to the estimate of the r0 carried out in france at the early stage of the epidemic, before the country went into lockdown [a value r0 = . was obtained in ( )]. this indicates that the restriction policies were very efficient in decreasing the contact rate and therefore the number of infectious cases. in particular, the value r e = . is significantly below the threshold value 1, where the epidemic starts dying out. the decay in the number of infectious cases can also be observed in our simulations. it has to be noted that, although the number of infectious cases is a latent, or "unobserved," process, the mechanistic-statistical framework allowed us to estimate its value (figure ). the cumulated number of infected cases that we obtained by may (i + r) corresponds to a fraction of . % ( %-ci: . - . %) of the total french population, without taking into account the number of recovered individuals before april 1st, which is not known. based on a value r0 = . , the herd immunity threshold, corresponding to the minimum fraction of the population that must have immunity to stop the epidemic, would be 1 − 1/r0 ≈ % [a threshold of % was proposed in ( )]. this proportion will probably not be reached by may .
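the herd immunity threshold used above follows from the classical relation 1 − 1/r0; a one-line helper makes the dependence on r0 explicit:

```python
def herd_immunity_threshold(R0):
    """minimum immune fraction for the epidemic to recede: 1 - 1/r0."""
    return 1.0 - 1.0 / R0
```

for instance, a basic reproduction number around 3 implies that roughly two-thirds of the population would need immunity before the epidemic recedes on its own.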
as emphasized by angot ( ), too fast a relaxation of the lockdown-related restrictions (before herd immunity is reached or efficient prophylaxis is developed) would expose the population to an uncontrolled second wave of infection. in the worst-case scenario, the effective reproduction number r e would approach the initially estimated value of r0, and the second wave would start with about . · infectious individuals (in comparison with the few cases that initiated the first wave in france) and about · susceptible individuals. keeping a low value of r e is therefore crucial to avoid the saturation of hospital facilities. we deliberately chose a parsimonious mechanistic model with few parameters to avoid identifiability issues. possible extensions include stage-structured models, where the infectious class i and the contact rate α would depend on another variable: i = i(t, τ) and α = α(t, τ), with τ the time since infection, to take into account the dynamics of the viral load on the infectiousness. see, e.g., murray ( ) (chapter . ) for an introduction to such modeling approaches. another insightful extension would consist in using spatially-explicit models, e.g., reaction-diffusion models ( ), to describe the spatial spread of the epidemic and to estimate local values of the parameter r e and of the number of susceptible cases. although herd immunity is far from being reached at the country scale, it is likely that the fraction of immune individuals strongly varies over the territory, with possible local immunity effects [e.g., by april the proportion of people with confirmed sars-cov-2 infection based on antibody detection was % in a high school located in northern france ( )]. publicly available datasets were analyzed in this study. these data can be found here: https://www.gouvernement.fr/infocoronavirus/carte-et-donnees, https://geodes.santepubliquefrance.fr, and https://ourworldindata.org/coronavirus-testing.
author contributions: lr, ek, jp, as, and ss conceived the model and designed the statistical analysis. lr and ss wrote the paper. lr carried out the numerical computations. all authors reviewed the manuscript.

references:
- world health organization. who director-general's opening remarks at the media briefing on covid-19
- an interactive web-based dashboard to track covid-19 in real time
- the reproductive number of covid-19 is higher compared to sars coronavirus
- epidemiology of transmissible diseases after elimination
- the effect of human mobility and control measures on the covid-19 epidemic in china
- effective containment explains subexponential growth in recent confirmed covid-19 cases in china
- an investigation of transmission control measures during the first days of the covid-19 epidemic in china
- age profile of susceptibility, mixing, and social distancing shape the dynamics of the novel coronavirus disease 2019 outbreak in china
- understanding unreported cases in the 2019-ncov epidemic outbreak in wuhan, china, and the importance of major public health interventions
- the effect of control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan, china: a modelling study
- hierarchical bayesian models for predicting the spread of ecological processes
- modelling population dynamics in realistic landscapes with linear elements: a mechanistic-statistical reaction-diffusion approach
- dating and localizing an invasion from post-introduction data and a coupled reaction-diffusion-absorption model
- using early data to estimate the actual infection fatality ratio from covid-19 in france
- estimates of the severity of coronavirus disease 2019: a model-based analysis
- estimating the infection and case fatality ratio for coronavirus disease 2019 (covid-19) using age-adjusted data from the outbreak on the diamond princess cruise ship
- covid-19: point épidémiologique du avril
- résidents en établissements d'hébergement pour personnes âgées
- clinical course and risk factors for mortality of adult inpatients with covid-19 in wuhan, china: a retrospective cohort study
- temporal dynamics in viral shedding and transmissibility of covid-19
- early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia
- a covid-19 epidemic model with latency period
- analysis of transmission dynamics for zika virus on networks
- detection of sars-cov-2 in different types of clinical specimens
- preliminary estimation of the basic reproduction number of novel coronavirus: a data-driven analysis in the early phase of the outbreak
- impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand
- early estimations of the impact of general lockdown to control the covid-19 epidemic in france
- spatial ecology via reaction-diffusion equations
- cluster of covid-19 in northern france: a retrospective closed cohort study. medrxiv
- effect of a one-month lockdown on the epidemic dynamics of covid-19 in france

this manuscript has been released as a pre-print at medrxiv ( ). the supplementary material for this article can be found online at: https://www.frontiersin.org/articles/ . /fmed. . /full#supplementary-material

conflict of interest: the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

copyright © roques, klein, papaïx, sar and soubeyrand. this is an open-access article distributed under the terms of the creative commons attribution license (cc by). the use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. no use, distribution or reproduction is permitted which does not comply with these terms.

authors: meltzer, martin i.; gambhir, manoj; atkins, charisma y.; swerdlow, david l.
title: standardizing scenarios to assess the need to respond to an influenza pandemic
journal: clin infect dis

an outbreak of human infections with an avian influenza a(h7n9) virus was first reported in eastern china by the world health organization in april 2013 [ ]. this novel influenza virus was fatal in approximately one-third of the confirmed cases detected in the months following its initial identification [ ], and limited human-to-human h7n9 virus transmission could not be excluded in some chinese clusters of cases [ , ]. there was, and still is, the possibility that the virus would mutate to the point where there would be sustained human-to-human transmission. given that most of the human population has no prior immunity (either naturally acquired or vaccine-induced), such a strain presents the danger of starting an influenza pandemic. in response to such a threat, the joint modeling unit at the centers for disease control and prevention (cdc) was asked to conduct a rapid assessment of both the potential burden of unmitigated disease and the possible impacts of different mitigation measures. we were tasked to evaluate the following interventions: invasive mechanical ventilators, influenza antiviral drugs for treatment (but not large-scale prophylaxis), influenza vaccines, respiratory protective devices for healthcare workers and surgical face masks for patients, school closings to reduce transmission, and airport-based screening to identify those ill with a novel influenza virus entering the united states. this supplement presents reports on the methods and estimates for the first listed interventions, and in this introduction we outline the general approach and standardized epidemiological assumptions used in all the articles.
given that there had not yet been (and subsequently has not been to date) a pandemic caused by the h7n9 virus, there are no relevant large-population data concerning transmission and clinical impacts of h7n9. we therefore had to consider the potential impacts of disease and interventions for a not fully defined pandemic (ie, a pandemic caused by a generic influenza strain hxny). thus, any model that we built had to allow for a wide range in virus transmissibility and resulting clinical impact. the models also had to fully consider a range of effectiveness of interventions; for example, influenza antiviral drugs could be less effective against the next influenza strain causing a pandemic. given these uncertainties, and the need for a rapid assessment of a large number of factors, the models produced had to meet a number of specifications: they had to be produced in a manner that would allow the models to be easily transferred to other units in government and to public health officials, and subsequently used by people who did not build them; they had to provide easy identification of all input variables, their values, and the ability to rapidly change those values; they had to be easily stored and resurrected for future use and reference at some unspecified time in the future; and the results from each model had to be readily comparable to each other. in response to these specifications, we decided to require that each model be built in a spreadsheet format, and that we would essentially have one model for each intervention considered. meeting these specifications had the added value of producing models that readily fit into the existing cdc emergency operations response structure. in this structure, groups called task forces are formed to focus on particular aspects of a response to a public health emergency.
for example, for an influenza pandemic response, there are usually task forces that focus on vaccines (eg, recommendations regarding prioritization of vaccine supplies, issues related to distribution), medical countermeasures (eg, recommendations regarding use of drugs for treatment and prophylaxis, use of personal protective equipment such as face masks), and nonpharmaceutical interventions (eg, recommendations regarding school closures, border security, and screening). to allow easy comparison between results (a specification), we standardized a risk space defined by using ranges of transmission and clinical severity from a previously published influenza severity assessment framework (figure ) [ ]. the framework can be used to plot, and compare to historical data, the relative severity of an influenza pandemic (or a nonpandemic influenza season). the framework uses two scales: a scale of clinical severity and a scale of transmissibility. the severity scale has a number of components, including the case-fatality ratio and the case-to-hospitalization ratio (table ) [ ]. the transmissibility scale is assessed by considering factors such as the clinical (symptomatic) attack rate in various locales, such as school, community, and workplace (table ) [ ]. we defined and chose a risk space with a transmission scale that runs from approximately a scale of (eg, comparable to a community attack rate of %- %) to a scale of (community attack rate of > %) (figure , table ). our defined risk space has a low-end clinical severity scale of , with a case-fatality ratio of . %- . % and a death-to-hospitalization ratio of %- % (table ). the upper range of severity in our risk space was defined as a scale of , with a case-fatality rate of . %- . % and a death-to-hospitalization ratio of %- % (table ). note that the defined risk space encloses the and pandemics (figure ). it is essential to note that this chosen risk space is illustrative, not definitive.
until there are data defining the epidemiological elements of the next pandemic, such as rate of transmission, and case-fatality rate, other risk spaces could be chosen for planning purposes. the models presented in this collection, built to the specifications listed here, allow for rapid alterations in input values. the size and shape of the epidemic curve could impact the effectiveness of interventions. for example, the impact of influenza vaccines depends upon the start of deliveries of large amounts of vaccine compared to the timing of the pandemic peak. thus, we included in the standardized epidemiological scenario epidemic curves, produced using a simple simulation model (see below). we configured the model using clinical attack rates of approximately % and %. these clinical attack rates represent the aggregated attack rate across the entire us population. within the population, subpopulations will typically experience different attack rates (eg, children will experience a higher attack rate than adults - years old-see description later in paper). furthermore, for each attack rate, we assumed starting (seeding) scenarios. we used one scenario in which the pandemic started with the arrival of infectious cases and the other when the pandemic started with infectious cases ( figure ) . to model the epidemic curves, we built a simple, nonprobabilistic (ie, deterministic) model that simulates the spread of influenza through a population by moving the population into groups of susceptible, exposed, infectious, and recovered or death ( table provides values used). we divided the population into age groups ( - , - , - , or ≥ years of age). we table . see main text for additional details. note that the - , - , and - seasons were nonpandemic seasons. they are included to provide reference points regarding the impact of nonpandemic seasons. adapted from reed et al [ ] . 
modeled the probabilities of daily contact (and thus risk of disease transmission) by constructing a contact matrix using data from the united kingdom (see table a in technical appendix a). we thus produced notably different epidemic curves (figure). for example, the two % attack rate scenarios peak in weeks and , whereas the % attack rate scenarios peak in weeks and (figure). the clinical attack rates by age group are presented in table . obviously, the largest numbers of cases occur in the largest age group, the -to -year-olds; however, children in both the - and - age groups have the highest attack rates, indicating a potentially greater degree of vulnerability (table). [table caption: for the case-fatality ratio and case-to-hospitalization ratio, scale indicates low severity and scale indicates high severity (in bold). source: adapted from reed et al [ ]. these estimates relate to the framework for assessing the impact of influenza pandemics, shown in figure.] [figure caption: seeding refers to the number of infectious cases, either or , that arrive near-simultaneously in the united states to start the pandemic.] perhaps one of the greatest strengths of the simple models presented in this collection of articles is that they highlight what is and is not known about the burden of disease and the potential impact of a planned intervention. to find the weaknesses of what is currently known, a reader need only consult table in each article. these tables list inputs, their assumed values, and data sources. an example of an important unknown is as follows: when estimating the number of respiratory protection devices (eg, face and surgical masks) needed by first responders (police officers, firefighters, emergency medical technicians), one could assume that first responders will need mask per person whom they encounter with influenza-like illness. the problem is that there are no readily available data reporting such a measurement [ ].
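The deterministic model just outlined, moving an age-structured population through susceptible, exposed, infectious, and recovered-or-dead groups with mixing governed by a contact matrix, can be sketched as follows. All numbers here (the 3x3 contact matrix, the rates, populations, and seeds) are illustrative placeholders, not the values used in the original model.

```python
import numpy as np

def run_seir(contact, beta, sigma, gamma, pop, seed, days):
    """Daily-step deterministic age-structured SEIR sketch.

    contact[i, j]: mean daily contacts an i-individual has with j-individuals.
    beta: per-contact transmission probability; sigma: rate of leaving the
    exposed state; gamma: removal rate. All values are illustrative.
    Returns the daily curve of new infections and the final removed counts.
    """
    pop = np.asarray(pop, dtype=float)
    S = pop - np.asarray(seed, dtype=float)
    E = np.zeros_like(pop)
    I = np.asarray(seed, dtype=float)
    R = np.zeros_like(pop)
    curve = []
    for _ in range(days):
        lam = beta * contact.dot(I / pop)     # force of infection per susceptible
        new_exposed = np.minimum(lam * S, S)  # cannot infect more than remain
        new_infectious = sigma * E
        new_removed = gamma * I
        S = S - new_exposed
        E = E + new_exposed - new_infectious
        I = I + new_infectious - new_removed
        R = R + new_removed
        curve.append(new_exposed.sum())
    return np.array(curve), R

contact = np.array([[8.0, 3.0, 1.0],
                    [3.0, 6.0, 2.0],
                    [1.0, 2.0, 4.0]])         # illustrative mixing matrix
pop = [15e6, 35e6, 15e6]                      # three illustrative age groups
curve, removed = run_seir(contact, beta=0.03, sigma=0.25, gamma=0.2,
                          pop=pop, seed=[10.0, 10.0, 10.0], days=365)
attack_rate = removed.sum() / sum(pop)        # aggregate attack-rate proxy
```

Raising beta shifts the peak earlier and raises the attack rate, which is the kind of scenario difference the standardized curves are meant to expose.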
similarly, when considering the potential use and impact of influenza antiviral drugs, o'hagan et al had to assume that existing influenza antiviral drugs would have the same level of effectiveness against the strain causing the next influenza pandemic as they do against existing influenza strains [ ]. despite these limitations, these simple models make it fairly straightforward to rapidly assess the relative importance of each of the input variables. one assumption that may not be readily appreciated is the impact of the shape of the standardized epidemiological curves used in all the models (figure). previous influenza pandemics have produced different shapes of deaths over time (figure). such differences in deaths over time can greatly influence the success of some of the interventions. for example, when considering the number of mechanical ventilators needed at the peak of the pandemic, meltzer et al initially assumed that the peak demand for ventilators would equal approximately % of all patients needing mechanical ventilation [ ]. however, in the % attack rate epidemiological curve (figure), the number of cases that occur in the peak days is approximately % of all cases. [figure caption: note that the data for were recorded once every weeks, whereas all other plots used weekly data. see technical appendix b for further details.] thus, the authors of the ventilator study conducted a sensitivity analysis by changing from % to % the assumed proportion of mechanically ventilated patients occurring at the peak of a pandemic. the articles in this supplement also incorporate other important implicit assumptions. one of the more important is that each article essentially assumes that the healthcare system can absorb and/or successfully execute any of the interventions so modeled. for example, biggerstaff et al provide some estimates of the impact of influenza vaccination in which it was assumed that million persons could be vaccinated each week [ ].
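The sensitivity just described, what share of all cases falls within the peak days of a curve, is easy to compute directly. The helper below is an illustrative sketch, not code from the ventilator study, and the toy curve is invented.

```python
def peak_window_share(daily_cases, window):
    """Return the fraction of all cases that occur in the `window`
    consecutive days with the highest combined total."""
    total = sum(daily_cases)
    best = max(sum(daily_cases[i:i + window])
               for i in range(len(daily_cases) - window + 1))
    return best / total

# a symmetric triangular toy curve: the 3 peak days hold 10/16 of all cases
share = peak_window_share([1, 2, 3, 4, 3, 2, 1], window=3)
```

Running an assumed peak-share figure through a check like this, for each candidate epidemic curve, is exactly the kind of sensitivity analysis described above.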
the us private and public health systems, collectively or separately, have never previously achieved such a rate (though the authors clearly demonstrate that achieving such a rate would have very positive public health outcomes). furthermore, the successful deployment and ultimate impact of each intervention is likely to vary widely. schools can close for different lengths of time, antiviral drug prescription and distribution may not be equally efficient in all areas, and healthcare workers and patients may have different levels of compliance in wearing protective gear. finally, readers will note that there are no reports in this collection that consider the simultaneous deployment of ≥ interventions. it is realistic to assume that, during the next influenza pandemic, public health officials, healthcare providers, and other policy makers are likely to enact several interventions at once (eg, close schools, start dispensing antiviral medications, recommend use of personal protective gear). the problem is that such multi-intervention models become very scenario specific. for example, different locales are likely to face different unmitigated epidemic curves (figure). thus, researchers who estimate the potential impact of combining several interventions at once must greatly increase the number of assumptions they make. this makes it more difficult both to generalize the results and to rapidly understand which assumptions are relatively more important. despite these limitations, we believe that the benefits of using these models outweigh the limitations. this assessment is based on our experience of using the models and results produced to help public health leadership reassess us influenza pandemic planning and preparedness. in the response to the h n threat, the most important outcome from policy makers seeing the results from these models was the intense debate concerning the inputs and assumptions.
we thus believe that the methodology used here to develop and guide the building of the models in this collection, and the subsequent interpretation and use of the results, can be a useful part of future public health responses.

supplement sponsorship. this article appears as part of the supplement titled "cdc modeling efforts in response to a potential public health emergency: influenza a(h n ) as an example," sponsored by the cdc.

potential conflicts of interest. all authors: no reported conflicts. all authors have submitted the icmje form for disclosure of potential conflicts of interest. conflicts that the editors consider relevant to the content of the manuscript have been disclosed.

standardized epidemiological curves - contact matrix: to model the epidemic curves (figure), we built a simple, nonprobabilistic (ie, deterministic) model in which we divided the population into age groups ( - , - , - , ≥ years). to measure the risk of contact and possible onward spread between and within each age group, we used the contact matrix shown in table a. for the contact matrix we used, in the absence of relevant data from the united states, data from the united kingdom [ ], collected as part of the polymod study, which gathered contact data from approximately persons living in european countries [ ]. because the uk data are split into -year age groups, we had to aggregate the data into the age groups used in our model. during this aggregation, we ensured that the total number of contacts between any age groups is "equal in any direction" (eg, the number of contacts between - years and - years is the same as those between - years and - years). we used, for this aggregation process, the age distribution of the us population (www.censusscope.org). as noted above, the uk contact data [ ] are split into -year age groups, which we had to aggregate into the age groups used in our models.
furthermore, the matrix that we constructed had to meet the condition of being symmetrical; that is, the number of contacts from age group a to age group b should equal the number of contacts in the reverse direction. we begin the explanation of how we built our contact matrix by introducing some notation. the mixing-matrix elements of the published matrix [ ] are denoted by u_ij, i, j = , . . . , m, where i and j refer to rows and columns, respectively, and m is the number of age groups in the mixing matrix. as the mixing matrix required has fewer age groups than the published polymod matrix [ ], indexed by f, g = , . . . , n, we let age group f contain the narrower age groups i = l(f) to u(f). the contact rate between someone in narrower group i and another in group g is denoted d_ig. we then proceed according to the following steps:
• if the us population distribution is such that the population in age group i is n_i, we can calculate the population-weighted means of the elements d, to obtain contact rates between groups f and g. for f = g, this calculation is simple.
• for the elements that are off the diagonal, the calculation becomes more complicated, because we need to sum up the correct number of contacts made between each age group. the total reported rates of contact from f to g and from g to f are denoted y_fg and y_gf.
• theoretically, these values should be equal to one another; however, they differ when calculated from actual reported contact rates (from self-administered surveys such as those conducted by mossong et al [ ]), and so, to ensure that they are equal, we average them before calculating the final mixing-matrix elements e_fg and e_gf. here, e_fg is the rate at which an individual in age group f makes contact with anyone in age group g, per unit time, as reported in the original data (ie, per day for the original mossong et al [ ] data).
an example of this procedure is given below, following the steps outlined theoretically above. we begin with the "all contacts" (ie, both conversational and physical) matrix for great britain from the polymod study. the elements of this matrix denote the daily number of contacts between an individual in one -year age group and those in another -year age group. element ( , ), for example, is the daily number of contacts a person aged - years has with someone aged - years. the full matrix is as follows: [matrix omitted]. summing the columns of the -year group matrix according to the desired group widths (eg, the first columns are summed to give a -year age group column) gives the following intermediate -group by -group matrix: [matrix omitted]. next, we obtain a vector whose elements are the numbers of individuals in each of the age groups of the original matrix (here, -year width groups, taken from the great britain census; the age distribution should correspond closely with the distribution that held at the time the contact survey was performed), and we sum the total number of contacts to produce an aggregated age group (ie, two -year age groups are aggregated into one -year age group), giving elements of the total contact matrix. note that the numbers in the above matrix in these positions differ slightly from those in the y_fg, y_gf calculations outlined above; this is because the daily polymod contacts were rounded for the illustrated calculations. once we have completed this procedure for the whole aggregated total contact matrix, we need to divide the total contact numbers by the correct number of individuals in each age group, to ensure that we end up with a matrix giving the number of contacts per person per day in the relevant age group.
for example, the ( , ) element of the final matrix is the ( , ) element of the matrix produced by step ( ) divided by the total number of individuals in the first age group (ie, the sum of the individuals in the first two -year age groups, + = ); the ( , ) element is divided by the same number, whereas the ( , ) element is divided by the number in the second age group, + = . dividing through gives the final matrix below (which is similar to table a, accounting for rounding in the illustrative calculation).

to model the curves shown in figure , we used the estimated number of deaths from previous pandemic seasons ( , , and ). we compared those to the estimated clinical cases from the epidemiological model built for this exercise, using attack rates of both % and % (ie, the curves shown in figure , main text). all deaths were based on the clinical data reported during the specific pandemic season. however, in an effort to obtain current death estimates, we extrapolated the seasonal case values (either clinical data or number of deaths) into current-year us cases at a total population of million.
• influenza pandemic: we obtained from the source [ ] the weekly number of deaths in ( per people) for the reported us geographic locations (west, east, and midwest/south). we then adjusted those numbers of deaths, per , to the approximate current us population of million persons (ie, multiplied each data point by ). this gave us the equivalent number of deaths for the us population.
• influenza pandemic: we obtained the total, all-ages, biweekly (ie, reported every weeks) number of respiratory illnesses per from figure in the report of the cdc (then known as the communicable disease center) [ ]. we then adjusted those numbers of cases to the approximate current us population of million persons (ie, multiplied each data point by ). this gave us the equivalent number of cases for the us population.
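The whole aggregation procedure just illustrated, converting per-capita rates to total contacts, summing fine bands into coarse groups, averaging the two directions to enforce reciprocity, and dividing by group populations, can be sketched as follows. The small matrix and populations are invented for illustration; they are not the POLYMOD figures.

```python
import numpy as np

def aggregate_contact_matrix(u, pop, groups):
    """Aggregate a per-capita contact matrix into coarser age groups.

    u[i, j]: daily contacts a person in fine band i reports with band j.
    pop[i]: population of fine band i. groups: list of lists of fine-band
    indices making up each coarse group. Reciprocity is enforced by
    averaging the total contacts in the two directions, as in the text.
    """
    total = pop[:, None] * u                 # total daily contacts, band to band
    n = len(groups)
    agg = np.zeros((n, n))
    for f, rows in enumerate(groups):
        for g, cols in enumerate(groups):
            agg[f, g] = total[np.ix_(rows, cols)].sum()
    agg = 0.5 * (agg + agg.T)                # average y_fg and y_gf
    group_pop = np.array([pop[rows].sum() for rows in groups])
    return agg / group_pop[:, None]          # back to contacts per person per day

u = np.array([[2.0, 1.0, 0.5],
              [1.5, 3.0, 1.0],
              [0.2, 0.8, 2.0]])              # illustrative fine-band matrix
pop = np.array([100.0, 200.0, 300.0])
e = aggregate_contact_matrix(u, pop, groups=[[0, 1], [2]])
```

After the averaging step, the total number of contacts implied in each direction is identical by construction, which is the symmetry condition the appendix insists on.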
to obtain estimates of deaths in the equivalent us population, we multiplied the estimates of cases by a case-fatality ratio of . (ie, . % of all cases result in death). this case-fatality estimate was taken from table in the main text [ ].
• influenza pandemic: we obtained the weekly reported number of pneumonia-influenza deaths in us cities from figure in sharrar et al [ ]. however, the total number of deaths recorded by sharrar et al was only , which is notably lower than what might be expected. we therefore used a multiplier of . to adjust their estimates upward. we constructed this multiplier by noting that meltzer et al's figure [ ] showed approximately deaths for a -type influenza pandemic occurring in the us population (ie, / = . ).
• attack rates: we took the curves plotting the % and % clinical case attack rates shown in figure of the main text (the plots assuming infectious persons start, or "seed," the pandemic in the united states). we then used a case-fatality rate of . (ie, . % of all cases result in death), taken from table in the main text [ ]. for simplicity, we assumed a low severity (scale of ) of . % for both attack rates to generate the number of deaths.
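The per-capita extrapolation used in these bullets (scale historical counts to a current population of a given size, optionally converting cases to deaths with a case-fatality ratio) amounts to a one-line calculation. The numbers below are made up to show the mechanics; they are not the source data.

```python
def rescale_to_population(weekly_counts, historical_pop, current_pop,
                          case_fatality=None):
    """Scale weekly historical counts to an equivalent current population.

    If case_fatality is given, the counts are treated as cases and
    converted to deaths. Purely illustrative inputs.
    """
    factor = current_pop / historical_pop
    scaled = [c * factor for c in weekly_counts]
    if case_fatality is not None:
        scaled = [c * case_fatality for c in scaled]
    return scaled

# 10 and 20 weekly deaths in a population of 100 million, restated for 330 million
deaths_now = rescale_to_population([10, 20], 100e6, 330e6)
```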
references:
• human infection with influenza a(h n ) virus in china
• human infection with avian influenza a (h n ) virus-update
• epidemiology of human infections with avian influenza a(h n ) virus in china
• probable person to person transmission of novel avian influenza a (h n ) virus in eastern china, : epidemiological investigation
• novel framework for assessing epidemiologic effects of influenza epidemics and pandemics
• potential demand for respirators and surgical masks during a hypothetical influenza pandemic in the united states
• estimating the united states demand for influenza antivirals and the effect on severe influenza disease during a potential pandemic
• estimates of the demand for mechanical ventilation in the united states during an influenza pandemic
• estimating the potential effects of a vaccine program against an emerging influenza pandemic-united states
• social contacts and mixing patterns relevant to the spread of infectious diseases
• nonpharmaceutical interventions implemented by us cities during the - influenza pandemic
• the epidemiology of asian influenza. - . a descriptive brochure
• national influenza experience in the
• the economic impact of pandemic influenza in the united states: priorities for intervention

for example, to construct the matrix element pertaining to the total number of contacts between the - -year age group and the - -year age group (ie, itself), we perform the following sum: ( × . ) + ( × . ) = . this is the first diagonal element of the "total contacts" matrix and, again, it represents the total number of contacts made per day between those in the - -year age group. because a diagonal element is of course the same as its own transposed counterpart, there is no problem. however, corresponding pairs of off-diagonal totals should be the same; that is, the total number of contacts between those in the - -year and - -year groups should be the same as the total number between those in the - -year and - -year age groups. y_fg = n × d + n × d = ( × . ) + ( × . ).
y_gf = n × d + n × d = ( × . ) + ( × . ). these total contact numbers are not the same, and so we take the average of them; they become the ( , ) and ( , ) elements of the final mixing matrix.

key: cord- - fjya wn
authors: rogers, l c g
title: ending the covid- epidemic in the united kingdom
date: - -
journal: nan
doi: nan
sha:
doc_id:
cord_uid: fjya wn

social distancing and lockdown are the two main non-pharmaceutical interventions being used by the uk government to contain and control the covid- epidemic; these are being applied uniformly across the entire country, even though the results of the imperial college report by ferguson et al show that the impact of the infection increases sharply with age. this paper develops a variant of the workhorse sir model for epidemics, in which the population is classified into a number of age groups. this allows us to understand the effects of age-dependent controls on the epidemic, and to explore possible exit strategies. the global covid- pandemic has swept through the nations of the world with frightening speed, and left governments struggling to cope with the situation. the initial responses have been directed towards limiting the death toll and ensuring that health services are not completely overwhelmed, as would be only too possible with an infection that can grow by a factor of one thousand in a month. as there is as yet no vaccine, no effective medication, and very imperfect understanding of the parameters of the epidemic, efforts have been directed towards containment, with decisions about return to normality being left until later. without a vaccine or effective medical treatment, the only remaining strategies would appear to be either a policy of contact tracing and quarantining, or developing herd immunity. the first of these policies appears to have been applied successfully in south korea and singapore, and is generally regarded as the first line of public health defence.
in the current pandemic, most countries have quickly found themselves overwhelmed by the scale and speed of the outbreak, and have been unable to apply contact tracing as rigorously and universally as is needed for the method to work. when it does work, contact tracing and quarantine will allow an outbreak to be snuffed out before it spreads widely, but it will of course leave a large population of susceptibles open to a new infection, so continuing vigilance is essential. as we have seen contact tracing overwhelmed, the goal of this paper is to explore the route to herd immunity, using age-dependent release from lockdown and a gradual relaxation of social distancing rules. in section we present the model, which is in almost all respects a straightforward variant of the standard sir epidemic model. the equations contain terms for the controls which are available to modify the dynamics of the epidemic. the problem is a control problem, and for this we have to define the objective, which we do in section . the issue is of course that we have a conflict between the obvious cost of the numbers of citizens whose lives are ended prematurely, which is a concern for the next few months, and the damage that an extended lockdown will do to the economy, which will be a concern for many years if the aftermath of the financial crash is any guide. in setting up the cost structure, some relatively arbitrary (but hopefully reasonably realistic) assumptions have to be made; these are not in any way essential to the approach, and can easily be changed by any reader prepared to play with the jupyter notebook posted online (https://colab.research.google.com/drive/ tbb usgia wehy-hviygdo mpnzu a). parameter values, or even the entire form of the costs, can be changed by anyone with a little knowledge of python. experts in health economics would doubtless be able to suggest values that better embody current thinking, and before any of the results of this paper can be relied on, such inputs will be necessary. in section we briefly discuss the data sources used, and in section we present the results of computation in various scenarios.

a simple sir epidemic model is too crude to allow us to model and control the key features of the covid- epidemic; many infected individuals are asymptomatic, and the impact of the infection on different age groups is very different. so we will break down the population into j age groups, and let a_j(t), i_j(t), s_j(t) denote the numbers of j-individuals at time t who are (respectively) asymptomatic infected, symptomatic infected, and susceptible. we will denote by n_j(t) the total number of j-individuals in the population at time t, and allow this to change gradually with the influx of new births and visitors from other countries; this is to model the possibility that new infecteds come in from outside and reignite the epidemic. the most basic form of the evolution is governed by the differential equations ( )-( ), where ι_j and σ_j are known functions of time representing the arrival of new asymptomatic infectives and susceptibles respectively, and the final term on the right-hand side of ( ) allows for the possibility that removed infectives may not in fact be immune, and some may return to the population ready for reinfection. the parameter p ∈ ( , ) appearing in ( ), ( ) is the probability that a susceptible becoming infected is symptomatic, and the parameter ρ > is the recovery rate. the infection rates λ_j(t) are explicit non-linear functions of the state of the system that will be discussed shortly, but, aside from the terms involving λ, the evolution is linear. so if we stack the variables into a single vector z, the evolution ( )-( ) can be written as ż = (m + Λ)z + η, where m is a j × j constant matrix, Λ is a simple matrix with the λ_j in the appropriate entries, and η is the vector of driving terms.
[remark. the model ( )-( ) is the fluid limit of a markov chain model in which ρ is the rate at which an individual jumps from an infected state to the removed state; the implicit (markovian) assumption is therefore that the time spent in the infective state is exponentially distributed. this assumption does not fit well with observation, so we can allow for different distributions by the familiar trick of the method of stages (see, for example, [ ]), in which an infected individual passes through a number of exponentially-distributed stages. in more detail, we can suppose that there are k_x stages for the symptomatic infection, and that i^k_j(t) is the number of symptomatic j-individuals at stage k of the infection at time t, j = , . . . , j, k = , . . . , k_x. making this change, the equation ( ) becomes a system in which the duration of symptomatic infection is a sum of k_x independent exponentials, each with mean /ρk_x; this has the same mean as an exponential of rate ρ but smaller variance. we could similarly decompose the asymptomatic infections, and indeed by further ramifications of the method of stages we could make the distribution of infected time approximate any desired distribution. there is a good reason not to take this too far, however; in the numerics, the differential equation has to be solved many times. it is remarkable that this can be done in a reasonable amount of time, but the more complicated the model, the slower this step becomes, and ultimately the computation will be too slow. however we do this, when we stack all the variables into a big state vector z, the evolution still has the form ( ), and the appropriate form of this is coded into the jupyter notebooks.] each individual spends part of the waking day at home, and part of the waking day outside.
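A quick numerical check of the method of stages described in the remark: passing a cohort through k_x sequential exponential stages, each left at rate ρ·k_x, keeps the mean infectious period at 1/ρ while shrinking its variance. The sketch below integrates the staged flow with a crude Euler scheme; the step size and rates are illustrative choices, not the paper's.

```python
import numpy as np

def mean_removal_time(rho, k_stages, dt=0.01, t_max=80.0):
    """Estimate the mean time to removal for an infected cohort that moves
    through k_stages exponential stages, each left at rate rho * k_stages,
    so the total duration is Erlang with mean 1/rho. Euler integration;
    purely a sanity-check sketch."""
    stages = np.zeros(k_stages)
    stages[0] = 1.0                    # unit cohort enters stage 1 at t = 0
    rate = rho * k_stages
    t = 0.0
    removed_mass = 0.0
    weighted_time = 0.0
    while t < t_max:
        outflow = rate * stages * dt   # mass leaving each stage this step
        removal = outflow[-1]          # mass leaving the final stage = removal
        weighted_time += t * removal
        removed_mass += removal
        stages -= outflow
        stages[1:] += outflow[:-1]     # flow into the next stage
        t += dt
    return weighted_time / removed_mass

one_stage = mean_removal_time(rho=0.2, k_stages=1)    # plain exponential
three_stage = mean_removal_time(rho=0.2, k_stages=3)  # Erlang, same mean
```

Both estimates come out close to 1/ρ = 5 days, confirming that adding stages changes the shape of the infectious-period distribution without moving its mean.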
we shall denote by m^o_ij the mean number of contacts that an i-individual has per day with j-individuals when outside the home, and by m^h_ij the mean number of contacts that an i-individual has per day with j-individuals when inside the home. it is important to understand that m^o_ij is the mean number of contacts that an i-individual has with j-individuals if everyone spends their entire waking day outside the home. if the i-individual spends a fraction ϕ_i of the waking day outside the home, and j-individuals spend a fraction ϕ_j of the waking day outside the home, then the mean number of contacts per day which an i-individual has with j-individuals outside the home will be ϕ_i m^o_ij ϕ_j. each time an infected person has contact with someone, infection will be transmitted with probability β, though of course this will only result in a change if the person contacted was susceptible. thus the overall rate at which infection is passed in the outside world to j-individuals is given by ( ), where ϕ_i(t) is the fraction of time spent in the outside world by i-individuals, and δ ∈ [ , ] is the proportion of symptomatic infecteds who go into the outside world. in an ideal situation, this would be close to zero, but many people with the infection get only mild symptoms and may not self-isolate. the number of j-individuals at time t is n_j(t), so the factor s_j(t)/n_j(t) on the right-hand side of ( ) is the probability that a contacted j-individual is susceptible. this may have the appearance of a conventional extension of an sir model, but one point to flag straight away is that the controls ϕ enter quadratically in the expression for the infectivity, whereas some authors use only a linear dependence; this is erroneous. what happens in the home is rather more difficult to deal with.
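Before turning to the household term, the quadratic role of the controls can be made concrete with a sketch of the outside-world infection rate just described, with ϕ appearing once for the infective and once for the contacted group. The symbols follow the text; the numbers in the example are arbitrary.

```python
import numpy as np

def outside_infection_rate(phi, m_out, beta, delta, A, I, S, N):
    """New outside-the-home infections per day for each age group j:
    beta * (S_j/N_j) * phi_j * sum_i (A_i + delta*I_i) * phi_i * m_out[i, j].
    A sketch of the rate described in the text, not the paper's code."""
    phi = np.asarray(phi, dtype=float)
    infectives_out = (np.asarray(A) + delta * np.asarray(I)) * phi
    lam = beta * phi * infectives_out.dot(m_out)   # phi enters twice: quadratic
    return lam * np.asarray(S) / np.asarray(N)

m_out = np.array([[6.0, 2.0], [2.0, 4.0]])         # illustrative contact matrix
A, I = np.array([100.0, 50.0]), np.array([40.0, 30.0])
S, N = np.array([0.9e6, 1.1e6]), np.array([1e6, 1.2e6])
full = outside_infection_rate([1.0, 1.0], m_out, 0.05, 0.3, A, I, S, N)
half = outside_infection_rate([0.5, 0.5], m_out, 0.05, 0.3, A, I, S, N)
```

Halving everyone's time outside quarters the transmission rate; modelling the dependence on ϕ as linear would understate the effect of lockdown, which is the error flagged above.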
we could simply take ( ) and change superscript o to h, and ϕ to − ϕ, but this would be incorrect: an infected individual outside may go through the day and infect a large number of people, but within the home there are relatively few who could be infected, so the scope to spread infection is much reduced; this is, after all, the rationale for locking down populations. without the constraint on the number of infections imposed by the household size, a single infected i-individual in the home would be firing infection transmissions at j-individuals at rate γ_ij(t); thus if τ is the mean infective time, during the period of infectivity each infected i-individual in the home will fire a poisson(γ_ij(t)τ) number of infections towards j-individuals, and therefore will fire a possible z ∼ poisson(γ_i(t)τ) number of infections towards all others, where γ_i(t) = Σ_j γ_ij(t). however, the number of infections that can strike another individual cannot exceed n − , where n is the size of the household in which the infected i-individual lives. data from the office of national statistics allow us to deduce the distribution of n. the mean number of individuals at whom the infected i-individual fires infections is then µ_i(t) = e[min{z, n − }]. this is the mean number of infections the infected i-individual could fire at others during a period of mean length τ, so we will simply suppose that while infected the i-individual in the home will be firing infections at rate µ_i(t)/τ. an infection fired at another will be supposed to strike a j-individual with probability p^h_ij(t) proportional to m^h_ij( − ϕ_j(t)); and given that it strikes a j-individual, the probability that a new infection results will be s_j(t)/n_j(t). thus the analogue of ( ) for new infections of j-individuals in the home can be written down, and we combine the outside and home terms to give, finally, the overall infection rates λ_j(t). [remark.
these assumptions represent a compromise; any honest treatment of what goes on within households would appear to require a decomposition of the population into groups according to different household compositions by age, meaning that the size of the state space gets out of control, which would render the calculation impractical.] it is worth emphasizing that there are just four controlling parameters in this model: β, the probability that a contact results in a transmission; p, the probability that an infected person is symptomatic; ρ, the reciprocal of the mean infective time; and ε, the probability that a removed infective is still susceptible. other values which are needed for the calculations, such as the mean numbers m^o_ij, m^h_ij of contacts, can be found from published estimates. there are three components to the cost: the cost of lockdown, the cost of social distancing, and the cost of deaths. we take them in turn. there will be a normal level φ̄_j for the proportion of time spent by a j-individual outside the home; for the purposes of the computations, the assumption here is that, of the waking hours of the week, some are spent in school or work, some in social activities, and the rest at home, making φ̄_j equal to / for all age groups. if a j-individual is locked down at level ϕ_j(s) at time s, we propose that the cost by time t should be proportional to ( ); for constant ϕ_j, this will be convex in t, which seems realistic: a short lockdown (as for a public holiday) causes little damage, but as the time away from regular work stretches on, the damage suffered increases more rapidly, as businesses collapse and workers are made redundant. we will consider strategies where, for some < u < v (which may depend on j), ϕ_j(t) is held at ϕ_j( ), the initial level of lockdown applied, until time u; at time u, this starts to be relaxed in a linear fashion, being fully relaxed by time v. integrating ( ) up to v gives the cost of a j-individual being locked down, for some constant c.
if we think that the social cost of an individual being locked down for one year is sc , then the constant of proportionality in ( ) is fixed accordingly. this then has to be summed over all the members of the population, with a small reduction for retired people, who would presumably impact the economy less if they were prevented from going out. in the numerical implementation, we fix v = u + ; this reduces the number of free parameters, and in any case reflects the realistic situation that once an age group is freed from lockdown restrictions they will quite quickly get back to normal activity. social distancing imposes costs; public transport will have to run at reduced capacity, as will restaurants and theatres. but these costs are steady ongoing frictions which do not keep people away from work for months on end. if the social distancing policy means that at time t the number of contacts outside the home is reduced to a fraction sd(t) ∈ ( , ) of the normal situation, then we propose that the cost of this policy by time t would be proportional to the corresponding integral. the form of sd is free to choose, and in the computations we suppose that sd rises from the initial value sd to its final value in a piecewise-linear fashion through n_sd stages. this allows for the possibility that social distancing could be gradually relaxed by opening more and more classes of business or public assembly. thus at some time u , sd starts to rise to the first staged value sd at time v , where it remains until u ; from there it rises to the next staged value sd at time v , and so on. we suppose that the levels of the stages are equally spaced, but this can easily be altered.
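The release profiles described above, lockdown held at its initial level until time u and relaxed linearly to the normal level by v, and social distancing relaxed through piecewise-linear stages, can be sketched as plain functions of time. The times and levels used below are placeholders, not the paper's values.

```python
def lockdown_schedule(t, phi0, phi_bar, u, v):
    """phi_j(t): held at phi0 until u, then linear up to phi_bar by v."""
    if t <= u:
        return phi0
    if t >= v:
        return phi_bar
    return phi0 + (phi_bar - phi0) * (t - u) / (v - u)

def staged_schedule(t, levels, ramps):
    """Piecewise-linear staged relaxation: hold levels[k] until ramps[k]
    starts, then rise linearly to levels[k + 1]. ramps is a list of
    (u_k, v_k) pairs, one per transition. Sketch of the staged sd(t)."""
    for (u, v), (lo, hi) in zip(ramps, zip(levels, levels[1:])):
        if t <= u:
            return lo
        if t < v:
            return lo + (hi - lo) * (t - u) / (v - u)
    return levels[-1]
```

With levels [0.25, 0.5, 1.0] and ramps [(30, 40), (60, 70)], social distancing is held, stepped up, held again, and fully released, which is the staged shape the text describes.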
if there is just one stage, the policy starts at some value sd and at time u starts to rise linearly, reaching its final value at a later time v; the overall cost then follows by integration. by considering the effect of social distancing for a year, we fix the constant to give a cost involving θ_sd ∈ ( , ), which expresses the pain of social distancing relative to lockdown. in the calculations, we will allow the profile of social distancing to be a more general piecewise-linear continuous function, permitting social distancing to be relaxed in stages and held at intermediate values. making an estimate of the cost of the death of an individual is ethically and procedurally quite a vexed issue. for the purposes of the calculations reported in this paper, and as default values used in the jupyter notebook, the assumption is that the cost of the death of an individual is proportional to the expected number of further years that they would have lived, and that the constant of proportionality is of the same order as sc , the cost of an individual being locked down for one year. so the code has a parameter deathfactor which is used to scale sc for the calculations. this is only part of the story, however. we need to calculate the number of deaths which will result from any particular policy, and this comes from the calculated stream of removed symptomatic infectives, coming at rate ρi_j(t) in age group j. most of these will have recovered, but a percentage will need hospitalization, and of those a percentage will need critical care. the probabilities depend on the age of the patient, with older patients at much higher risk; estimates are given in [ ] and are quoted in [ ]. so we calculate the rate at which new critical care beds are required.
based on an estimate for the number of days a critical care patient needs a bed (taken to be days), and knowing the total number of available critical care beds, we can keep a running count of the number of critical care beds in use, and then see how many of the incoming patients for critical care can be accommodated. those who can be accommodated survive with probability p_cc (taken to be . ); those who cannot are assumed to die. it is assumed that younger patients always take priority in allocating limited resources. the code is built around the data assumptions in [ ] , who use nine age groups, - , - , - , - , - , - , - , - , and +. the probabilities of hospitalization and critical care need for these age groups are estimated by verity et al. [ ] . the population numbers for these age groups come from the statista web site (https://www.statista.com/topics/ /uk/). the number of critical care beds in england at the end of was around , with around more planned at the emergency nightingale hospitals, so as an optimistic figure we took their combined total to be the number available. the mean infectious period was taken to be days, in line with values in [ ] , but it seems this can be highly variable. various values were tried for p, the probability of an infected person being symptomatic, but the baseline for this parameter was . . infectivity was taken to be . , in line with values proposed by [ ] , but again there appears to be quite a wide range of possible values, as we see from [ ] . the contact matrix values m_o, m_h are derived from [ ] ; they work with different age ranges, so some pre-processing of their data had to be done; the code for this is available from the author on request. the code for the calculations was written in python, and is available in the jupyter notebooks for the reader to scrutinize and experiment with.
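the running bed count and age-priority allocation can be sketched as follows. this is our own minimal reading of the scheme, not the notebook's code: bed turnover (patients freeing beds after the assumed length of stay) is omitted, and survival is counted in expectation.

```python
def allocate_critical_care(new_demand_by_age, beds_in_use, n_beds, p_cc):
    """Allocate one batch of incoming critical-care patients to free beds.

    new_demand_by_age: incoming patients per age group, youngest first,
        since younger patients always take priority in the text.
    Returns (beds_in_use_after, expected_survivors, expected_deaths).
    Accommodated patients survive with probability p_cc; patients who
    cannot get a bed are assumed to die.
    """
    free = max(n_beds - beds_in_use, 0)
    admitted, turned_away = 0, 0
    for demand in new_demand_by_age:    # youngest group served first
        take = min(demand, free)
        free -= take
        admitted += take
        turned_away += demand - take
    expected_deaths = turned_away + (1 - p_cc) * admitted
    expected_survivors = p_cc * admitted
    return beds_in_use + admitted, expected_survivors, expected_deaths
```

in a full simulation this would be called each day, with beds released again after the assumed critical-care stay.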
the first approach was to take the objective and minimize it using the scipy routine minimize, which acts as a wrapper to fourteen different methods, only a few of which were possibilities due to the constrained nature of the problem. the only routine which managed acceptable runtimes was slsqp, but it turned out that for virtually all randomly-chosen starting points, the end point was the same as the start point; this suggested the method which is used in the jupyter notebook, which is simply to randomly generate control rules of the form discussed above and focus on those which do best. it is of course impossible to present more than just a few cases, but we can explain what the default values for all the relevant parameters are, and then show how the outputs vary as some of them are changed. as initial values, we assume there are asymptomatic infecteds in each of the age groups, and the initial vector ϕ is ϕ = [ , , , , , , , , ] * φ/ . the costs of lockdown are supposed to be less severe for the older age groups, so we use qcost = [ , , , , , , . , . , . ] * sc. as mentioned before, we took the number nbeds of critical care beds to be . we ran the calculation for days (except in the do-nothing example, which ran for days). we insisted that lockdown ends for all but the oldest age group ( +) by day , and we imposed the condition that social distancing reaches its end value s_end by day . in this base case, we shall take sd = . and ϕ = φ, which is the situation where no social distancing and no lockdown happens. there are , deaths, and using the proposed cost parameters, the cost of deaths is bn, the cost of social disruption is . bn. in this scenario, the epidemic is short and massive; as we see from figure , everything is over in about days, with a peak number of new daily cases for critical care of , , and for hospital admissions of almost , .
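the fallback of randomly generating control rules and keeping those which do best can be sketched as below. the policy parameterization (per-group release days plus a single social-distancing ramp) and the trial count are illustrative choices of ours, not the notebook's exact ones.

```python
import random

def random_policy(rng, n_groups, horizon):
    """Draw one candidate control rule: a lockdown release day per age
    group plus a single-stage social-distancing ramp (sd0, u, v)."""
    return {
        "release": sorted(rng.uniform(0, horizon) for _ in range(n_groups)),
        "sd0": rng.uniform(0.2, 1.0),
        "u": rng.uniform(0, horizon / 2),
        "v": rng.uniform(horizon / 2, horizon),
    }


def random_search(objective, n_trials, n_groups=9, horizon=365, seed=0, keep=10):
    """Generate many random control rules, score each with the objective
    (total cost of the resulting epidemic), and keep the best few."""
    rng = random.Random(seed)
    scored = [(objective(p), p)
              for p in (random_policy(rng, n_groups, horizon)
                        for _ in range(n_trials))]
    scored.sort(key=lambda pair: pair[0])
    return scored[:keep]
```

the objective passed in would run the epidemic model under the candidate policy and return the combined lockdown, distancing, and death cost.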
figure shows that the critical care provision is completely swamped, with nearly , critical care cases unable to get a critical care bed and therefore dying without the necessary care. it is hard to imagine how such a scenario could be thought acceptable. in this scenario, fairly tight lockdown and social distancing measures are applied from the beginning and gradually relaxed. the costs of lockdown and social distancing this time amount to bn, the death costs to . bn, and the total number of deaths was , . the load of new cases is much more manageable, with a peak of just over , new critical care cases, and about , new cases in all. all but the two oldest age groups are out of lockdown within days, but looking at figure we see that even after days the epidemic is far from over; once the oldest group is let out of lockdown and social distancing has come to an end, the epidemic starts to take off again. most worrying here is that from days on, every single critical care bed is taken by a covid- patient, and thousands of elderly patients needing a critical care bed are unable to obtain one. this supports the proposition that some form of social distancing will have to be maintained for a very long time if no treatment or vaccine can be found. next we see what happens if the infectivity is in fact higher than the middle-case value of . suggested in [ ] . this time, lockdown and social distancing costs remain at around bn, death costs are about . bn, and the total number of deaths is , . the general picture looks like the previous situation but more accentuated; there is a clear second surge after the oldest age group is released from lockdown, and some , die without the critical care they need as the hospitals are submerged with cases. this time, saturation of the critical care facilities begins around day and keeps going. even maintaining social distancing at % is not sufficient to hold back the epidemic in the longer run.
if the probability that an infective is symptomatic is reduced to . , the outcome is improved, with death costs around . bn, lockdown costs little changed, and total deaths reduced to , . figure shows two pronounced peaks to the infection, the second again coinciding with the final relaxation of restrictions. the critical care capacity only saturates at around day this time. the epidemic is on a smaller and more manageable scale; peak admissions to critical care are just over , peak hospital admissions just over . this is not surprising, since the proportion of those infected who are symptomatic (and therefore open to possible complications) is lower. however, there are more undetected asymptomatic infecteds going about in the population, so the number of deaths is higher than in the base case; it is clear from the pictures that towards the end the epidemic is beginning to get out of control. in this scenario, we find the costs of lockdown and social distancing to be reduced to bn, and death costs around . bn. the number of deaths is , . what is most clear from figure is that from the time that the - age group is released from lockdown around day , the epidemic gradually gets more out of control, with critical care at full stretch from day onwards, and the number of older patients needing critical care and dying without it still growing at the end of the run. of course, it is only possible to display a few examples, which barely begin to explore the diversity of behaviour that will arise as parameters are varied. this is the purpose of the jupyter notebook, which can be found at https://colab.research.google.com/drive/ tbb usgia wehy-hviygdo mpnzu a. conclusions. this paper offers a simple model for the current covid- epidemic; no account is taken of spatial effects, which could make a big difference to any conclusions. the treatment of the spread of the infection in the home is an approximation, perhaps plausibly based, but still an approximation.
nevertheless, the modelling assumptions are simple and compact, and permit rapid exploration of possible responses of a non-pharmaceutical nature. the calculations require assumptions about the initial state of the epidemic which are essentially guessed. even coming into the epidemic once it is under way, it would be hard to get reliable values for the numbers of asymptomatic, susceptible and immune people in the population, not least because there is at the time of writing no test to determine whether someone has had the infection and is now immune, and only a rather unreliable test of whether an individual currently has the infection. no account is taken of parameter uncertainty. this is a natural area of enquiry, but at the moment it seems that the data that would support strong conclusions is not yet available. as it seems that the key parameters are known with very little precision, a highly detailed model, or a sophisticated story about statistical inference, may help less than some rough exploration of possible parameter combinations; as the epidemic evolves around the world, we will undoubtedly learn more of its characteristics, which will allow us better to control it.
references:
- infectious diseases of humans: dynamics and control
- networks of queues and the method of stages
- impact of non-pharmaceutical interventions (npis) to reduce covid mortality and healthcare demand
- early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia
- projecting social contact matrices in countries using contact surveys and demographic data
- age-structured impact of social distancing on the covid- epidemic in india
- social contact patterns and control strategies for influenza in the elderly
- estimates of the severity of coronavirus disease : a model-based analysis. the lancet infectious diseases
it is a pleasure to thank josef teichmann, kalvis jansons, ronojoy adhikari, rob jack, philip ernst and mike cates for illuminating discussions.
as economists will insist on noting, they are not responsible for the errors herein. key: cord- -c c op authors: cheng, yung-hsiang; chang, yu-hern; lu, i.j. title: urban transportation energy and carbon dioxide emission reduction strategies() date: - - journal: appl energy doi: . /j.apenergy. . . sha: doc_id: cord_uid: c c op sustainability is an urban development priority. thus, energy and carbon dioxide emission reduction is becoming more significant in the sustainability of urban transportation systems. however, urban transportation systems are complex and involve social, economic, and environmental aspects. we present solutions for a sustainable urban transportation system by establishing a simplified system dynamics model with a timeframe of years (from to ) to simulate the effects of urban transportation management policies and to explore their potential in reducing vehicular fuel consumption and mitigating co( ) emissions. kaohsiung city was selected as a case study because it is the second largest metropolis in taiwan and is an important industrial center. three policies are examined in the study including fuel tax, motorcycle parking management, and free bus service. simulation results indicate that both the fuel tax and motorcycle parking management policies are suggested as potentially the most effective methods for restraining the growth of the number of private vehicles, the amount of fuel consumption, and co( ) emissions. we also conducted a synthetic policy consisting of all policies which outperforms the three individual policies. the conclusions of this study can assist urban transport planners in designing appropriate urban transport management strategies and can assist transport operation agencies in creating operational strategies to reduce their energy consumption and co( ) emissions. 
the proposed approach should be generalized in other cities to develop an appropriate model to understand the various effects of policies on energy and co( ) emissions. sustainable development has become a worldwide priority.
sustainable development is viewed as the development that meets the current needs without compromising the ability of future generations to meet their own needs [ ] . the transportation sector is important as it relates to sustainability because this sector supports the economy and most social activities and has substantial environmental impact [ ] . thus, a well-established urban transportation system should not only harmonize economic growth with land-use planning and promote the use of public transit systems but also conserve resources and be environmentally friendly [ ] [ ] [ ] . according to key world energy statistics [ ] , the aggregate energy demand of the global transport system increased from % in to % in . the world energy outlook [ ] reported that the transportation sector will account for % of the growth in petroleum consumption between and . this finding indicates that the increasing use of motor vehicles will accelerate resource exhaustion and global warming, despite its promotion of road transportation mobility. in taiwan, the road transportation system not only facilitates the mobility of people and goods over space and time but also is essential for the industrial and economic development of taiwan's trade-oriented economy. according to taiwan's statistical abstract of transportation and communications [ ] , the number of registered vehicles in taiwan rose from . million in to . million in september . this rise was a consequence of the increase in individual disposable income, the opening of the first national north-south expressway in , and the subsequent improvement of the highway infrastructure: a second national north-south expressway, a west coast highway, and an east-west highway, among others. along with the rapid growth of the number of motor vehicles, energy consumption in the road transportation sector reached the equivalent of , kl of oil in , which was . times higher than that in , and accounted for . 
% of aggregate transport fuel demand. the amount of co emitted by the road transportation system increased at . % per annum, from . million tons in to an estimated . million tons in . under the pressure of global warming and significant fluctuations in fuel prices, we face issues related to humanity-oriented transportation, energy conservation, and co mitigation, which have already become important topics in transportation planning and management. the ministry of transportation and communications (motc) in taiwan invested nt$ billion from to to reduce the number of private vehicles driven and the amount of fuel consumption and co emissions through public transportation promotion programs. many academic works have focused on co emission and energy consumption in the urban system context [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . however, interactions between the various transportation subsystems were not considered. moreover, a systematic approach covering more aspects of the urban air pollution problem is still lacking to examine the effects of various transport policies. an urban transportation system is complex and involves a variety of social, economic, and environmental issues. interpreting the inherent mechanisms of the system and capturing the dynamic behavior of its components with analytical methods, such as decomposition analysis, grey theory, least-squares regression, and the geometric average method, are not easy because the database is limited and because these subsystems are interlinked and dependent on each other. system dynamics (sd) provides a simulation platform to analyze a large-scale and complex socioeconomic system with multiple variables that change over time. with the aid of an sd model, we selected kaohsiung as a case study to explore the effects of variations in demographics, fuel prices, and economic growth rate, among other factors, on the number of vehicles, fuel consumption, and energy-related co emissions.
in addition, we developed three scenarios based on the possible policies that could be adopted by the city government to simulate their potential both for reducing vehicular fuel consumption and for mitigating co emissions in kaohsiung. sd, which is based on systems theory, is a method for analyzing complex management problems with cause-effect relationships among different systems. industrial dynamics [ ] was the first book to illustrate the influence of organizational structure, policies, and action delays on industrial activity. an urban dynamics model was then constructed to show the effects of the interactions among business, housing, and people on the growth pattern of a city. finally, a large and complex socioeconomic simulation system, i.e., world dynamics, was developed [ ] . its findings suggested that the world socioeconomic system might collapse if actions are not taken to slow population growth and the continuous and unrestrained exploitation of natural resources. in recent years, the sd model has been widely used to analyze agricultural systems [ , ] , environmental management and planning [ , ] , industrial sectors [ ] [ ] [ ] [ ] [ ] [ ] [ ] , strategy planning and decision making [ ] [ ] [ ] , transportation systems [ ] [ ] [ ] [ ] [ ] , urban planning [ ] [ ] [ ] [ ] [ ] , waste management [ ] [ ] [ ] [ ] [ ] , and water resources and lake eutrophication [ ] [ ] [ ] [ ] . the transport mode for distributing goods in germany was explored with the aid of an sd model [ ] . in addition, policy interventions such as infrastructure investments and carbon taxes were simulated to examine their effects on energy savings, co reduction, public expenditure, and economic development.
an sd model was used to evaluate the influence of the traditional supply chain and the vendor-managed inventory system on the performance of a firm's supply chain [ ] ; to examine the effects of policy scenarios on traffic volume, modal share, energy conservation, and co mitigation [ ] ; and to investigate how incorporated systems, such as population, economy, transportation demand, transportation supply, and the vehicular emission of nitrogen oxides, affect the dynamic development of urban transportation systems under five policy interventions on vehicle ownership [ ] . an sd model was developed to explore the interrelationships among population, economy, housing, transport, and urban land in hong kong; the long-term constraints of and potentials for urban development yielded by the study were offered as policy suggestions for city planning [ ] . previous relevant studies rarely considered interactions among various transportation subsystems simultaneously with co emission and energy consumption. although certain developed countries, such as the united states, the united kingdom, and members of the european union, have focused on improving fuel efficiency using advanced technologies [ , ] , few studies have developed a practical sd approach for urban planners to further assess the effect of urban transportation policies on energy consumption and co emission. this study examines three main urban transportation policies in our proposed model: fuel tax, motorcycle parking management, and free bus service. prior studies mainly investigate the effect of a particular policy, such as fuel tax in europe and the us [ ] ; parking management policies in china [ ] ; and free bus policy in japan, belgium, and england [ ] [ ] [ ] . few studies analyze the effects of these policies on energy consumption and co emission reduction simultaneously or compare the individual policies with a synthetic policy to assess their relative effectiveness.
our study aims to fill this research gap by developing a systematic and simplified analytical tool that can help urban planners to evaluate the influence of various transportation policies on energy consumption and co emission reduction. briefly, an sd model describes the information, structural boundaries, strategies, and action delay inside the system structure through a feedback process. a quantitative simulation is performed to study the dynamic behavior of the interaction of interrelated components inside the system structure. the sd model analyzes a complex system with multiple variables that change over time and determines how the system is affected by the implementation of specific policies [ ] . in addition, kummerow [ ] revealed that the sd model not only relatively easily incorporates qualitative mental and written information as well as quantitative data but also can be used when the database is insufficient to support statistical forecasting analysis. thus, the sd model is an appropriate approach to display the inherent behavior and influences inside the system structure despite multidirectional dynamic interactions and the fact that life is infinitely more complicated and difficult than we can effectively simulate [ , , , , , ] . although an sd model is an appropriate approach to simulate a complex and multidirectional dynamic system by constructing mathematic functions, it is a subjective and time-consuming operation. the causal relationships of the sd model are based on the subjective judgment of the operator, reference suggestions, data availability, and information acquisition. thus, the simulation result will change if the operator adopts different stock and flow variables. in addition, error analysis based on historical statistical data should be evaluated to ensure that the forecasted results are accurate and efficient and that the causal relationships used are reasonable. an sd model contains two parts. 
the first part is a causal-loop diagram that describes an idea, both conceptually and as a set of simplified cause-effect relationships between the different systems developed during model construction. the second part is a stock-flow diagram that represents the quantitative relationships among variables. a more detailed description follows. the relationships of real urban transportation systems are not likely to be simple, but the sd model offers an opportunity to show with arrows how interrelated variables in a system affect one another. a plus or minus sign indicates the direction of the variation between two variables: the "+" sign indicates that a change in one variable causes another variable to change in the same direction, and the "−" sign indicates that one variable causes another to change in the opposite direction. fig. shows the causal loop of the sd model for an urban transportation system (more explanation is given in section ). a stock-flow diagram has four components: stocks, flows, auxiliary variables, and arrows (appendix a). the stock variables are represented by labeled rectangles, e.g., "individual disposable income" and "urban population." each stock variable accumulates all the values that flow into and out of it (indicated by the thick heavy arrows pointing from and to the stock variables, such as "increases in individual disposable income") and reflects the condition within a system at a specific point in time. stock variables can be changed only through flows. thus, the value of a stock variable is controlled by the pipes (the thick heavy arrows with a valve in the center and a cloud symbol at the end) pointing into or out of the stock variable. a flow variable refers to the rate of change over a certain interval of time. an auxiliary variable is an intermediate variable used to show the informational transformation process, environmental parameter values, or systematic test functions or values.
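the stock-and-flow mechanics just described amount to numerically integrating flow rates into a stock. a minimal sketch (our own illustration, not taken from the study's vensim model):

```python
def simulate_stock(initial, inflow, outflow, steps, dt=1.0):
    """Advance a single stock variable: at each step it accumulates the
    difference between its inflow and outflow rates, which is exactly the
    role of a stock such as 'urban population' in a stock-flow diagram.

    inflow, outflow: functions of (stock, t) returning a rate per year.
    Returns the full trajectory of the stock.
    """
    stock, path = initial, [initial]
    for step in range(steps):
        t = step * dt
        stock += (inflow(stock, t) - outflow(stock, t)) * dt
        path.append(stock)
    return path
```

for example, a population stock with a constant natural growth ratio of 1% per year (an illustrative number) compounds accordingly over ten annual steps.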
the causal relationship between variables is depicted by the curved blue arrows. the city of kaohsiung in southwestern taiwan comprises an area of , ha ( . square miles or . square kilometers). kaohsiung city is the second largest metropolis in taiwan and offers air, land, rail, and sea transportation. air and sea transport traffic determines the industrial structure share and the scope of city development. the kaohsiung harbor is an important transport point for the taiwan straits and the bashi channel. the kaohsiung international airport has airlines flying worldwide through air routes. kaohsiung is not only important for the import and export businesses in taiwan but also taiwan's industrial center because of the predominance of the international harbor and airport. heavy industries, such as steel-making, refining, shipbuilding, and those involved in the manufacture of petrochemicals and cement, as well as two export-processing zones in kaohsiung and neighboring nantse have significantly accelerated the diversity of local industrial activities and turned kaohsiung into the most important industrial and commercial center in southern taiwan. the population of kaohsiung rose from . million in to . million in . with the urbanization and internationalization of kaohsiung, individual disposable income has also increased: in , it was . % higher than in , with an average annual growth rate of . %. the number of motor vehicles in the city grew at an annual rate of . % over the past years, reaching . million in . among the . million vehicles, . % and . % represent the number of private cars and the number of motorcycles, respectively. the percentages of light trucks, heavy trucks, and city buses were . %, . %, and . %, respectively. vehicle ownership rates for private cars and motorcycles were and vehicles for every people. 
in this study, the sd model includes seven subsystems: urban population, individual disposable income, private cars, motorcycles, light trucks, heavy trucks, and city buses (appendix a). the size of the human population is the foundation of a city's development, and issues such as the growth rate in the number of motor vehicles, vehicular energy consumption, and co emissions are derivatives of the interaction between human population and economic activities. based on this assumption, the subsystem of the transportation mode and the related variables [i.e., vehicle kilometers of travel (vkt), vehicular fuel efficiency, transfer ratio among modes, emission coefficient, and other factors] were added to the model after the dynamic behavior of urban population and individual disposable income had been determined. furthermore, using commercial simulation software (vensim ; ventana systems, inc., harvard, ma), the causal relationships between the various components within the system were simulated from to . vensim is herein used to develop, analyze, and package highquality dynamic feedback models. models are constructed graphically or in a text editor. features include dynamic functions, subscripting (arrays), monte carlo sensitivity analysis, optimization, data handling, application interfaces, and more options. vensim is an interactive software environment that allows the development, exploration, analysis, and optimization of simulation models [ ] . fig. shows the causal-loop of the sd model for the urban transportation system. economic growth increases the number of motor vehicles and attracts more migrants from other cities. the amount of energy requirement and co emission will rise as the number of motor vehicles increases. however, the increase in the amount of co will reduce the growth rate of the urban population. simultaneously, the number of motor vehicles will decrease with the reduction of urban population. 
economic development affects population, wherein economic growth leads to an increased number of vehicles [ ] . therefore, we assume that economic growth positively affects the number of motor vehicles and population growth. moreover, fuel use can positively influence co emission [ ] . we thus assume that energy consumption can positively affect environmental issues (co emission). the number of private vehicles and buses positively affects traffic congestion and energy consumption [ ] . we assume that the number of city buses and the number of motor vehicles both positively affect traffic density and energy consumption. in addition, a tax policy on fuel can reduce the fuel consumption of motor vehicles, leading to reduced co emissions [ ] . traffic density is also a significant and robust predictor of inhabitant survival, more so than ambient air quality [ ] . we therefore assume an association between traffic density and co emission, as well as a negative effect of traffic density on population growth. in fig. , the p values of each variable are less than . , which indicates statistical significance. in kaohsiung, one out of two residents owns a motorcycle, whereas one out of three owns a car. these residents are accustomed to the convenience, independence, and flexibility provided by private vehicles. energy consumption and co emission issues are mainly derived from private vehicles. therefore, we mainly focused on the influence of private vehicles in our case. less than % of the population uses the taxi service because of its higher charge compared with other forms of transportation [ ] . moreover, the uber platform service is not currently allowed to be officially operated by the authority, and this service is thus less popular in kaohsiung. road motor vehicles account for the relatively large co emission and energy consumption ( .
%) compared with the metro system that, having its own electrification system, supplies electric power for movement without a local fuel supply [ ] . therefore, other possible variables, including vehicle technology, emission legislation, and the age of the automobile fleet, are not currently pressing for kaohsiung at this stage and can be examined in future studies. contributions of our study include the use of a systematic approach to examine energy and co emission reduction by implementing various transport policies in the urban transportation context. the proposed approach is useful in other cities, whose specific features should be considered to develop an appropriate model to understand the various effects of policies. we considered two equations to explain the effects of individual disposable income and an economically active population on the number of motorcycles in the main text. linear least-squares regression analysis was performed to reflect the effects of individual disposable income and an economically active population on the variation of private cars. the sd model is an approach to understanding the behavior of complex systems over time. sd can estimate fuel price while considering the time effect; therefore, we used the delay i function to consider this time effect. this delay function can be used in equations as in normal sd modeling and is frequently used in sd for modeling postponed effects. we adopted it to model the effect of fuel price on the decrease in private car use, so that the postponed effects of fuel price are considered in our model. the decrease in private car use associated with an increased fuel price can be estimated by the following formula: decrease in private car use = private cars × ( . %/ . % × (fuel price − delay i(fuel price, , )) / delay i(fuel price, , )), where .
% is the probability that the number of private cars will decrease when fuel prices increase by . %. the size of the human population not only reflects the scale of urban development but also drives the transport demand. the size of the urban population was selected as a stock variable, and natural changes in the population and changes caused by social migration were selected as flow variables, because the size of the human population is affected by both. the formulation of the natural changes in the population is expressed as the product of the human population and the natural population ratio per year, where the natural population ratio is adopted from the statistical yearbook of kaohsiung city [ ] . in addition, individual disposable income, traffic density, and aggregate co2 emissions were considered in this study to control the variations in social migration. economic performance is an important index for evaluating the competitiveness of a city. if individual disposable income grows, then the number of migrants and the number of motor vehicles increase; otherwise, these values decrease. therefore, individual disposable income was chosen as a stock variable dependent on the growth ratio of the gdp [ ] . furthermore, the prediction of the global insight database [ ] on the future gdp growth ratio of taiwan was reduced by . % to avoid an overestimate. several studies have indicated that the number of motor vehicles, as well as the number of new vehicles purchased, is closely associated with economic growth and population [ ] [ ] [ ] [ ] [ ] . we analyzed this subsystem to evaluate the effect of changes in the level of individual disposable income and in the size of the economically active population on the variation of the number of private cars. according to the survey of the motc, the number of private cars driven decreased by . % after fuel prices increased by . %.
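the delay-function mechanism described above can be sketched in a few lines of python. this is a minimal illustration, not the authors' implementation: `delay1` is a discrete first-order smoothing stand-in for the delayi function of sd tools, and `elasticity` is a hypothetical placeholder for the surveyed percentage ratio, whose actual value is elided in the text.

```python
def delay1(series, delay_time, initial):
    """Discrete first-order exponential smoothing, a stand-in for the
    DELAYI delay function used in system dynamics tools (dt = 1)."""
    smoothed = float(initial)
    out = []
    for x in series:
        smoothed += (x - smoothed) / delay_time
        out.append(smoothed)
    return out

def car_use_decrease(private_cars, fuel_price, delayed_price, elasticity):
    """Decrease in private car use from a relative fuel-price rise.
    `elasticity` stands in for the elided surveyed ratio, so a
    hypothetical value must be supplied by the caller."""
    relative_rise = (fuel_price - delayed_price) / delayed_price
    return private_cars * elasticity * relative_rise
```

because the delayed price lags the current price, a price spike only gradually feeds through to car use, which is exactly the postponed effect the delay function is meant to capture.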
thus, the effect of fuel prices on the variation of automobile use was also considered, because the price of crude oil has almost doubled since the beginning of . the auxiliary variable energy consumption by private cars was calculated by multiplying the number of private cars, the vkt, and the inverse of the average vehicular fuel efficiency (km/l), where the values of vkt and fuel efficiency were obtained from the taiwan emissions database system (teds . ). estimations of energy-related co2 emissions were determined by the product of vehicular energy consumption and its emission coefficient, published by the intergovernmental panel on climate change (ipcc). the reason for the omission of electric vehicles in the model is that electric vehicle technology in taiwan is currently in its early stages. moreover, the central and city governments did not provide strong incentives, including direct subsidies, fiscal reductions, and regulatory policies, to increase the use of electric vehicles. from the user perspective, the short driving range and slow speed of electric vehicles make them less popular in taiwan. even when considering improvements in vehicle fuel economy, we observe that the dense traffic in kaohsiung requires cars to stop and go frequently; thus, fuel economy improvements are not significant in the local context. therefore, the policy imposing a fuel tax seems to remain useful in kaohsiung. the kaohsiung mass rapid transit (kmrt) system opened for service in . the system not only provided a new lifestyle for citizens but also reduced the number of private vehicles used for commuting to work. to reflect the influence of the kmrt on vehicular fuel consumption, the transit system was also incorporated into this subsystem.
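the energy and co2 bookkeeping described above (vehicles × vkt ÷ fuel efficiency, then × an ipcc emission coefficient) can be sketched as follows. the function names, units, and the numeric values used in the test are illustrative assumptions, not figures from the paper or from teds.

```python
def fleet_energy_kl(vehicles, vkt_km, fuel_efficiency_km_per_l):
    """Fleet fuel use in kilolitres: number of vehicles times kilometres
    travelled per vehicle (VKT), divided by fuel efficiency (km/L) to
    get litres, then by 1000 to convert to kilolitres."""
    return vehicles * vkt_km / fuel_efficiency_km_per_l / 1000.0

def fleet_co2_tonnes(energy_kl, emission_factor_kg_per_l):
    """Energy-related CO2 in metric tons: kilolitres back to litres,
    times an emission factor in kg CO2 per litre, divided by 1000 kg/t.
    The emission factor is supplied by the caller (the paper takes it
    from IPCC tables)."""
    return energy_kl * 1000.0 * emission_factor_kg_per_l / 1000.0
```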
specifically, the decrease in the number of commuters who switched from using private cars to using the kmrt was estimated by combining the average number of passengers carried by the kmrt, the average number of kilometers per passenger trip, the average number of occupants per automobile, and the transfer ratio. the effect on vehicular energy consumption was then calculated by combining the decrease in private car use with the average vehicular fuel efficiency of private cars. motorcycles accounted for . % of the . million registered vehicles in kaohsiung in ; motorcycles provide greater mobility and are less expensive than other types of motor vehicles. in this study, the increase in the number of motorcycles was primarily driven by individual disposable income and the size of the economically active population [ ] [ ] [ ] [ ] . as mentioned previously, fluctuations in fuel prices affect the number of private vehicles driven and the distances that they are driven. hence, the effect of fuel prices on mode choice and mode transfer was further incorporated into the model through the operation of flow variables: mode transfer from private cars and the decrease in motorcycle use by fuel price increase (appendix a). in addition, the formulation of vehicular fuel consumption and associated co2 emissions derived from the number of motorcycles was the same as that for private cars, but the average occupancy rate (the number of passengers per motorcycle) and the transfer ratio between the kmrt and motorcycles were different. as the world's thirteenth ( ) largest international port and the largest industrial center in taiwan, the city of kaohsiung is important both for freight transportation and for industrial and commercial activities. much cargo and freight needs to be transported to the northern metropolitan areas because kaohsiung is located in the south of taiwan and is a harbor city.
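the mode-transfer estimate at the start of this passage can be sketched as below. this is a hedged reading of the described combination, not the authors' exact formulas: all parameter values are hypothetical, and the fuel saving is computed as avoided car-kilometres divided by fuel efficiency (km/L), the dimensionally consistent reading of "combining the decrease in private car use with the average vehicular fuel efficiency".

```python
def car_km_avoided(kmrt_passengers, km_per_trip, transfer_ratio,
                   occupants_per_car):
    """Car-kilometres avoided by commuters who switched to the KMRT:
    the passenger-km that transferred from cars, divided by the
    average car occupancy."""
    return kmrt_passengers * km_per_trip * transfer_ratio / occupants_per_car

def fuel_saved_litres(car_km, fuel_efficiency_km_per_l):
    """Fuel saved by the avoided car travel: car-km over km/L."""
    return car_km / fuel_efficiency_km_per_l
```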
after exiting the harbor, some heavy trucks need to pass through the city area to reach the highway system. this phenomenon is the reason for considering heavy-duty trucks in our model. the dynamic behavior of this subsystem is analyzed through the operation of a stock variable (the number of heavy trucks), a flow variable (the increase in the number of heavy trucks), and two auxiliary variables (the effect of gdp on the heavy truck function and the growth ratio of gdp). the growth of freight transport demand is primarily a consequence of the growth of economic activity [ ] [ ] [ ] [ ] . hence, the growth of gdp was selected as the motivational factor in this study to reflect the variation in the number of heavy trucks. the auxiliary variable effect of gdp on the heavy truck function was constructed based on the concept of a table function, which is a graphical tool that captures the causal and non-linear relationship between two variables. business activities and commercial services, such as food markets, street vendors, bazaars, superstores, cargo carriers, and other such entities, are closely linked with the number of light trucks. thus, the gdp growth rate was selected as an auxiliary variable to reflect the effect of economic development. in this subsystem, the number of light trucks was defined as a stock variable, and the change in the number of light trucks was defined as a flow variable driven by the growth of gdp through a table function. the formula used to calculate the aggregate energy demand of light trucks was the same as the one used for heavy trucks. despite the . % modal share of city buses, they were incorporated into the model to reflect a complete picture of the transportation system in kaohsiung city. in this subsystem, the number of city buses was selected as a stock variable, and its value is influenced by the flow variable the annual change in the number of city buses.
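the table function described above is, in practice, a piecewise-linear lookup. a minimal sketch, with hypothetical breakpoints (the paper's actual table values are not given in the text):

```python
def table_function(points, x):
    """Piecewise-linear lookup, the SD 'table function' used to map a
    driver (e.g. GDP growth) to a response (e.g. heavy-truck growth).
    `points` is a sorted list of (x, y) pairs; inputs outside the
    tabulated range are clamped to the endpoint values."""
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

clamping at the endpoints mirrors how sd tools treat out-of-range lookups, which keeps extreme gdp scenarios from extrapolating the curve indefinitely.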
the historical values of auxiliary variables and the government-set target determined the number of city buses through the feedback loops. to improve the quality of the city bus service, the city bus operation agencies have added buses since by adjusting the frequency and routes of city buses, releasing government-run routes to private enterprises, enhancing real-time bus information, and upgrading the service quality.
discussion of analytical results
to test the effectiveness of the proposed model, the simulation results were validated by comparing the estimated values with their historical trends [ , , , , , ] . the examined variables included urban population, individual disposable income, motorcycles, private cars, light trucks, and heavy trucks (tables and ). the model developed in this study appears to be reasonable because the relative errors were all less than % [ ] . the behavior analyzed using the reference model was simulated from to based on existing socioeconomic conditions and policies. the decline in the natural population of taiwan over the past years has lowered the growth rate of the urban population. a decreasing natural population is both the current and the future trend in most developed countries. our simulation predicted that in , the population of kaohsiung would gradually decline to . million, , fewer than today's population (table ). global insight projected that the annual economic growth rate of taiwan in would be . % higher than it is in . however, in the next years, the growth rate of the gdp of taiwan is expected to be lower than those of the past two decades. given this slowdown in economic activity, individual disposable income will grow at only a moderate rate. for example, our simulation predicts that the annual growth rate of individual disposable income from to will be . %, which is lower than previous rates. our simulation also predicts that this income will reach nt$ , in ( us dollar = nt$ ).
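the validation criterion used above (every relative error between simulated and historical values below a percent threshold) can be sketched as follows. the exact threshold is elided in the text, so the tolerance here is a caller-supplied assumption.

```python
def relative_errors(simulated, observed):
    """Pairwise relative error between simulated and historical values."""
    return [abs(s - o) / abs(o) for s, o in zip(simulated, observed)]

def model_acceptable(simulated, observed, tolerance):
    """True when every relative error stays under `tolerance`; the
    paper's single-digit-percent threshold is elided, so the caller
    supplies a value (e.g. 0.10 for 10%)."""
    return all(e < tolerance for e in relative_errors(simulated, observed))
```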
the simulation indicated that the number of motorcycles will increase by , vehicles between and , a growth rate of . % per year. this growth in the number of motorcycles and private cars is attributed to the size of the economically active population, the level of individual disposable income, and the variation of fuel prices. the simulation also estimated that the number of private cars in will be , , a decrease of . % over years. economic weakness will also cause a slow growth rate ( . %) in the number of heavy trucks until , when the , such vehicles will represent an increase of . % compared with the number in . the effect of a lower gdp growth rate on the number of light trucks will be limited because they are used for daily commodity exchanges and business transactions. the simulation showed that the number of light trucks will grow by an average of . % per year and will reach , vehicles in . after , the aggregate energy consumed by motor vehicles will increase by . % until . the aggregate increase in co2 emissions will be nearly , metric tons between and , which is . % higher than the emission level in . most of our simulated results have an estimation error lower than %, with rare exceptions. therefore, the prediction capability of our model is acceptable [ ] . the main percentage errors are concentrated between and , likely because of the severe acute respiratory syndrome (sars) outbreak in taiwan between and . sars caused widespread social disruption and economic losses, and its economic effect in taiwan was considerable. moreover, after taiwan's first experience of party alternation in , the government system experienced instability in its early stages, which negatively affected economic conditions and motor-vehicle growth. these major unusual events caused the disturbance in our model predictions during this period.
we could not fully assess the real effect of co2 emission and energy use reduction under various transportation policies because the data were limited. to demonstrate the accuracy of our proposed model, a comparison was performed between real data from to and the estimated number of motor vehicles in the reference model during the same period, after a free bus policy was implemented in . the deviation between simulated and real data is within %, which is reasonable [ ] . a likely reason for the reduced population in is that increasing labor costs have encouraged numerous manufacturers to leave kaohsiung, which has reduced the number of residents in the city. among all strategies for sustainable transport policy, programs that encourage the use of the public transportation system through benefits, such as subsidies, free transfers, or transfer discounts, and through deterrents (e.g., restraining the use of private vehicles by parking management and levying taxes on fuel oil), are the most discussed and encouraged in taiwan. furthermore, the taiwanese government is considering an additional nt$ . per liter tax on fuel prices to reflect social justice and the user-pays principle and to restrain the use of private vehicles. thus, based on the various assumptions and the past trends of the variables in the reference model, policies including a fuel tax, motorcycle parking management, free bus service, and a synthetic policy are discussed in this study to explore their energy-saving and co2-emission-reducing potential (see tables and ) . we analyzed three scenarios with low, medium, and high oil prices (see tables - ). we used the average oil price to represent the medium price; the high oil price is estimated as the average oil price plus one standard deviation, and the low oil price as the average minus one standard deviation.
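the three oil-price scenarios defined above (mean, mean plus one standard deviation, mean minus one standard deviation) reduce to a few lines. the price series in the test is an invented illustration; the paper's historical series is not reproduced here.

```python
from statistics import mean, stdev

def oil_price_scenarios(prices):
    """Low/medium/high oil-price scenarios as described in the text:
    medium is the historical average, and high/low are one sample
    standard deviation above and below it."""
    m, s = mean(prices), stdev(prices)
    return {"low": m - s, "medium": m, "high": m + s}
```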
this study examines appropriate urban transportation policies to mitigate global warming, driven mainly by co2 emissions. nitrogen oxides (nox), hydrocarbons (hc), co, and soot emissions affect the health of urban populations. however, due to data limitations, we assume that the relationship between co2 emission and nox, hc, co, and soot emissions is one of proportional equivalence. estimates of nox, hc, co, and soot emissions are also included. a detailed study of the precise toxicity of the emissions in the model can be carried out in future studies. we simulated the scenario of a fuel tax because the increase in oil price not only influences the transportation mode choice but also reduces the amount of vehicular energy consumption. therefore, oil price is a relatively direct and efficient incentive for inducing consumers to reduce private vehicle use, which lowers fuel consumption and co2 emissions. currently, a fixed fuel tax is levied per year according to the engine capacity of vehicles in taiwan. an additional nt$ per liter of fuel tax will also be considered in the next years. in our simulation, once the tax was included in the prices of gasoline and oil and levied according to the amount of fuel used, the numbers of motorcycles and automobiles used in kaohsiung were both predicted to decrease. overall, the number of vehicles in is estimated to be . million (fig. ) , which is . % lower than the base. this reduction in the number of motor vehicles caused by the increase in fuel prices will lead to changes in the modal shares of the means of transport. under this policy, the projected growth of vehicular energy consumption varies from , kl to , kl from to . during the same period, motor-vehicle co2 emissions are expected to increase by , metric tons. compared with the reference model, the energy requirements and co2 emissions in are predicted to be . % and . % lower, respectively.
the increase in the price of crude oil will not only reduce fuel consumption but will also force a transformation in traffic modes. as seen in the reference model, the number of motorcycles in kaohsiung increased because of the prior rise in fuel prices. since , the kaohsiung city government has planned a system of six regional transit centers, which are areas composed of two major and four subsidiary transit stations that link the kmrt and the shuttle bus terminals in kaohsiung into a -min access metropolitan circle. under these measures, the number of passengers carried by mass transit increased by about million. in , the taipei city government introduced a successful parking management program that prohibits motorcycles from being parked on sidewalks and in building arcades, requires payment for roadside motorcycle parking, and offers a parking-information inquiry system. the ownership of motorcycles in kaohsiung city is . %, the highest in taiwan. following the success of this policy, kaohsiung has introduced a similar system at other popular centers, such as night markets, train stations, and department stores, since april to reduce motorcycle use. in this study, the rate of shifting from motorcycles to the kmrt, city buses, and bicycles was based on a survey made by the taipei parking management office, because the motorcycle parking management system in kaohsiung is still under implementation (fig. ) . the number of motor vehicles driven in kaohsiung will sharply decrease when the city introduces and enforces its parking management policy. our simulation estimated that the number of motor vehicles will decline to . million at the end of , which is . % fewer than in the reference model. concurrently, fuel consumption and co2 emissions will be . % and . % lower than in the base model.
in this scenario, the simulation assumed a % increase in bus ridership if both free bus service and discounted tickets for kmrt-to-bus transfers were extended to the other weekdays. the simulation outcome indicates that the number of vehicles in kaohsiung city will decrease by . % (fig. ) . in , the number of motor vehicles will reach . million, whereas the vehicular fuel requirement will decrease by only . %. by , the need for vehicular fuel will increase by , kl. the change in energy consumption also implies an estimated increase of co2 emissions to . million metric tons in , which is . million more than in . to evaluate the maximum potential for vehicular fuel and co2 reduction, the interventions acting together as a package of measures were also considered in this study. according to the simulation, the number of vehicles in kaohsiung city will decrease to . million in , which is . % lower than in the base model (fig. ). the vehicular fuel requirement will increase slightly from , kl to , kl from to ; at the end of , the value is . % lower than that in the reference model. the growth patterns of co2 emission and energy demand are similar because the variation of co2 levels is directly related to energy consumption. thus, the aggregate co2 emission will be . million metric tons by , which is . million metric tons lower than the emission amount in the base model. from the forecasted patterns, the aggregate co2 emission needs to be reduced by about . million metric tons compared with the emission level in . this result implies the difficulty and urgency of co2 mitigation in kaohsiung city, even with the synthetic policy considered in the sd model.
the fuel tax scenario confirms that oil price is a relatively direct and efficient incentive for inducing consumers to reduce private vehicle use, which lowers fuel consumption and co2 emissions. despite the limited effect of the separate policies of motorcycle parking management and free bus service on reducing vehicular fuel consumption, the government was able to reduce the number of private vehicles in use and promote the use of the public-transit system. thus, we suggest that all three policies can be implemented simultaneously to restrain the growth of the number of private vehicles, motor-vehicle fuel consumption, and co2 emissions in kaohsiung. with regard to the effect of the various policies, the number of motor vehicles, co2 emission, and energy consumption decreased significantly between and , probably because the global financial crisis slowed economic development during this period (see figs. - ). the sd model is not only able to analyze a system with many interrelated variables but is also able to describe its dynamic trends based on a limited information set. by using a simplified sd model, which we constructed to analyze issues of urban population, disposable income, number of motor vehicles, vehicular energy consumption, and co2 emissions, we conclude that the fuel tax policy is the most effective method to reduce vehicular fuel consumption and co2 emissions. this policy is even more effective than the motorcycle parking management and free bus service policies. according to the investigation of the motc of taiwan, fluctuations in fuel prices affect the number of private vehicles driven and the distances they are driven. for instance, the use of private cars and motorcycles decreased by . % and .
%, but the rate of transfer from driving private cars to driving motorcycles was . % when the average price of gasoline increased by . %. the simulation of a fuel tax also suggests that the increase in fuel prices will lead to changes in the modal shares of the means of transport. the number of motor vehicles in kaohsiung will decline by . % in , with a . % decrease in the actual number of registered motor vehicles in the city between and . the fuel tax will also cause a considerable reduction in the growth rates of vehicular fuel use and co2 emissions. the motorcycle parking management policy will also cause a . % decrease in the number of motor vehicles by , as well as . % and . % reductions in fuel demand and co2 emissions, respectively. an extensively implemented free bus service will reduce the number of motor vehicles and the fuel requirement by only . % and . %, respectively. furthermore, the maximum potential for vehicular fuel consumption and co2 reduction is achieved in the scenario in which all the interventions act together as a package of measures. in , the aggregate vehicular energy requirement and co2 emission will reach , kl and . million metric tons, respectively, which represents a . % and a . % decrease in energy requirement and co2 emission compared with the reference model. simulation results indicate that the fuel tax and motorcycle parking management policies are potentially the most effective methods for restraining the growth of the number of private vehicles, the amount of fuel consumption, and co2 emissions. a synthetic policy combining all the measures outperforms the three individual policies. compared with other countries, taiwan is densely populated (its average population density is persons/sq. km as of ) and has limited energy resources.
in terms of energy consumption, the taiwanese economy is sensitive to oil price variations because the country lacks conventional energy resources and is highly dependent on energy imports (nearly % of total energy consumption). similar to the case of south korea, road transportation in taiwan accounts for more than % of the co2 emission of the transport sector [ ] . taiwan is not yet a member of the united nations framework convention on climate change. the country's co2 emission increased significantly over the past two decades, making taiwan the rd largest co2 emitter in the world [ ] . taiwan's transportation sector accounted for % of the country's co2 emission in . taiwan, which has recently transformed from a developing country into a developed country [ ] , pursues economic development even with limited energy resources. therefore, finding a compromise between economic development and energy consumption as well as co2 emission is a critical issue for taiwan. many transferable lessons can be learned from taiwan's experience, which can be a useful reference for countries with analogous characteristics, such as economic development pattern, high population density, and high energy dependence. with respect to the generalizability of the proposed model, this study proposes policies to restrain the use of private vehicles, for example, by increasing the fuel tax and launching a strict motorcycle parking management strategy. this study also examines the policy of providing free bus service from the perspective of increasing the public transportation service supply and enhancing service quality to decrease urban transportation energy consumption and co2 emission. in this study, we present the example of kaohsiung, a city that is highly dependent on private vehicles (i.e., every two residents have one motorcycle, and every three residents have one private car).
the lessons from kaohsiung are applicable to other cities with similar population density, urban environment, and economic development pattern, especially asian cities, such as bangkok, kuala lumpur, and ho chi minh city, which are characterized by the high popularity of motorcycles and limited public transportation services. the proposed sd model simultaneously examines the influence of factors, including gdp evolution, population growth, and individual disposable income, on the urban transportation energy consumption and co2 emission of various urban transportation systems. the model also considers the interactions among these factors over time to assess the effectiveness of various urban transportation policies. cities can modify our proposed approach according to their specific urban environment, economic development pattern, and public transportation service level to derive an appropriate model for understanding the influence of urban transportation policies on energy consumption and co2 emission. the sd model can also be applied to other programs, such as urban planning, low-emission vehicles, speed limits, high-occupancy vehicle control lanes, strengthening energy conservation standards for new vehicles, and other aspects of transportation. such applications provide a helpful reference for city governments in urban development planning and in setting policies associated with transport-related energy. implementing a free bus policy requires a certain amount of funds to subsidize the ticket prices of passengers. in , the central government provided . million us dollars to kaohsiung to implement a free bus policy for two months. the motorcycle parking management and fuel tax policies need extra administration and resources to cover their costs. compared with the latter policies, implementing a free bus policy seems to be more costly. among the three proposed policies, the fuel tax policy seems to be the most cost effective.
the information with respect the cost of implementing different policy measures is useful for the urban planner and the decision maker. however, due to the data limitation, the precise cost-benefit analysis of various scenarios can be implemented in the future studies. system in kaohsiung city classification notation/data source environmental sustainability: a definition for environmental professionals defining and measuring progress towards a sustainable transport system. trb sustainable transportation indicators (sti) discussion paper sustainable transport: analysis frameworks energy and exergy efficiencies in turkish transportation sector the importance of decoupling between freight transport and economic growth key world energy statistics world energy outlook . international energy agency taiwan's statistical abstract of transportation and communications. ministry of transportation and communications r.o.c estimating fuel demand elasticities to evaluate co emissions: panel data evidence for the lisbon metropolitan area scenario-based co emissions reduction potential and energy use in republic of korea's passenger vehicle fleet assessing greenhouse gas and related air pollutant emissions from road traffic counts: a case study for mauritius regional disparity of urban passenger transport associated ghg (greenhouse gas) emissions in china: a review strategies and instruments for low-carbon urban transport: an international review on trends and effects land use policies and transport emissions: modeling the impact of trip speed, vehicle characteristics and residential location a dynamic modeling approach to highway sustainability: strategies to reduce overall impact an integrated approach to improving fossil fuel emissions scenarios with urban ecosystem studies effect of resource allocation policies on urban transport diversity synthesising carbon emission for mega-cities: a static spatial microsimulation of transport co from urban travel in beijing industrial 
dynamics world dynamics environmental sustainability in an agricultural development project: a system dynamics approach drought adaptation policy development and assessment in east africa using hydrologic and system dynamics modeling an exploration of dynamical systems modeling as a decision tool for environmental policy application of system dynamics in environmental risk management of project management for external stakeholders long-term perspectives on world metal use -a system dynamics model aggregate analysis of manufacturing systems using system dynamics and anp application of a system dynamics approach for assessment and mitigation of co emissions from the cement industry elucidating the industrial cluster effect from a system dynamics perspective an integrated system dynamics model for strategic capacity planning in closed-loop recycling networks: a dynamic analysis for the paper industry system dynamics modelling of a production and inventory system for remanufacturing to evaluate system improvement strategies reducing carbon emissions in china: industrial structural upgrade based on system dynamics strategy support models the role of system dynamics in project management analyzing price and product strategies with a comprehensive system dynamics model -a case study from the capital goods industry the control of goods transportation growth by modal share re-planning: the role of a carbon tax the impact of transportation disruptions on supply chain performance a system dynamics model of co mitigation in china's intercity passenger transport system dynamics model of urban transportation system and its application a system dynamics model for the sustainable land use planning and development a system dynamics modeling for urban air pollution: a case study of tehran, iran urban ecosystems, energetic, hierarchies, and ecological economics of taipei metropolis spatial dynamic modeling and urban land use transformation: a simulation approach to assessing the 
costs of urban sprawl energetic mechanisms and development of an urban landscape system energy management in lucknow city towards greening the us residential building stock: a system dynamics approach application of system dynamics and fuzzy logic to forecasting of municipal solid waste forecasting municipal solid waste generation in a fastgrowing urban region with system dynamics modeling modeling of urban solid waste management system: the case of dhaka city a system dynamics approach for hospital waste management a system dynamics model for determining the waste disposal charging fee in construction system dynamics of euthrophication processes in lakes a system dynamics approach for regional environmental planning and management: a study for the lake erhai basin water resources planning based on complex system dynamics: a case study of tianjin city managing water in complex systems: an integrated water resources model for saskatchewan a review on global fuel economy standards, labels and technologies in the transportation sector a study of fuel efficiency and emission policy impact on optimal vehicle design decisions distributional effects of taxing transport fuel what does a one-month free bus ticket do to habitual drivers? 
an experimental analysis of habit and attitude change the impact of ''free'' public transport: the case of brussels free bus passes, use of public transport and obesity among older people in england managerial applications of system dynamics a system dynamic model of cyclical office oversupply understanding models with vensim™ simulation with system dynamics and fuzzy reasoning of a tax policy to reduce co emissions in the residential sector traffic density as a surrogate measure of environmental exposures in studies of air pollution health effects: longterm mortality in a cohort of us veterans bureau of energy: ministry of economic affairs department of budget: accounting and statistics kaohsiung city government statistical yearbook of the interior. ministry of the interior, r.o.c vehicle ownership to : implications for energy use and emissions income's effect on car and vehicle ownership link between population and number of vehicles determinants of car ownership in rural and urban areas: a pseudopanel analysis transport scenarios in two metropolitan cities in india: delhi and mumbai reliability of territory-wide car ownership estimates in hong kong an econometric analysis of motorcycle ownership in the uk exploring the vehicle dependence behind mode choice: evidence of motorcycle dependence in taipei a multivariate cointegrating vector auto regressive model of freight transport demand: evidence from indian railways transport intensity in europe-indicators and trends towards a theory of decoupling: degrees of decoupling in the eu and the case of road traffic in finland between taiwan emission database system (teds . ). taiwan environmental protection administration decomposition and decoupling effects of carbon dioxide emission from highway transportation in taiwan epidemiological features of intestinal infection with entamoeba histolytica in taiwan the survey of fuel price increased on mode choice. 
key: cord- -o mwd d authors: tam, ka-ming; walker, nicholas; moreno, juana title: projected development of covid- in louisiana date: - - journal: nan doi: nan sha: doc_id: cord_uid: o mwd d at the time of writing, louisiana has the third highest covid- infection per capita in the united states. the state government issued a stay-at-home order effective march rd. we analyze the projected spread of covid- in louisiana without including the effects of the stay-at-home order. we predict that a large fraction of the state population would be infected without the mitigation efforts, which would certainly overwhelm the capacity of the louisiana health care system. we further predict the outcomes with different degrees of reduction in the infection rate. more than % of reduction is required to cap the number of infected at under one million. the identification and verification of human-to-human transmission of the coronavirus disease in early january of in wuhan, china triggered the start of a worldwide pandemic. as of april , there are more than .
million confirmed cases and more than , deaths attributed to covid- . the first case in the us was confirmed in washington state on january . the number of reported cases until early march was rather low. the exceedingly slow spreading rate in these early months may be partially due to the lack of adequate testing, which remains a major issue at the time of writing. the number of cases dramatically increased in the usa in early march, with most cases in the states of washington, new york, and california. it was not until march that the first case in louisiana was identified. the growth rate of infections in louisiana has been alarming since the confirmation of the first case. the louisiana state government responded swiftly by closing all k- public schools on march . on march , public gatherings of more than people were prohibited, and bars, bowling alleys, casinos, fitness facilities, and movie theaters were closed. furthermore, a stay-at-home order was issued on march . adequate testing for covid- remains limited in the usa. for this reason, accurately predicting the trajectory of the spread of covid- by relying on the number of confirmed cases alone is a rather questionable approach. while the susceptible-infected-recovered (sir) model may well describe the dynamics of the spread, accurate predictions rely on knowing the number of confirmed cases, which is severely hampered by the limitations of testing. this is particularly significant in the early stages of the spread of the disease, when the percentage of people tested is very small and the spread by asymptomatic infected people is substantial. alternatively, the number of fatalities attributed to covid- may be a more reliable parameter for tracing the dynamics of the virus spread. combining this information with the mortality rate can be a better strategy to predict the number of cases than relying on the confirmed infection count alone.
the goal of this paper is to extract the dynamics of covid- in louisiana from the death count data supplemented with the confirmed cases. we then run several scenarios with different reductions of the infection rate and calculate the number of people infected in each case. we conclude with suggestions to improve the model and, as a consequence, its predictions. our model is based on the susceptible-infected-recovered (sir) model with the modification of including the number of quarantined people (q), as has been considered elsewhere. [ ] [ ] [ ] the equations defining the model are the following: ds/dt = −βsi/n, di/dt = βsi/n − (α + η)i, dq/dt = ηi − γq, dr/dt = αi + γq, where n is the total population size, s is the susceptible population count, i is the unidentified infected population count, q is the number of identified cases, and r includes the number of recovered and dead patients. the model is characterized by the following parameters: β is the infection rate, η is the detection rate, α is the recovery rate of asymptomatic people, and γ includes the recovery rate and the casualty rate of the quarantined patients. this model is equivalent to the standard sir model if we are not interested in differentiating between q and r. we further assume that the rate of increase in the number of casualties is proportional to the number of infected at the early stage of the epidemic, dc/dt = δi, where δ is the mortality rate. this is a good approximation at the beginning of the virus spread, when the number of quarantined patients is a small percentage of the total population. this equation is not combined in any way with eqs. - ; it is only used to estimate the model parameters at the start of the epidemic. we first consider eq. , assuming the susceptible population count is very close to that of the total population, s ∼ n , which is justifiable at the beginning of the epidemic since only a small fraction of the population is infected. with this assumption one can decouple the infected population count from the other parameters to obtain di/dt = (β − (α + η))i. solving eq.
, the casualty count as a function of time can be written as c(t) ∝ exp[(β − (α + η))t]. the exponential growth of the number of fatalities at the beginning of the epidemic should represent the spread of covid- reasonably well, since the mechanisms for slowing the dynamics, such as improved detection and social distancing, are delayed in time. by fitting the available fatalities data (see appendix) between march and to eq. , the parameters of the model can be determined. fig. displays the fit, which provides an estimate of c(t) ≈ . exp [ . t]. the dynamics (exponent) is thus given as β − (α + η) = . . from the value of the exponent we can estimate the doubling time of the casualty count: ln(2)/ . ≈ . days. moreover, the proportionality constant can be used to estimate the initial number of infections i( ) if the mortality rate δ is known. the mortality rate is estimated by combining the accumulated mortality rate data and the median time between infection and death. it is estimated that the median time between infection and the onset of symptoms is about five days, while the median time between the onset of symptoms and death is eight days. [ ] [ ] [ ] [ ] it is worth noting that the distribution of these time periods is close to a log-normal; thus, a more sophisticated analysis should include the effects of the non-self-averaging behavior of the distribution. only the median values are used in the present work. the parameters of eq. are fit to the data, providing an approximation to the number of deaths as a function of time. the accumulated mortality rate is estimated to be . %. notably, the mortality rate does indeed vary by region. this may be due to the rate of testing as well as the capacity of health care facilities. for areas in which health care facilities have been overrun, the death rate would be much higher. notwithstanding these uncertainties, assuming that the health care facilities have not yet been overrun, the mortality rate is estimated to be δ ≈ . + ≈ . .
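the exponential fit described above can be sketched in a few lines of code: the snippet below estimates the prefactor and the exponent β − (α + η) of c(t) = a·exp(rt) by linear regression on the log of a cumulative death series, and converts the exponent into a doubling time. the data series and the growth rate of 0.23 used here are made-up illustrations, not the values fitted in this paper.

```python
import numpy as np

def fit_exponential_growth(days, deaths):
    """Fit C(t) = A * exp(r * t) by linear regression on log(deaths).

    Returns the prefactor A, the growth exponent r = beta - (alpha + eta),
    and the implied doubling time ln(2) / r.
    """
    days = np.asarray(days, dtype=float)
    log_c = np.log(np.asarray(deaths, dtype=float))
    r, log_a = np.polyfit(days, log_c, 1)   # slope, then intercept
    return np.exp(log_a), r, np.log(2) / r

# Hypothetical cumulative death counts generated with r = 0.23 (a doubling
# time of about three days), purely to illustrate the fitting procedure.
t = np.arange(10)
c = 2.0 * np.exp(0.23 * t)
a_hat, r_hat, t_double = fit_exponential_growth(t, c)
```

on noiseless synthetic data the regression recovers the generating exponent exactly; on real death counts the same call returns the least-squares estimate of the growth rate.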
this also provides an estimate of the number of persons who carry the virus but are not detected at day , i( ), which is given as i( ) ≈ . δ × . ≈ . this reveals that even as early as march , the number of infected people is already on the order of hundreds. now we consider the number of confirmed cases at the start of the epidemic, p(t). this is given by the sum of q(t) and r(t) minus the number of persons who recovered without being tested. the rate of change of the number of reported cases can be obtained by combining eqs. and and subtracting αi(t): dp/dt = ηi(t). with i(t) given by eq. , we obtain, by fitting the number of confirmed cases (fig. ), η ≈ × . ≈ . . there remains one parameter to be determined, the recovery rate of asymptomatic people, α. assuming that the average times to recovery and to death are both days, and that half of the infected never show any symptoms and are thus never tested , we can estimate α = . / ≈ . . this is probably an upper bound of the estimate; in reality it could be smaller. this additionally provides the value for β as . . with these parameters, eqs. - can be solved and used to predict the spread of the disease. fig. displays the time evolution of the number of unidentified persons who carry the virus, i(t), the number of persons who are either in quarantine or recovered, q(t) + r(t), and the total number of persons who have ever been infected, q(t) + r(t) + i(t). the number of infected but unidentified, i(t), grows exponentially, as expected from eq. , at the initial stage, and this behavior continues until about day , when around , people are infectious. the exponent of ∼ . suggests the number of cases doubles approximately every three days, which seems to be consistent with the data in many areas of the world before the mitigation efforts kicked in. after day , the rate of increase slows down due to the combination of the decrease in the number of susceptible (uninfected) people and the increase in the number of recoveries.
the number of infected cases ceases to grow exponentially, instead increasing at a roughly constant rate until peaking at around day , corresponding to early may. on the other hand, the number of quarantined and recovered people resembles a logistic function. to compare with other states which already have a widespread epidemic, we use the described method to calculate the infection rate (β), the testing rate (η), and the reproduction number (r = β/(η + α)) of selected states. results are displayed in table i . note that the reproduction number of louisiana is the highest among the states listed in the table. within the present model, there are two major routes to slow the initial exponential growth of the epidemic, which is characterized by the parameter β − (α + η). the first one is to decrease the infection rate, β. the second route is to increase the testing rate, η. increasing the recovery rate of unidentified persons, α, could also reduce the spread, but this is unlikely to be achieved. as the stay-at-home order was issued on march , it is expected that the infection rate should be drastically reduced. we simulate new scenarios with the assumption that social contact is reduced so that the infection rate decreases by %, %, %, %, and % starting at day . fig. : the number of people who are infected and carrying the virus without being identified, i(t), as a function of time, with march as day . we assume the mitigation efforts reduce the infection rate by %, %, %, %, and % from day ( days after the stay-at-home order), and that the sum of the testing rate and the recovery rate of asymptomatic people remains unchanged. the inset is a zoom of the first days. the results are shown in figs. and . we find that there is a substantial drop in the number of active virus carriers even with a % reduction in the infection rate. however, the number of people who will be infected still exceeds one million if the reduction in the infection rate is smaller than %.
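as a sketch of how such mitigation scenarios can be produced, the snippet below integrates the quarantine-extended sir equations and compares a baseline run against a run in which β is cut in half from day zero; all parameter values are illustrative placeholders chosen for this sketch, not the fitted values of this paper.

```python
import numpy as np
from scipy.integrate import odeint

def siqr(y, t, beta, eta, alpha, gamma, n):
    """Right-hand side of the SIR-type model with a quarantined class."""
    s, i, q, r = y
    ds = -beta * s * i / n
    di = beta * s * i / n - (alpha + eta) * i
    dq = eta * i - gamma * q
    dr = alpha * i + gamma * q
    return [ds, di, dq, dr]

# Illustrative values only: beta is the infection rate, eta the detection
# rate, alpha the recovery rate of undetected cases, gamma the removal
# rate of quarantined cases; n is a rough state-population scale.
n = 4.6e6
beta, eta, alpha, gamma = 0.31, 0.04, 0.036, 0.07
t = np.linspace(0, 120, 121)
y0 = [n - 200, 200, 0, 0]           # a few hundred undetected carriers

baseline = odeint(siqr, y0, t, args=(beta, eta, alpha, gamma, n))
# Mitigation scenario: a 50% reduction of the infection rate.
mitigated = odeint(siqr, y0, t, args=(0.5 * beta, eta, alpha, gamma, n))

peak_baseline = baseline[:, 1].max()
peak_mitigated = mitigated[:, 1].max()
```

lowering β lowers and delays the peak of i(t), mirroring the scenario curves described in the text; the total population s + i + q + r is conserved by construction.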
this suggests the importance of strict social distancing measures. perhaps it also suggests the importance of wearing basic protective gear to further reduce the infection rate. there are many uncertainties in this simplified model which can be improved over time as more data become available. improvement can be achieved by including additional factors, such as correlation with different age groups, correlation with the health condition of the population, the availability of public health care, the effect of higher ambient temperature and humidity, and many others. some of those factors are likely beyond the sir model, which implicitly assumes that the population is homogeneous and well mixed, and that infection occurs without time delay. however, given the rather limited data available today, it is not clear that more sophisticated models would provide much better predictions. in spite of the rather simple model being employed in this analysis, it provides a baseline for the spread of covid- in louisiana in the absence of mitigation efforts. the situation is clearly dire, as a very large fraction of the population will get infected, with a peak in the number of infections around early may. with the current mitigation efforts, we expect the infection rate will be greatly reduced. currently, we do not have data to support the effectiveness of current mitigation efforts, as the trend still fits rather well to the initial stage of exponential growth. the main projection from this work is that more than % of reduction in the infection rate is needed to keep the infected count below one million. increasing testing capacity and providing protective gear to further reduce the infection rate seem to be reasonable measures. key: cord- - mdie v authors: valle, denis; albuquerque, pedro; zhao, qing; barberan, albert; fletcher, robert j.
title: extending the latent dirichlet allocation model to presence/absence data: a case study on north american breeding birds and biogeographical shifts expected from climate change date: - - journal: glob chang biol doi: . /gcb. sha: doc_id: cord_uid: mdie v understanding how species composition varies across space and time is fundamental to ecology. while multiple methods have been created to characterize this variation through the identification of groups of species that tend to co‐occur, most of these methods unfortunately are not able to represent gradual variation in species composition. the latent dirichlet allocation (lda) model is a mixed‐membership method that can represent gradual changes in community structure by delineating overlapping groups of species, but its use has been limited because it requires abundance data and requires users to set the number of groups a priori. we substantially extend lda to accommodate widely available presence/absence data and to simultaneously determine the optimal number of groups. using simulated data, we show that this model is able to accurately determine the true number of groups, estimate the underlying parameters, and fit the data. we illustrate this method with data from the north american breeding bird survey (bbs). overall, our model identified main bird groups, revealing striking spatial patterns for each group, many of which were closely associated with temperature and precipitation gradients. furthermore, by comparing the estimated proportion of each group for two time periods ( – and – ), our results indicate that nine (of ) breeding bird groups exhibited an expansion northward and contraction southward of their ranges, revealing subtle but important community‐level biodiversity changes at a continental scale that are consistent with those expected under climate change. our proposed method is likely to find multiple uses in ecology, being a valuable addition to the toolkit of ecologists.
occur in space and time. for example, in a spatial context, these approaches have attempted to identify geographic areas with similar taxa, areas that have been variously called "biogeographical regions" (gonzales-orozco, thornhill, knerr, laffan, & miller, ), "bioregions" (bloomfield et al., ), or "biogeographical modules" (carstensen et al., ). such bioregions have been argued to be important for understanding the role of history on community assemblages (carstensen et al., , ), interpreting ecological dynamics (economo et al., ), and developing broad-scale conservation strategies (vilhena & antonelli, ). the latent dirichlet allocation (lda; not to be confused with linear discriminant analysis) model is a powerful model-based method to decompose species assemblage data into groups of species that tend to co-occur in space and/or time. the benefits of using this model include the ability to adequately represent uncertainty, accommodate missing data, and, perhaps most importantly, to describe sampling units as comprised of multiple groups (i.e., mixed-membership [mm] units) (valle, baiser, woodall, & chazdon, ). conceptually, the ability to describe sampling units as comprised of multiple groups has rarely been considered in previous methods (i.e., prior approaches are typically based on "hard" partitions) but may better honor community dynamics and may better characterize impacts of environmental change. for instance, biome transition zones, ecotones, and habitat edges are locations that are often comprised of a mix of species groups, providing sources for potentially novel species interactions (gosz, ; ries, fletcher, battin, & sisk, ). similarly, climate change is predicted to cause geographic shifts in species and communities, leading to the hypothesis of novel assemblages arising across space as climate and habitat changes (urban et al., ; williams & jackson, ).
in addition, most partitioning methods that delineate biogeographical regions or modules based on hard boundaries can lead to high uncertainty in boundary delineation, an issue that can be rectified by allowing groups to overlap. it is important to note that lda allows for overlapping groups but does not require them to be present (i.e., if the data do not support overlap, no overlap is estimated). it is unfortunate that the lda model, as currently developed, has been restricted to abundance data, which are often not available because accurate quantification of abundance can be very challenging and costly. in the absence of abundance data, researchers often have to rely on presence/absence data to understand species distributions and biodiversity patterns (jones, ; joseph, field, wilcox, & possingham, ). another limitation of the lda model is that the number of groups has to be prespecified, requiring researchers to run lda multiple times and then use some criterion (e.g., aic) to choose the optimal number of groups (e.g., valle et al., ), an approach that can often be computationally costly. in this article, we substantially develop the lda model to be able to fit the much more commonly available presence/absence data and to automatically determine the optimal number of groups. we start by describing our statistical model. then, using simulated data, we show how our method automatically detects the optimal number of groups in the data, reliably estimates the underlying parameters, and better fits the data, outperforming other approaches. lastly, we illustrate the novel insights gained using our method by analyzing a long-term dataset collected on breeding birds in the united states and canada (breeding bird survey [bbs]; pardieck, ziolkowski, lutmerding, campbell, & hudson, ) to determine how environmental variables influence bird assemblages across the continent and how these assemblages are changing through time.
the overall goal of our method is to identify the major patterns of species co-occurrence in the data, each of which we define to be a distinct species group. we adopt the term species group (instead of "bioregion" or other related terms) because these major co-occurrence patterns do not have to have a strong spatial pattern (although they often do), these groups can overlap in space, and the proportion of groups can change through time. more specifically, our method characterizes each sampling unit l in terms of the proportion of the different groups (parameter vector θ l ) and characterizes each group k in terms of the probability of the different species (parameter vector ϕ k ). for example, θ l = [ . , . , . , ] indicates that the second group dominates unit l and that the fourth group is absent. this example also illustrates that a given sampling unit can be comprised of multiple groups, which explains why these types of models are called mixed-membership models. in the same way, ϕ k = [ , . , . ] indicates that species and (but not species ) are important species of group k. note that a given species can have a high probability in more than one group. a more formal description of the statistical model is given below. the data consist of a matrix filled with binary variables x isl (i.e., equal to one if species s was present in observation i and unit l and equal to zero otherwise). notice that we assume that multiple observations might have been made for each species s and unit l, possibly due to temporally repeated measures or because multiple subsamples were measured within each unit l (e.g., a forest plot comprised of four subplots). each of these binary variables has an associated latent group membership status z isl . this variable indicates the group from which species s in sampling unit l during observation i comes.
we assume that each observation x isl , given that species s in unit l during observation i comes from group k (i.e., z isl = k), follows the distribution x isl ∼ bernoulli(ϕ sk ), where ϕ sk is the probability of observing species s if this species came from group k. notice that z isl influences the distribution for x isl by determining the k subscript of the parameter ϕ. next, we assume that the latent variable z isl comes from a multinomial distribution, z isl ∼ multinomial( , θ l ), where θ l is a vector of probabilities that sum to one, and each element θ lk consists of the probability of a species in unit l to have come from group k. in relation to the priors for our parameters, we adopt a conjugate beta prior for ϕ sk , ϕ sk ∼ beta(a, b). throughout this article, we assume vague priors by setting a and b to . building on the work of dunson ( ) and valle et al. ( ), we adopt a truncated stick-breaking prior for θ l . this prior assumes that v lk ∼ beta( , γ) for k = ,…,k− and γ > . we set the parameter for the last group to (i.e., v lk = ). with these parameters, we calculate θ lk using the stick-breaking formula θ lk = v lk ∏ j<k (1 − v lj ). under this prior, θ lk is a priori stochastically exponentially decreasing as long as γ < , and smaller γ tends to enforce greater sparseness (i.e., the existence of fewer groups). for most of the examples in this article, γ was set to . , which we have found to work well for multiple datasets. more details regarding this prior can be found in supporting information appendix s . the benefit of this prior is that, if the data support fewer groups than specified by the user, it will tend to force these superfluous additional groups to be empty or to have very few latent variables z isl assigned to them, as illustrated in the simulation section below. this prior also helps to avoid label switching, a common problem in mixed-membership and mixture models. bayesian markov chain monte carlo (mcmc) algorithms applied to these types of models sometimes mix poorly and can lead to nonsensical results if posterior distributions of parameters are summarized by their averages (stephens, ).
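the truncated stick-breaking construction above can be made concrete with a minimal sketch; the beta(1, γ) parameterization of the fractions is an assumption here (a standard choice consistent with the sparseness behavior just described), since the exact prior parameters are not legible in this copy of the text.

```python
import numpy as np

def stick_breaking_theta(v):
    """Convert stick-breaking fractions v_1, ..., v_{K-1} (v_K = 1 closes
    the stick) into group proportions theta_k = v_k * prod_{j<k}(1 - v_j)."""
    v = np.append(np.asarray(v, dtype=float), 1.0)   # v_K = 1
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * leftover

# Draw the fractions from an (assumed) Beta(1, gamma) prior; smaller gamma
# pushes v toward 1, concentrating mass on the first groups and thereby
# enforcing greater a priori sparseness.
rng = np.random.default_rng(0)
gamma = 0.1
theta = stick_breaking_theta(rng.beta(1.0, gamma, size=9))  # K = 10 groups
```

because the last fraction is fixed at one, the resulting proportions always sum to one exactly, and superfluous trailing groups receive proportions shrunk toward zero.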
the label switching problem refers to the fact that the labels of the different groups can change (e.g., groups and can become groups and , respectively) without changing the likelihood (i.e., the group labels are unidentifiable). our truncated stick-breaking prior helps to avoid the label switching problem by enforcing an ordering of the groups according to their overall proportions. we fit the lda using a gibbs sampler. a more complete description of this model and the derivation of the full conditional distributions used within this gibbs sampler are provided in supporting information appendix s . supporting information appendix s contains a short tutorial describing how to fit the model using the code that we make publicly available, reproducing some of the simulated data results. there are three important points regarding lda that need to be emphasized. first, the proposed model can accommodate negative and positive correlations between species. to illustrate this, assume that there are just two species groups and two species, s and s'. negative correlation between these species is captured by our model if, for example, ϕ s = [ . , . ] and ϕ s' = [ . , . ]. these parameter estimates indicate that, whenever a site has a high proportion of group , species s will have a high probability of occurring, whereas species s' will tend to be absent. in the same way, whenever a site has a high proportion of group , species s' will have a high probability of occurring but species s will tend to be absent, resulting in negative correlation. positive correlation between these species is captured by our model if, for example, ϕ s = [ . , . ] and ϕ s' = [ . , . ]. these parameter estimates imply that, whenever a site has a high proportion of group , both species s and s' will have a high probability of occurring. in the same way, whenever a site has a high proportion of group , both species s and s' will have a high probability of being absent, inducing positive correlation.
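the correlation argument above can be checked by simulation: the snippet below draws presence/absence data for two species under a two-group mixed-membership model in which the species load on different groups, and verifies that pooling sites with different group proportions induces negative correlation. all probabilities are invented for illustration, not estimates from any dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_presence(theta, phi, n_obs):
    """Draw presence/absence for each species at one site.

    theta: group proportions at the site; phi[s, k]: probability that
    species s is present when its latent draw comes from group k.
    """
    n_species = phi.shape[0]
    x = np.empty((n_obs, n_species), dtype=int)
    for i in range(n_obs):
        for s in range(n_species):
            z = rng.choice(len(theta), p=theta)   # latent group of this draw
            x[i, s] = rng.random() < phi[s, z]
    return x

# Species 0 loads on group 0, species 1 on group 1 (illustrative values);
# the two sites differ in which group dominates.
phi_neg = np.array([[0.95, 0.05],
                    [0.05, 0.95]])
sites = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
obs = np.vstack([simulate_presence(th, phi_neg, 500) for th in sites])
corr = np.corrcoef(obs[:, 0], obs[:, 1])[0, 1]
```

with these values the pooled observations show a clearly negative correlation; swapping in phi vectors that load on the same group would flip the sign, as the text argues.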
second, hard clustering methods that group locations with similar species composition (e.g., kreft & jetz, ) correspond in our model to vectors θ l comprised of zeroes except for a single element which is equal to one. in the same way, hard clustering methods that group species that tend to co-occur (e.g., azeria et al., ) from a species composition perspective. this is due to the fact that the probability of observing species s for two locations p and q is (see supporting information appendix s for details). in this scenario, the algorithm might determine that a single species group dominates all locations instead of distinguishing the different species groups. we simulate data to evaluate the performance of the proposed model and to compare its results to those from other clustering methods. to avoid the identifiability problems described above, we generate parameters for all simulations such that each group completely dominates at least one location and that each group has at least one species that is never present in the other groups (ensuring distinct species composition of these groups). we illustrate with simulated data how the truncated stick-breaking prior can identify the optimal number of groups and how our algorithm can retrieve the true parameter values under a wide range of conditions. more specifically, the true number of groups k* was set to and ; the number of sampling units (i.e., locations) was set to and ; the number of species was set to and ; and the number of observations per location was set to . parameters were drawn randomly (i.e., ϕ sk ∼ beta( . , . ) and θ l ∼ dirichlet( . )), and the identifiability assumptions described above were then imposed. we adopted a beta( . , . ) distribution for ϕ sk because this distribution is likely to generate species groups that are more dissimilar in terms of species composition, given that it is a u-shaped symmetric distribution.
we generated datasets for each combination of these settings, totaling datasets. to fit these data, we assume a maximum of groups (k = ) and estimate the true number of groups k* by determining the number of groups that are not superfluous. superfluous groups are defined to be those groups that are very uncommon across the entire region (i.e., θ lk < . for % of the locations, where θ lk is the mean of the posterior distribution). lastly, we test the sensitivity of the modeling results to the prior by fitting these data with γ set to . and . we also compare lda to other methods using simulated data. in these simulations, we assume data are available on species over , locations, with five repeated observations per location. furthermore, , , and groups were used to generate these data. because the goal is to compare inference from different methods, we set the parameters θ lk in such a way that allows for a straightforward visual appraisal of the advantages and limitations of the different methods. on the other hand, the parameters ϕ ks were randomly drawn from beta( . , . ), and subsequently, the assumption regarding groups with distinct species composition was imposed. when fitting lda, we set the maximum number of groups to and rely on the truncated stick-breaking prior with γ = . to uncover the correct number of groups. we compare and contrast inference from our model to that from competing approaches, including traditional hard clustering methods (i.e., hierarchical and k-means clustering) and mixture models (i.e., the region of common profile (rcp) model; foster et al., ; lyons, foster, & keith, ). to determine whether these breeding bird groups have been shifting their spatial distribution over time, we divided our study period into two -year periods: - and - . each route × period combination resulted in a distinct "sampling unit" (i.e., distinct row in our data matrix), and data from individual years within each time period were treated as repeated observations.
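the pruning rule for superfluous groups can be sketched as follows; the cutoffs eps and frac below are illustrative stand-ins for the paper's thresholds, whose digits are not legible in this copy.

```python
import numpy as np

def count_effective_groups(theta_hat, eps=0.05, frac=0.95):
    """Count groups that are not superfluous.

    A group is deemed superfluous when its posterior-mean proportion falls
    below `eps` in at least `frac` of the locations (both thresholds are
    hypothetical choices for this sketch).
    theta_hat: (n_locations, K) matrix of posterior-mean proportions.
    """
    superfluous = (theta_hat < eps).mean(axis=0) >= frac
    return int((~superfluous).sum())

# Toy posterior means over 4 locations and 5 groups: the last two groups
# are nearly empty everywhere and should be pruned.
theta_hat = np.array([
    [0.60, 0.30, 0.08, 0.01, 0.01],
    [0.50, 0.40, 0.08, 0.01, 0.01],
    [0.20, 0.70, 0.08, 0.01, 0.01],
    [0.10, 0.80, 0.08, 0.01, 0.01],
])
k_star = count_effective_groups(theta_hat)
```

on this toy matrix the rule keeps the two dominant groups plus the small-but-ubiquitous third group and discards the two near-empty ones, estimating k* = 3.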
to relate the spatial distribution of the identified groups to potential environmental drivers, we relied on freely available precipitation and temperature data from worldclim version (available at http://worldclim.org/version ) (fick & hijmans, ). these data consist of the -year average climate information (from to ) for the month of june, covering the entire world. in an era of global change, an important feature of our method is that it is able to detect relatively subtle temporal changes in species composition. more specifically, we assessed whether group ranges had expanded north and contracted south. these are the patterns we a priori expected given warming temperatures and the strong influence of temperature on the spatial distribution of a range of taxonomic groups, including birds (chen, hill, ohlemuller, roy, & thomas, ; hitch & leberg, ; moritz et al., ; parmesan & yohe, ). to detect these patterns, we fit the model once to data from both time periods (instead of fitting the model separately for each time period). we set the maximum number of groups to for our case study. to interpolate the estimates of the proportion of different groups to unsampled areas, we relied on the inverse distance weighted (idw) algorithm implemented in the package "gstat" (graler, pebesma, & heuvelink, ; pebesma, ). interpolations were restricted to locations within one degree of a bbs route. finally, our algorithm was programmed using a combination of c++ (through the rcpp package; eddelbuettel & francois, ) and r code (r core team). we provide a tutorial in the supporting information appendix s for fitting this model. despite assuming the potential existence of a much higher number of groups (k = ), our results reveal that the proposed model was generally able to estimate well the true number of groups (boxplots in figure ), except for datasets with few species and locations but many groups (i.e., locations, species, and groups; figure f).
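the idw interpolation mentioned above weights each sampled route by the inverse of its distance to the query point; the minimal stand-in below illustrates only the weighting rule, not the gstat implementation the paper actually uses (the power parameter and coordinates are placeholders).

```python
import numpy as np

def idw(xy_known, values, xy_query, power=2.0):
    """Inverse distance weighted interpolation (a toy stand-in for gstat)."""
    xy_known = np.asarray(xy_known, dtype=float)
    values = np.asarray(values, dtype=float)
    out = np.empty(len(xy_query))
    for i, q in enumerate(np.asarray(xy_query, dtype=float)):
        d = np.linalg.norm(xy_known - q, axis=1)
        if np.any(d == 0):                 # query coincides with a sample
            out[i] = values[np.argmin(d)]
        else:
            w = 1.0 / d ** power           # closer samples get more weight
            out[i] = np.dot(w, values) / w.sum()
    return out

# Interpolate a group proportion between two sampled routes: the midpoint
# averages the two values; an exact hit returns the sampled value.
known = [(0.0, 0.0), (1.0, 0.0)]
vals = [0.2, 0.8]
est = idw(known, vals, [(0.5, 0.0), (0.0, 0.0)])
```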
we also find a good correspondence between the true and estimated parameter values for most of the scenarios explored (scatterplots in figure ), with a slightly worse performance for data with few species but many groups (i.e., species and groups; figure g,h). taken together, these results suggest that, when the ratio of the number of species to the number of groups is small, there is likely to be less distinction between groups from a species composition perspective, making it a harder task to untangle these groups. finally, in relation to the prior, we find that our results are broadly similar for γ = . and γ = . the main difference is that parameter estimates tended to be slightly worse when the true number of groups is and γ = , and when the true number of groups is and γ = . (results not shown), agreeing with our expectations. because smaller γ values induce greater sparseness, parameters are better estimated with γ = . when simulations are based on sparse assumptions (i.e., simulations with three groups) versus when this is not true (i.e., simulations with groups). our results reveal that the algorithm accurately estimates the proportion of the different groups in each location, regardless of whether mm units are present or not (leftmost and rightmost panels, respectively, in figure ). these results corroborate the observation that lda encompasses hard clustering of sites and/or species as special cases. on the other hand, figure clearly reveals that hard clustering methods cannot represent these mm locations (k-means and hierarchical clustering [hc] panels in figure ). mixture model approaches such as rcp are sometimes perceived to be able to represent these gradual changes in the proportion of groups. figure : the latent dirichlet allocation (lda) estimates well the true number of groups (boxplots) and the θ lk parameter values (scatterplots).
results from all datasets in each simulation setting are displayed simultaneously, based on lda with γ set to . . top and bottom panels display results for three and groups, respectively. boxplots in panels (a) and (f) show the estimated number of groups (i.e., the number of groups deemed not to be superfluous), revealing that lda can estimate well the true number of groups (k*) except for datasets with few locations (l), few species (s) but many groups (i.e., l s k*). scatterplots (panels b-e and g-j) reveal that the θ lk parameters can also be well estimated but there is considerable noise for datasets with few species but many groups (panels g and h). a : line and a linear regression line were added for reference (blue and red lines, respectively) [colour figure can be viewed at wileyonlinelibrary.com] our results reveal that, when applied to our simulated data, rcp tended to give transition regions that were too narrow (rcp panels in figure ). these model comparison results are particularly striking given that lda was fitted assuming potential groups, whereas results for the other methods were based on the assumption that the true number of groups was known. notice that these figures illustrate how lda can capture gradual changes in species composition associated with global change phenomena depending on what is being represented in the x-axis. for instance, the x-axis can represent a spatial gradient of anthropogenic forest disturbance (e.g., timber logging intensity or distance to forest edge) or can represent time (i.e., the same location sampled repeatedly through time, perhaps revealing the impact of climate change on species composition). recall that the simulated data were generated with , , and groups, but that the maximum number of groups when fitting lda was set to . 
our results suggest that the truncated stick-breaking prior was able to correctly estimate the underlying true number of groups (boxplots in figure), given that the estimated θ_lk's were shrunk toward zero for the superfluous groups (red boxes in figure). we also find that all the other alternative methods required a much greater number of groups to fit the data as well as lda when mm locations are present (line graphs in figure). these results reveal that lda achieves a much sparser representation of the data (based on the number of groups) without losing the ability to represent the inherent variability in the data. although these results are expected, given the larger number of parameters in lda, the ability to fit the data well with fewer groups is highly desirable from the user's perspective, as the primary role of these methods is to reduce the dimensionality of biodiversity data. it is important to note that, even in the absence of mm sampling units, lda can still estimate well the true number of groups and has a similar fit to the data as the other clustering approaches (results not shown). overall, we identified main breeding bird groups (of a maximum of ) after eliminating groups that were very uncommon throughout the region (defined here as those for which θ_lk was smaller than . for % of the locations, where θ_lk denotes the posterior mean). an important test for any unsupervised method is whether it is able to retrieve patterns that are widely acknowledged to exist by experts. using the estimated group proportion for each location for the - period, we find striking spatial patterns (maps in figure). importantly, these spatial patterns generally agree well with other maps of bird communities (e.g., bird conservation ) (figure a).
we find that the species that best represent each group accord with known habitat associations: group identifies species associated with desert environments (e.g., cactus wren and ash-throated flycatcher), while group identifies a mixture of short-grass prairie birds (e.g., dickcissel) and species associated with open country environments with scattered trees and shrubs (e.g., eastern phoebe). besides these biogeographical patterns, we also highlight the ability of our algorithm to depict how environmental gradients are linked to the proportion of each group. for instance, we display how the main east coast groups (groups , , , , , and ) are distributed along the east coast. (c) displays the spatial pattern of groups and .

figure caption: the extended latent dirichlet allocation (lda) method identifies the true number of groups (left panels) and fits the data better than other clustering approaches for data with mm locations (right panels). results are shown separately for simulated data with , , and groups (top to bottom). boxplots depict the estimated proportion θ_lk of each group k for all locations l = ,…,l. these boxplots emphasize how θ_lk for the irrelevant extra groups (red boxes) are shrunk to zero for all locations. line graphs show the log likelihood, a measure of model fit for which larger values indicate better fit. these graphs reveal how other clustering approaches require a much greater number of groups to fit the data as well as lda with fewer groups. model fit results for lda correspond to the posterior mean of the log likelihood. lda results are shown with a single symbol because, differently from the other methods that were fitted multiple times with different numbers of groups, lda was fitted just once using a maximum of groups and the true number of groups was estimated (see corresponding boxplots). details regarding how the log likelihood was calculated for the different methods are provided in the supporting information.
in both (a) and (c), a higher proportion of individual groups is depicted using more opaque (i.e., less transparent) colors, and different groups are depicted with different colors. (b, d) reveal that average june temperature and precipitation gradients seem to strongly constrain the spatial distribution of these breeding bird groups, respectively. circles represent the estimated proportion for each location and group while lines depict suitability envelopes. these envelopes were created by first defining equally spaced intervals on the x-axis, then calculating the median x value and the % percentile of y within each interval, and connecting these results. notice that the same color scheme is used for right and left panels.

the latent dirichlet allocation (lda) model is a useful model for ecologists because it can more faithfully represent community dynamics and the impact of environmental change through the estimation of mixed-membership sites (valle et al., ). the standard lda requires abundance data but, for many taxa, reliably estimating abundance is often very hard and costly (ashelford, chuzhanova, fry, jones, & weightman, ; joseph et al., ; kembel, wu, eisen, & green, ; royle, ; schloss, gevers, & westcott, ). for these reasons, presence/absence data are typically much more ubiquitous than abundance data, often enabling analysis at

figure caption: species groups with a statistically significant association between latitude and change in group proportion.

using the breeding bird survey (bbs) dataset as a case study, we have shown how our method is able to uncover striking spatial and temporal patterns in bird groups. for example, we illustrate how these groups gradually change along a temperature gradient in the east coast and a precipitation gradient in texas. it has long been known that many bird species have strong relationships with abiotic gradients (bowen, ), but how these gradients can explain entire groups of species has remained elusive.
furthermore, we find subtle but pervasive changes in bird group proportions, changes that follow the expected patterns based on climate change (e.g., parmesan & yohe, ). half of the species groups (nine of ) have expanded their northern range and contracted their southern range. this pattern is consistent with species-specific models of changes in bird distribution with climate change in the united states (e.g., hitch & leberg, ; la sorte & thompson, ). our results expand on these findings by illustrating how entire groups are shifting their spatial distribution. nevertheless, a more formal test that accounts for the multiple factors that influence the spatial distribution of birds will be required to ultimately confirm whether climate change is driving the spatial distribution shifts that we have detected. an important limitation of the method that we have presented is that the identified groups do not change over time, even though their spatial distribution may vary. in other words, θ_lk may change with time but ϕ_ks does not. this is particularly relevant in the context of climate change, where it is possible that the species composition of the groups themselves might be changing (lurgi, lopez, & montoya, ; stralberg et al., ; urban et al., ). another important limitation in this study is that the proposed model does not take into account imperfect detection, a pervasive issue for wildlife sampling (mackenzie et al., ; royle, ). this shortcoming can be partially attributed to inherent limitations in the bbs dataset, given that the estimation of detection probabilities requires very specific data types (e.g., repeated visits in occupancy models). it is also critical to highlight the importance of repeated observations per location given the relatively low information content in binary presence/absence data.
determining all the parameters in the proposed model, including the optimal number of groups, can be challenging in the absence of these repeated observations. finally, although important broad-scale patterns can be identified and novel insights gained from post hoc analysis of lda model parameters, as illustrated with our case study, these results rely on a two-stage analysis that does not take into account uncertainty in the estimated parameters. our ongoing work is focused on extending lda to accommodate covariates through regression models built into lda, so that uncertainty can be coherently propagated when performing more formal statistical tests and when making spatial and temporal predictions. community ecologists have traditionally relied on fitting clustering models with different numbers of clusters and choosing the optimal number of clusters using metrics such as aic and bic (fraley & raftery, ; xu & wunsch, ). using simulated data, we have shown how the truncated stick-breaking prior can aid the determination of the true number of groups. we acknowledge, however, that the modeler still has to specify the hyperparameter γ and the maximum number of groups k. using simulated data, we have found that setting γ to . often works well and that our model often identifies k groups if the true number of groups is equal to or larger than k. while this may be seen as an indication that k has to be increased when using real data, an extremely large number of groups defeats the purpose of dimension reduction, making it increasingly harder to visualize and interpret model outputs. ultimately, we believe that the decision regarding the maximum number of groups k is a balance between what the data suggest and pragmatic considerations regarding how the results will be displayed and interpreted. our empirical example focused on large-scale biogeographical patterns.
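the behavior of the truncated stick-breaking prior discussed above can be sketched as follows. this is an illustrative draw of group weights only, not the authors' gibbs sampler; the Beta(1, γ) stick fractions and the truncation at a maximum of k_max groups are the standard construction, and smaller γ concentrates mass on fewer groups, which is why superfluous groups are shrunk toward zero:

```python
import random

def truncated_stick_breaking(k_max, gamma, rng=random):
    """Draw k_max group weights from a truncated stick-breaking process.
    Each step breaks off a Beta(1, gamma) fraction of the remaining
    stick; the last group takes whatever is left."""
    weights, remaining = [], 1.0
    for _ in range(k_max - 1):
        v = rng.betavariate(1.0, gamma)  # stick fraction in (0, 1)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # leftover stick for the final group
    return weights
```

with a small γ (e.g. 0.1), most of the returned weight typically sits in the first few groups, mirroring the shrinkage of the θ_lk's for superfluous groups.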
nevertheless, this method could also be applied in a landscape-scale context, identifying spatial variation in community structure within general habitat types and across patches, or to analyze long-term temporal changes in time-series data of species composition (e.g., christensen, harris, & ernest, ). given the ubiquity of presence/absence data in community ecology, we believe that the extension of the latent dirichlet allocation model developed here will see much wider use, becoming an important addition to the toolkit of community ecologists.

we thank the numerous comments provided by ben baiser, daijiang

http://orcid.org/ - - -
robert j. fletcher jr. http://orcid.org/ - - -

references:
- new screening software shows that most recent large s rrna gene clone libraries contain chimeras. applied and environmental microbiology
- using null model analysis of species co-occurrences to deconstruct biodiversity patterns and select indicator species. diversity and distributions
- a comparison of network and clustering methods to detect biogeographical regions
- african bird distribution in relation to temperature and rainfall
- care and feeding of topic models: problems, diagnostics, and improvements
- biogeographical modules and island roles: a comparison of wallacea and the west indies
- the functional biogeography of species: biogeographical species roles in wallacea and the west indies
- rapid range shifts of species associated with high levels of climate warming
- long-term community change through multiple rapid transitions in a desert rodent community
- nonparametric bayes applications to biostatistics
- breaking out of biogeographical modules: range expansion and taxon cycles in the hyperdiverse ant genus pheidole
- rcpp: seamless r and c++ integration
- worldclim : new -km spatial resolution climate surfaces for global land areas
- ecological grouping of survey sites when sampling artefacts are present
- model-based methods of classification: using the mclust software in chemometrics
- biogeographical regions and phytogeography of the eucalypts
- ecotone hierarchies
- spatio-temporal interpolation using gstat
- breeding distributions of north american bird species moving north as a result of climate change
- monitoring species abundance and distribution at the landscape scale
- presence-absence versus abundance data for monitoring threatened species
- incorporating s gene copy number information improves estimates of microbial diversity and abundance
- a framework for delineating biogeographical regions based on species distributions
- poleward shifts in winter ranges of north american birds
- numerical ecology
- novel communities from climate change
- simultaneous vegetation classification and mapping at large spatial scales
- estimating site occupancy rates when detection probabilities are less than one
- impact of a century of climate change on small-mammal communities in yosemite national park
- north american breeding bird survey dataset
- a globally coherent fingerprint of climate change impacts across natural systems
- multivariable geostatistics in s: the gstat package
- r: a language and environment for statistical computing
- ecological responses to habitat edges: mechanisms, models, and variability explained
- n-mixture models for estimating population size from spatially replicated counts
- reducing the effects of pcr amplification and sequencing artifacts on s rrna-based studies
- dealing with label switching in mixture models
- re-shuffling of species with climate disruption: a no-analog future for california birds
- improving the forecast for biodiversity under climate change
- decomposing biodiversity data using the latent dirichlet allocation model, a probabilistic multivariate statistical method
- individual movement strategies revealed through novel clustering of emergent movement patterns
- a network approach for identifying and delimiting biogeographical regions
- novel climates, no-analog communities, and ecological surprises
- survey of clustering algorithms
- extending the latent dirichlet
allocation model to presence/absence data: a case study on north american breeding birds and biogeographical shifts expected from climate change

key: cord- -hddxaatp authors: howard, daniel title: genetic programming visitation scheduling solution can deliver a less austere covid- pandemic population lockdown date: - - journal: nan doi: nan sha: doc_id: cord_uid: hddxaatp

a computational methodology is introduced to minimize infection opportunities for people suffering some degree of lockdown in response to a pandemic, such as the covid- pandemic. persons use their mobile phone or computational device to request trips to places of their need or interest, indicating a rough time of day ('morning', 'afternoon', 'night' or 'any time') when they would like to undertake these outings, as well as the desired place to visit. an artificial intelligence methodology, a variant of genetic programming, studies all requests and responds with specific time allocations for such visits that minimize the overall risks of infection, hospitalization and death. a number of alternatives for this computation are presented, together with results of numerical experiments involving over people of various ages and background health levels in over visits that take place over three consecutive days. a novel partial infection model is introduced to discuss these proof of concept solutions, which are compared to round robin uninformed time scheduling for visits to places. the computations indicate vast improvements with far fewer dead and hospitalized. these augur well for a more realistic study using accurate infection models, with a view to test deployment in the real world. the input that drives the infection model is the degree of infection by taxonomic class, such as the information that may arise from population testing for covid- or, alternatively, any contamination model. the taxonomy class assumed in the computations is the likely level of infection by age group.
the quaranta giorni, or forty-day isolation by the venetians, as the name implies, was a measure applied to incoming ships [ ] which evolved into containment practices to handle recurrent epidemics. at the time of writing, owing to ubiquitous world travel, covid- 'quarantines' or lockdowns keep millions of people around the world confined mostly to the home for months before an 'easing of measures' gradually re-opens society. the lockdowns in response to the covid- pandemic are enforced in different ways. argentine and spanish lockdowns are strictly policed, with citizens required to make written application for outings. in contrast, in north-western europe and the united states, lockdowns are not as strict, with denmark and the united kingdom entrusting their citizens not to infringe lockdown rules, and liberal sweden choosing not to lock down in an official sense but instead practicing a small number of restrictive measures. the manner of lockdown and how to exit a lockdown are problems that are short of informed solutions. it is useful to discuss the generic lockdown problem as requiring an innovation capable of overcoming the following trade-off: generally, the longer and more extensive a lockdown is, the more effective becomes society's ability to prepare hospital facilities and control the spread of the disease, but the greater become important negative factors: loss of personal freedoms; damage to the economy; poorer personal psychology; social unrest; abusive relations; undetected crime; higher incidence of other diseases because people are too scared to visit the emergency room or doctor; and negative effects on care of both the vulnerable and the elderly. the idea is to enable lockdown whilst minimizing many of its negative consequences, e.g., to personal psychology, economics and health. indeed, it would be nice if life could continue as normal while in lockdown, a seemingly contradictory statement.
inspiration comes from a crude approach taken by governments to ensure that those who venture outdoors are fewer in number. panama [ ] used the last number of an identity document to assign two-hour time slots to venture away from the home for essentials and, as spain eased measures, it allowed certain age groups to go out at different times of the day. the open literature also explores changing a general lockdown to a number of partial ones [ ]. the solution presented here is for citizens in lockdown to enter into a smart phone, handset, tablet or computer a schedule of the places that they wish to visit on that day, or future dates, together with a rough idea of the part of the day they would prefer for such outings to take place. a method of optimization, in this proof of concept a genetic programming [ ] method, takes these requests and simulates the outings by means of an infection model, to discover a nearly optimal allocation of precise time slots for visits that reduce the likely hospitalization and death numbers. it then communicates the time allocations to citizens on their devices such that they will carry out the journeys more safely than at any times of their choice. the solution is proactive. specifically, it is useful to discuss the problem within the implementation of this proof of concept. a data file captures all requests as shown on the left side of figure. the visit requests on each of three consecutive days can be many, and each is denoted by a three-symbol key. requests are separated by the colon symbol. the first letter concerns a broad time of day requested for the visit, limited to: m wishes for the visit to take place during 'morning' hours; p wishes for the visit to take place during 'afternoon' post meridiem hours; n wishes for the visit to take place during 'night' hours; a anytime: does not mind at what hour on this day.
and the second letter is the desired place of visit, limited here to six types of establishment: f may represent a 'supermarket' selling food; c may represent a sports 'club'; p could represent a 'park'; d may stand for 'doctor', a doctor's surgery centre; r may stand for 'restaurant'; s could represent a 'social' establishment. the third symbol can be or , because there are only two of each type of establishment, or a total of establishments available for visits. for example, one such request is from person id who wants to go to doctor's surgery at any time of day. the day is divided into eight two-hour slots. after analysis of all requests, by working with an infection model, the computer-generated optimization minimizes the total risk of covid- infection, hospitalization and death. it allocates the time slots to the requests and the solution is communicated back to citizens. in this case person id is allocated a time slot within the constraint that they imposed, as shown on the right side of figure. there are other data fields, such as age and health on the left side of the figure, that are discussed further in this text. the proof of concept simulations cover three consecutive days, denoted monday, tuesday and wednesday, involving visits as requested by people. a day consists of eight two-hour visitation slot periods available to schedule the visits. details of the data, for the purpose of approximate reproduction of results, are as in table. the data is an entire fabrication but common sense governed its choices: older people carry out relatively fewer outings than the young, and are in a poorer state of health. the degree of health of a person is a number that ranges from to . in any future real-world application of this research, the general health of a person, which is a measure of their immune system response to the pandemic and assumed to abate the probability of hospitalization or death, could be gathered from patient records.
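the three-symbol request keys described above can be parsed as in the following sketch. the period and place letters are from the text; the example keys "mf1" and "ad2" and the establishment numbering 1/2 are illustrative assumptions, since the exact digits are not preserved in the extracted text:

```python
# Period and place letters per the three-symbol key described in the text.
PERIODS = {"m": "morning", "p": "afternoon", "n": "night", "a": "any"}
PLACES = {"f": "food", "c": "club", "p": "park",
          "d": "doctor", "r": "restaurant", "s": "social"}

def parse_requests(line):
    """Turn a colon-separated request string such as 'mf1:ad2'
    into a list of (period, place, establishment) tuples."""
    out = []
    for key in line.split(":"):
        key = key.strip()
        period, place, number = key[0], key[1], int(key[2])
        out.append((PERIODS[period], PLACES[place], number))
    return out
```

for example, `parse_requests("ad2")` would describe a request to visit doctor's surgery 2 at any time of day.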
as seen in later sections, this level of health combines with infection and plays a pivotal role in determining which solutions are better than others. the idea is to reduce hospitalizations and fatalities. the solution must simulate the infection process cumulatively and longitudinally in time, as people go from place to place, to optimize the allocation of time slots. this requires as a component a model of covid- infection. the proof of concept develops a model of person-to-person transmission based on simple probabilities of meetings between people in a confined location; it is necessarily a very basic model but one that illustrates the potential of the solution. two models are presented: a partial infection probability developed for this work and a simple standard probability model. both are simple and would need to be improved by epidemiologists for real-world application. the idea of a 'partially infected' individual is developed here because it will only be possible in some average sense to know who might be infected. the idea is to assign a 'partial infection level' to all members of a taxonomy class of interest, and then simulate how they will infect others and further infect each other. if we knew precisely who was infected they could be isolated and the solution becomes unnecessary.

table : age; total number of people.

if an infected individual i is in close proximity to a susceptible individual s then the possibility exists of transmission of the disease from i to s. without loss of generality the assumption is made that every such encounter will result in infection transmission. an infection probability based on a count of such encounters between s and one or more i can be expressed as p_n, where n is the number of infected who may come into contact with a susceptible in a fixed interval of time that denotes the duration of a visit, e.g., an hour or two.
more sophisticated relations should consider this time dimension and may model it with poisson distributions, but such complications are ignored for the purpose of this presentation. if the location of the encounters is, for example, a store of a certain physical size, and s is the number of sublocations of it that are available to visitors and together comprise the walkable area of that store (sub-units of area small enough that people may position themselves and come into close proximity of each other), then a simple count of probability for such encounters between s and one or more i leads to: p_n = 1 − ((s − 1)/s)^n. a property of this relation is convergence to one as n grows large: as n → ∞, p_∞ → 1, meaning that a non-infected person will surely come into contact with one or more infected at that store. moreover, p_{2n} < 2 p_n, such that doubling i increases the probability of infection for s but never doubles it. in driving a contact-based infection model, simulations must assume that a certain number of persons are infected a priori at the start of the simulation. this presents a challenge because it requires assumptions about who is infected and why, and also a huge number of stochastic computations and an averaging of results. as each person is different and not a clear member of a taxonomy class (for example, people may be of the same age but of different health levels granting them less resistance to the infection), the computations would necessitate carefully balancing assumptions about who should be assumed to be infected to drive the computations.
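the encounter-count probability above can be sketched directly (`s` and `n` are the symbols from the text: sublocations in the store and number of infected visitors):

```python
def encounter_infection_probability(s, n):
    """Probability that a susceptible shares a sublocation with at least
    one of n infected visitors in a store of s sublocations:
    p_n = 1 - ((s - 1) / s) ** n."""
    return 1.0 - ((s - 1) / s) ** n
```

the two properties stated in the text follow numerically: p_n approaches one as n grows, and doubling n increases but never doubles the probability.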
the partial infection model that is presented, while not being itself a perfect solution, does not have this onerous requirement (the idea will not be unfamiliar to mathematical biologists and epidemiologists [ ]) and simplifies the computational effort for this proof of concept study, because it can take an idea about how infected members of the population are, assigning a probability of infection to all members of that taxonomy grouping in order to drive the infection simulation. the concept of a partially infected individual is a modelling tool. each person p_j is represented by a vector of size two, p_j = (s_j, i_j) with s_j + i_j = 1. for example, a person with id label j who is forty percent infected is represented by s_j = 0.6 and i_j = 0.4, and another with id label j who is one percent infected by s_j = 0.99 and i_j = 0.01. consider the meeting at the store of n_p persons, of which n have some degree of partial infection. the resulting infection pressure, pi_n, that is, the resultant partial infection probability of the encounter brought about by the infection contributions of the n partially or fully infected persons, can be obtained for two cases. s_max is the maximum s_j out of all n persons. with the exception of the owner of s_max, each i_k of the other partially or fully infected persons is multiplied by s_max and by a multiplicative numerical constant g_j, soon to be discussed. these product terms are summed to obtain the probability of infection for the encounter. when n_p = n, the number of such product terms is therefore n − 1. however, when n_p > n, s_max is considered to be from one of the fully susceptible persons in n_p and thus s_max = 1; now the number of such product terms in the sum is n. a formula to compute pi_n given n_p = n, with every participant infected or partially infected, is developed as a sum of the n − 1 terms g_j s_max i_j, and for the case n_p > n with s_max = 1 as a sum of the n terms g_j i_j. the constants g_j emanate from simple overlap counts in probability trees as in appendix b
and this author's earlier technical communication [ ]. deliberately, the products are arranged or ordered so that the largest g_j corresponds to the largest i_j. this represents the worst infection case, which subsumes all others. once the probability of infection pi_n is calculated, the partial infection of all participants in n_p is updated for encounter v in readiness for the next visit v + 1, each person's i_j increasing and s_j decreasing so that s_j + i_j = 1 still holds. as an example, assume s = and n_p = n.

the infection probability again follows a simple 'probability count' philosophy similar to that of the previous section, but is obtained with a monte carlo process that generates the probabilities. each evaluation involves q trials for one fully susceptible s and twenty fully infected i persons, and it counts the times when the susceptible is co-located with any infected. for example, it does this on separate occasions, i.e., once these forty trials are computed for the one susceptible and the twenty infected, with results as in figure. the procedure checks to see whether a susceptible and one or more infected shared an encounter from those forty (q = sampling). it is, however, fashioned in a more elaborate way to try to account for differing times spent in different sub-areas. the random sampling is made to fit into seconds of time (two hours), but the random number indicating the location is chosen to weigh certain locations more. it crudely captures that some areas of the store or establishment are more popular than others, for example, browsing a magazine rather than simply walking through some area of the store. random numbers are clustered to represent three types of areas: one where the person spends seconds, another where the person spends seconds, and yet another where the person spends seconds in the selected area. the procedure is outlined in appendix a.
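a minimal sketch of one partial-infection encounter, under stated assumptions: the overlap constants g_j are placeholders set to 1, since their appendix-b derivation is not reproduced in the text, and the post-encounter update rule shown (new i_j = i_j + s_j · pi_n) is an illustrative assumption that preserves s_j + i_j = 1, not the authors' exact formula:

```python
def infection_pressure(infected_levels, g=None):
    """pi_n for an encounter containing at least one fully susceptible
    person (s_max = 1): the sum of g_j * i_j over the n partially or
    fully infected participants, capped at 1. The largest g_j is paired
    with the largest i_j, the worst case described in the text."""
    if g is None:
        g = [1.0] * len(infected_levels)  # placeholder constants
    pairs = zip(sorted(g, reverse=True),
                sorted(infected_levels, reverse=True))
    return min(1.0, sum(gj * ij for gj, ij in pairs))

def update(person, pi_n):
    """Move a (s_j, i_j) pair toward infection after an encounter with
    pressure pi_n, keeping s_j + i_j = 1 (assumed update rule)."""
    s, i = person
    new_i = i + s * pi_n
    return (1.0 - new_i, new_i)
```

applied visit after visit, this lets the simulation carry each person's partial infection forward cumulatively, as the scheduling optimization requires.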
the constants in the figure are calculated prior to the run and give the probability of a susceptible getting infected depending on how many people are infected in the store. all visits are of the same duration, so no attempt is made to model the probability of infection through time with a poisson process. this model of infection transmission is once again a very simple one. when n > the probability of infection is assumed to be equal to one. note that a lower or higher rate of infection can be obtained by changing the sampling from to a smaller or larger number. at any meeting place and time, three types of person can participate in the encounter: i or infected; s or susceptible; and r or immune (recovered). no action is taken if fewer than two people participate (n_p < 2). no action is taken if all are immune, if all are infected, if all are susceptible, if there are no infected, or if there are no susceptible. otherwise the aforementioned infection probability is selected for the number of infected, and that real number is multiplied by the number of susceptible and then truncated to obtain the integer number that will become infected. in no particular order, that many susceptible are labelled as infected. the method adopted to optimize the allocation of requested visits is a genetic programming (gp) scheme developed by this author to discover a set of precise numerical constants that serve as coefficients of a polynomial representing the direct solution of the one-dimensional homogeneous convection diffusion equation [ ]. this and other publications [ ] [ ] demonstrate that standard genetic programming trees are capable of computing very precise real numbers when and if needed. when gp trees are evaluated they deliver a vector of real numbers of self-determined variable length. table lists the functions and terminals that comprise the gp tree.
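the monte carlo co-location count described above can be sketched as follows, under simplifying assumptions: a store of `s` equally likely sublocations, `q` independent placements per visit, and infection declared when the susceptible ever shares a sublocation with any infected (the dwell-time weighting of popular areas from the text is omitted):

```python
import random

def mc_infection_probability(s, n_infected, q, trials, rng):
    """Estimate the probability that one susceptible is co-located with
    at least one of n_infected people during a visit, where each of q
    placements drops everyone into a random one of s sublocations."""
    hits = 0
    for _ in range(trials):
        infected_once = False
        for _ in range(q):
            susceptible_pos = rng.randrange(s)
            if any(rng.randrange(s) == susceptible_pos
                   for _ in range(n_infected)):
                infected_once = True
                break  # one shared sublocation is enough for this visit
        hits += infected_once
    return hits / trials
```

such estimates, computed prior to the run for each possible count of infected, play the role of the pre-tabulated constants mentioned above.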
they manipulate two pointers, p r and p z , to the output variable length vector of numbers. they also make use of two working memories, two real numbers, m and m . terminals are small and large numbers that become arithmetically manipulated. certain functions write to the variable length result vector, shrink it or expand it as the gp tree evaluates. the result of evaluating a gp tree is a variable length vector of real numbers. these numbers can be of any size and can be positive or negative. a subroutine then operates on these numbers to bound them as positive real numbers in size between . and . [ ] . for example, if a number in the vector is - . it becomes . , and if a number is . it becomes . . how is this vector of real numbers used? consider the visitation requests of table . the real numbers vector is probably much smaller. it is consulted from left to right, as a child reads a word letter by letter. consider that the outing request by a person is ad , e.g., the person in figure wishing to go to the doctor's surgery on monday at any time of the day. the day is divided into eight two-hour slots of time. as each number ranges from . to . , consider that this might be . . this number would indicate prescription of the third slot of time since / < . < / . with the allocation for that visit complete, the next real number in the variable length vector gets consulted to deal with the next visit request.
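the slot prescription rule — a number x in (0, 1] selects slot k when (k−1)/8 < x ≤ k/8 — can be written as below; the boundary convention and function name are assumptions consistent with the description.

```python
import math

def slot_for(x, n_slots=8):
    """Map a real number in (0, 1] to a 1-based time slot index.

    The day is split into n_slots equal intervals; x falls in slot k
    when (k-1)/n_slots < x <= k/n_slots, so for example a value of
    0.3 prescribes slot 3 of 8 because 2/8 < 0.3 <= 3/8."""
    return min(n_slots, math.ceil(x * n_slots))
```

the `min` guard only matters at the upper boundary, where x = 1.0 maps to the last slot.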
as the variable length vector of numbers is usually shorter than the total number of visits requested, when the last element of the vector is reached it cycles back to the first number in the vector and continues until allocations for all visitations are dealt with, producing allocations of time as in figure . table : constituent functions or building blocks of gp trees in [ ] are of four varieties: number, arithmetic, record, and memory, with the number of arguments shown in brackets: if ( ), one is l or 'left' input and the other r or 'right' input (corresponding to the left and right branches of the subtrees below the node), and ( ) is a tree terminal or leaf. the resultant variable length vector of constants r has elements r j whose index must never exceed j = p max . two pointers are manipulated, p r and p z , the current position and the vector length respectively. there are two working memory locations, m and m . in the above, 'increment' means to increase by one, and 'decrement' to decrease by one. the symbol '=' denotes what the function returns to its parent node in the tree. the vectors produced are of small to moderate size. practical use of the method may require a small number of additional strategies. on some problems, for a small number of initial generations, the darwinian fitness is set as the vector size. this stimulates production of large output vectors. when a desired variable length vector output size is reached by all members of the population, the fitness is reset. from then onwards the fitness is the problem's darwinian fitness, typically a measure of solution error. the procedure is often not necessary but can become useful when the required solution complexity or dimension is very high. once seeded in this way, the solution vector will grow or shrink as crossover and mutation operations on the gp tree create new and improved gp trees.
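the wrap-around reading of the evolved vector described above can be sketched as follows; the function name and the tuple output format are illustrative assumptions.

```python
import math

def allocate_visits(solution_vector, visit_requests, n_slots=8):
    """Assign a time slot to every visit request by reading the evolved
    vector left to right, cycling back to the first number when the
    vector is shorter than the number of requests."""
    allocations = []
    for k, request in enumerate(visit_requests):
        x = solution_vector[k % len(solution_vector)]  # wrap-around read
        slot = min(n_slots, math.ceil(x * n_slots))    # map (0,1] to a 1-based slot
        allocations.append((request, slot))
    return allocations
```

with a two-element vector and three requests, the third request re-reads the first number, so a short vector can still schedule arbitrarily many visits.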
as the method makes use of gp trees and standard gp, all that we know about standard genetic programming, including modularization options such as adfs [ ] and subtree encapsulation [ ] , is applicable. as will be revealed in the numerical experiments, the approach works, discovers good solutions, and appears to be quick on a portable computer, the compiled c++ visual studio executable delivering solutions in seconds for the proof of concept experiments. regardless of the infection model used, the procedure is similar. there will exist a taxonomy of persons by some parameter(s), for example, age group. information about the likely degree of infection for different taxonomy classes is input to the computations. as time progresses and visits take place, people get infected. at the end of the three days, account is taken of who is infected and their prior health level. this calculates how many people in the proof of concept study will become infected a few weeks hence and how many may die. there are many assumptions, but if correctly taken they drive genetic programming to discover allocations that reduce hospitalizations and fatalities. central to darwinian fitness are such calculations of who and how many will perish or fall very ill and use the intensive care unit (icu). in this implementation the taxonomy is assumed to depend on age group. covid- testing is unlikely to become universal and frequent for all citizens, but limited testing and other measurements are sufficient to serve as indications of the degree of infection likelihood for sectors, or partitions, of the population. the proof of concept uses age as taxonomy. figure reveals how this works. if testing reveals that more people of age groups a or b have covid- than of age groups c or d, then percentages are entered at the start of the numerical experiments. for example, = . ; =
means that the -year-olds have a suspected level of covid- infection of three percent, while the -year-olds are suspected to have one percent. each simulation proceeds from monday to wednesday, and for each day from morning to night ( time slots). at each time slot all twelve establishments are considered and calculations of partial infection update the partial infection level of all. at the end of the day, a calculation is made to determine who should go into self-isolation and participate no longer in the simulation. the rules to identify those persons are presented in the left side of table . at the very end of the simulation another procedure calculates the total number of hospitalized in the intensive care unit (icu), denoted by the symbol n h . these are those in hospital who eventually recover but who nevertheless put pressure on the health service. the procedure also calculates those who unfortunately pass away, denoted by the symbol n d . the rules to determine these two numbers are presented in the right side of table . note the assumption made here that both the state of health and the age of the subject correlate similarly but are treated as separate causal factors. the gp darwinian fitness function f f is the measure of solution goodness. it combines n h and n d weighted by some desirability constant w c . f f is by choice a negative quantity, because the evolution is set to maximize this quantity, with a perfect score of zero implying n h = n d = 0: all of the numerical results in this proof of concept use w c = . , reflecting the desire to reduce fatalities. where w > w > w > reflects that the mortality of the young from covid- is low and that of the elderly is very high. however, it suffices to show, by the small worked out example of figure (and as seen in the figures), that a lower average partial infection does not necessarily translate into smaller values of n h and n d .
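the text states only that f_f combines n_h and n_d through a desirability constant w_c, is negative by choice, and reaches a perfect score of zero only when n_h = n_d = 0. a weighted sum is one form consistent with that description; both the exact combination and the example value of w_c below are assumptions.

```python
def fitness(n_hospitalized, n_deaths, w_c=10.0):
    """Darwinian fitness: a negative quantity to be maximized, zero
    only when there are no ICU cases and no deaths. w_c expresses how
    strongly deaths are penalized relative to ICU use (assumed form)."""
    return -(n_hospitalized + w_c * n_deaths)
```

under this form, avoiding one death outweighs avoiding several hospitalizations whenever w_c is large, matching the stated desire to reduce fatalities.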
in this type of model persons can only become fully infected, i.e., infection is all or nothing. hence, the rules to determine n h and n d are different and only apply to those infected. additionally, the decision to self-isolate depends on the number of days that the person has been infected and their health level. consider that some of the people in the input data file are already fully infected. table shows the decision schema for self-isolating persons as well as for deciding on n h and n d after the simulation is complete. note the assumption is again made here that both the state of health and the age of the subject correlate similarly but are assumed to be separate causal factors. the fitness measure for these computations is also equation . the numerical experiments compare the solution to three uninformed round robin allocations. the first of these, comp , sends all those: (a) requesting 'morning' visits or 'any time' visits to the first morning time slot. there is also a third uninformed allocation by round robin for comparison, comp . it is a round robin of three; however, as the morning has only two time slots, this sends two thirds of the requests for case (a) to the first slot and the other third to the second slot. for cases (b) and (c) it uses all three time slots, distributing the visitation requests evenly. comparison could be made to other allocations, but these are broadly representative of what would transpire in the real world without the benefit of the solution and in conditions of partial lockdown or no lockdown. it turns out, as expected, that comp results in the highest number of hospitalizations n h and deaths n d for the model problem and comp in the lowest, as revealed in the comparative experiments presented in this section. each experiment follows common gp practice, executing a large number of parallel independent runs (pirs). each pir differs in its initial random seed (taken from the pc clock timer), seeding the population randomly and differently for each pir.
pirs can have different population sizes of gp trees. all runs use percent crossover and percent mutation to generate new gp trees. this is a steady-state gp with a tournament selection of four individuals to select a strong mate for crossover, whereby two gp trees swap branches, or the winner of the tournament simply mutates, with a kill tournament of two to choose the weaker gp tree to replace. the maximum possible tree size for gp is set at , and if this is exceeded then the shorter side of the crossover swap is taken. the maximum variable length vector size is set at , but never remotely approached: excellent solutions have vector sizes of between and . it is an 'elitist' gp, for it does not destroy the fittest solution, i.e., solutions do not have a lifespan inside a pir. all pirs implement standard 'vanilla' gp with no parsimony pressure or any other non-standard approach. this proof of concept study does not employ explicit reuse: adf [ ] or subtree encapsulation [ ] . when a better solution emerges during the execution of a pir, it is stored separately. (table : self-isolation conditions — age, days infected and health determining actions, applying only to persons who become infected — and outcome rules — age and health determining outcomes.) a pir typically produces up to around ten solutions, of which two are highly fit and interesting. as subsequent pirs produce more solutions, all become ranked by fitness in a global list of solutions, from which the user can select one to inspect. for each solution, full details can be inspected: the participants in all visitations; infection levels; the numbers self-isolating, in icu and sadly passed away; the variable length vector of real numbers that is the solution and visitation schedule; and the identity, age and level of infection of all participants recovered, in icu or deceased. for the partial infection experiments it also computes partial infections by groups (young, middle aged and elderly) every four hours.
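the steady-state loop described above (mate tournament of four, kill tournament of two, elitism) might be sketched as below. the function names, the use of the tournament runner-up as crossover mate, and the crossover rate are assumptions, not details taken from the paper.

```python
import random

def steady_state_step(population, fitness_fn, crossover, mutate,
                      p_crossover=0.9, rng=None):
    """One steady-state GP update on a list-like population."""
    rng = rng or random.Random()
    # tournament of four: the two fittest contenders act as the parents
    contenders = rng.sample(range(len(population)), 4)
    contenders.sort(key=lambda i: fitness_fn(population[i]), reverse=True)
    parent, mate = population[contenders[0]], population[contenders[1]]
    # either crossover between the winners or mutation of the winner
    child = (crossover(parent, mate) if rng.random() < p_crossover
             else mutate(parent))
    # kill tournament of two: the weaker of the pair is replaced
    k1, k2 = rng.sample(range(len(population)), 2)
    victim = k1 if fitness_fn(population[k1]) <= fitness_fn(population[k2]) else k2
    # elitism: never overwrite the current best individual
    best_idx = max(range(len(population)), key=lambda i: fitness_fn(population[i]))
    if victim != best_idx:
        population[victim] = child
    return population
```

because the fittest member is never replaced, the best fitness in the population can only improve, matching the 'elitist' behaviour described in the text.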
the comprehensive result panes in the figures in this section merit closer inspection. as pirs build, a pareto front picture can emerge of non-dominated solutions involving the two criteria n d and n h . table shows the vast superiority of the solution to the uninformed round robin allocation schemes. the superiority is marked, and so there is no need to check this against other possible random allocation schemes. it makes common sense that the genetic programming solution is always superior. note that with s = the problem becomes easier than with s = because the chances for meetings between people are lower. this can be seen to be true for both round robin allocations and genetic programming solutions. it also appears that the superiority of genetic programming solutions over round robin allocations is correlated with higher s. this is probably because the genetic programming has more degrees of freedom to discover a better solution. therefore, even if the intuitive idea is that with a smaller density of people there should be less need for the solution, this is not always the case. (table : summary of results attained by the numerical experiments that made use of the partial infection model. with s = the solution is three times better than a round robin allocation and with s = it is ten times better.) note that in the cases where s = , genetic programming managed to discover an allocation where n d = 0, with no deaths. perhaps the counter-intuitive conclusion can be drawn that the solution is needed even more when the problem seems simpler. close inspection of the figures referred to in table reveals that the system of pirs delivers a number of pareto non-dominated solutions to choose from. also, there are solutions that achieve an identical result in terms of n d and n h but which are quite different.
then one is also able to inspect the age and health of the fatalities; again, this may be a tertiary factor that could come into play when selecting among the many equivalent solutions. finally, it can also be discerned that the average partial infection level and its final levels as shown in the figures do not always correlate with lower error and higher darwinian fitness. a different set of computations to the real world problem is included here for completeness. it pertains to knowing who is infected a priori and trying to optimize what could have happened had their visits been scheduled differently. it could be useful for a retrospective study. alternatively, such computations can potentially address the same real world problem, but only by carrying out a plethora of experiments with different seeds of infected individuals (assumed infected according to some taxonomic knowledge of the level of infection in the population) and then averaging the results in some way to prescribe safer allocations of visits, as an alternative to the experiments with the partial infection model presented in the previous section. a set of persons, as shown in figure , out of the in the proof of concept data described in section , come already fully infected a priori to drive the computations. this is about . % of total people, and they undertake visits, or . % of the total visits that appear in the proof of concept dataset. individuals with good health levels are chosen as already infected. for completeness, the figure also shows six people who have immunity and therefore cannot become infected in the computations; they are included but play no active role. it is envisaged that the optimization problem for gp is challenging. hence, the infection model is implemented at different values of q (see section .
and appendix a) to allow the solution room to make gains, but also to understand the dynamics under various levels of contagion (perhaps representing adherence to social distancing and use of masks). there are eight types of combinations of the events that can befall a person:
1. person comes to the simulation already infected and does not use the intensive care unit (icu);
2. person comes to the simulation already infected and does make use of the icu (adds to n h );
3. person comes to the simulation already infected, passes through the icu or not, and dies (adds to n d );
4. person comes to the simulation immune (has had the disease before and has recovered) and stays that way;
5. person comes to the simulation susceptible and does not get infected;
6. person comes to the simulation susceptible, gets infected but does not use the intensive care unit (icu);
7. person comes to the simulation susceptible, gets infected and does make use of the icu (adds to n h );
8. person comes to the simulation susceptible, gets infected, goes or not to the icu and dies (adds to n d ).
figure illustrates how to interpret that part of the result figures pointed to in table . as there are generally no cases of already-infected persons ending in the icu or dying, only six result summaries show, as in the figure. the results of all numerical experiments that use the full infection model are as in table . the difficulty with this infection model is that it multiplies the probability of infection by the number of susceptible and then simply truncates the result to the lower integer. moreover, it simply goes down the list of susceptible, infecting the first bunch it sees up to that integer number. this gives the solution opportunities to play games. for example, if q = then p( ) = . , and if there is only one susceptible and nine infected then the susceptible will not become infected. notwithstanding the weaknesses of such computations and their erroneous assumptions, this can be said about the figures in the table.
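the eight combinations of events enumerated above can be encoded as a small classifier; the flag names and the category numbering order are illustrative assumptions.

```python
def outcome_category(came_infected, came_immune, got_infected, used_icu, died):
    """Classify a person's trajectory into one of the eight event
    combinations: 1-3 arrived infected, 4 arrived immune, 5-8 arrived
    susceptible. Categories 2/7 add to n_h, categories 3/8 to n_d."""
    if came_immune:
        return 4
    if came_infected:
        if died:
            return 3
        return 2 if used_icu else 1
    # arrived susceptible
    if not got_infected:
        return 5
    if died:
        return 8
    return 7 if used_icu else 6
```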
as q is increased the gp search becomes much harder. best solutions come from bigger gp populations run for longer (larger numbers of generations). also, the advantage of the solution over the round robin assignments is less valuable than at lower q values. this is to be expected because at high q the disease is far more contagious and there is not much room for the optimization. note that with q = all round robin solutions have the same n h and n d , possibly indicating little room for schedule optimization by the solution. (table : summary of results attained by the numerical experiments that made use of the full infection model. with q = (see appendix a and the text), the solution is between two and three times better than simple round robin allocations.) large improvements in both lower mortality and lower use of the icu are available with the solution for the proof of concept data described in section , in computations with both infection models. the improvements as compared to round robin allocations in table and table clearly show it. the parameters s and q in the two numerical experiments have a similar but inverse influence, as the former gives the number of possible sub-locations that can be occupied while the latter is the number of checks performed inside the monte carlo calculation that detects co-presence. however, even at low s and at high q, the worst possibilities, the solution keeps outperforming round robin by a significant margin. at high s and low q the solution really shines and outperforms round robin by a very considerable margin. it augurs well because with good cleaning at locations, washing of hands and even moderate social distancing, the rate of infection is expected to be low (corresponding to higher s and lower q in a sense). of course, a more serious real-world model needs to be considered in further research. such a model would need to consider:
1. is it correct that a difference in infection rates exists between taxa? if so, what is this taxonomy and how can it be constructed? is it possible to discover this taxonomy through covid- testing and other data collection?
2. is it possible to develop a reasonably accurate infection model for certain stores and places that people wish to, or need to, frequent?
3. the infection model must also include a dynamic related to object contamination and transmission through contact with surfaces and objects;
4. travel by public transport to such locations also needs to be accounted for by the model;
5. is the round robin a fair reflection of how people wish to go out in the unrestricted normal case? are there invariant principles gathered from mobile phone roaming data that could inform how and when people go out?
6. what is the effect of non-compliance on the solution? could the solution account for it and still be gainful?
7. the solutions arrange into pareto non-dominated sets involving n d and n h : what is the difference between these, and also between equivalent solutions with the same score of f f ? is there a difference between such solutions of further interest?
8. the solution is designed to work even with very little idea or precision about the rates of infection and contamination, as it aims for an improvement rather than precise values. how valid are these assumptions?
9. the solution outperforms round robin, but how does it fare in terms of infection rate against strict lockdowns, or exiting lockdown with the phased lockdown strategies suggested in [ ] ?
10. can economic, psychological and other benefits of the solution be quantified to understand the cost benefit analysis of adopting it versus strict lockdown policies?
11. will people be willing to undergo numerous lockdowns waiting for the availability of a vaccine if the solution is adopted?
genetic programming is known to scale well with problem size.
however, even if millions of people were considered, there would be a degree of clustering that could be treated differently by different discovered solutions. the partial infection model is developed here for the first time. if there is something similar in the literature then this author does not know of it. it must be tested and developed further. it is possible that the pessimistic approach of multiplying the terms of the worst case is not entirely reasonable. lastly, how well infections can be avoided will depend on the number of visitations, the number of establishments visited, the number of people and the frequency of visits. in this research genetic programming was handed a tough challenge, as the number of visitations was in the order of and the time slots very few per day. also the number of establishments was only a few. in general, it is said that washing one's hands is far more effective than social distancing. it is probably true that contamination is more important than person to person transmission. contamination can be easily incorporated into the model. although the model is incipient, it is something that few, if any, have considered. most research is involved in exploiting data sources to predict infection levels or explain the disease. this contribution is different in nature because its deployment could generate data and would also need only disease-tendency data. this contribution is markedly different from contact tracing approaches, which are reactive. the work described here and its implementation would be proactive, but it could also inform and be informed by contact tracing. sophisticated relations can be determined as a function of the number of encounters between a susceptible and one or more infected that also incorporate the time of exposure modelled as a poisson process [ ] .
an important characteristic of such relations is that as the number of infected individuals n grows, p_n increases but not linearly, so that for example one can expect p_2 < 2 p_1. this proof of concept assumes a trivially simple infection function based on counting co-locations between s and i. these increase with the number of infected, but the ratio of these to the total possible encounters never exceeds one; that is, as n → ∞ then p_∞ → 1. the essence of this behaviour can be crudely emulated with monte carlo or, as discussed here, by counting on the simple probability trees of figure . we ignore the dependency on time of exposure and assume all visits to places are of similar duration. the genetic programming approach that uses such infection functions, however, is general and able to incorporate any candidate infection function. from this figure, first consider the case of two individuals only: p_1 and p_2. individual p_1 is susceptible while p_2 is infected. the number of encounters between these two, the red boxes at the p_2 level in the figure, or where both individuals share a same location, is four, and the total number of possible outcomes is sixteen. hence, the opportunity for an encounter taking place and therefore for infection, assuming each individual has an equal presence at each of the four locations and spends the same amount of time at any location visited, is p_1 = 4/16 = 0.25. next consider the case where three individuals participate, with p_1 susceptible and p_2 and p_3 both infected. now we recognize opportunities for all three individuals to exist at the same location, and also opportunities for p_1 to share a location with either infected p_2 or p_3. if we count such opportunities we arrive at the two-infected probability p_2 = 7/16 ≈ 0.44, and we verify that it is less than twice the one-infected probability. for the particular scenario of figure , the coarse emulation leads to a simple relation for the probability of encounter for any n: p_n = 1 − (3/4)^n. considering rather large n, it tends to one: p_n = 1 − (3/4)^n → 1.
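with four equally likely locations, the encounter probability above can be computed directly; the function below reproduces the counting argument (4/16 for one infected, 7/16 for two), generalized to any number of locations as an illustrative assumption.

```python
def encounter_probability(n_infected, n_locations=4):
    """Probability that a susceptible shares one of n_locations
    equally likely locations with at least one of n_infected infected
    people: p_n = 1 - ((n_locations - 1) / n_locations) ** n_infected."""
    return 1.0 - ((n_locations - 1) / n_locations) ** n_infected
```

the sub-linear growth claimed in the text follows immediately: p_2 = 7/16 is smaller than 2 · p_1 = 8/16, and the probability saturates towards one as n grows.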
we now introduce what may be an unusual idea. the real world objective of this initiative is to leverage covid- testing. imagine that upon testing for covid- the incidence is found to be higher in age groups and . it is unlikely that covid- testing will be undertaken by all people all of the time. moreover, tests are still unreliable. hence, we need to work with partial knowledge. an option to explore might be to work directly with the probable levels of infection that are informed by the testing for different groups of people as organized in some taxonomy: for example, consider the probability of infection to be percent and percent respectively in those age groups and zero in the others. this motivates the use of a 'partial infection', but why? if we knew who was or was not infected we would let them out or keep them in isolation. however, as we cannot test every single person, it is useful to assign to them an uncertainty, a probability level. in such a case, we must consider that any of the people in figure , namely p_1, p_2 or p_3, may be partially infected (and partially susceptible). figure shows the encounters between persons, i.e., two people sharing one of the four locations at the same moment in time, thus coming into contact with each other. we are especially interested in two kinds of encounters, p_1-p_2 and p_1-p_3, as we can assume that one person, p_1, is susceptible while the other two are infected. note that under permutations assigning i and s there is no need to count encounters p_2-p_3, as these would represent two infected, and neither of interest are encounters between two susceptible. the thirty two encounters that determined the probability of infection result in the union that gives twenty eight encounters in the figure. note that in spite of p_1 already carrying a small probability of infection, its new infection level is . < . . this is because the others were not one hundred percent infected and the result was close to, but less than, . .
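the partial-infection bookkeeping can be illustrated with a simplified stand-in. the paper sums ordered worst-case products, which are not reproduced here; instead this sketch treats each partner as an independent transmission chance and applies the update i' = p·s + i, so it only mimics the qualitative behaviour described above: levels rise monotonically, stay bounded by one, and fall short of the fully-infected case when partners are only partially infected.

```python
def update_partial_infection(i_self, partner_levels, p_encounter=0.25):
    """Raise one person's partial infection level after meeting
    partners with infection levels in partner_levels.

    Stand-in combiner (assumption): each partner with level i_j has an
    independent chance p_encounter * i_j of transmitting; the combined
    pressure p then updates the level via i' = p * s + i with s = 1 - i."""
    p_none = 1.0
    for i_j in partner_levels:
        p_none *= 1.0 - p_encounter * i_j
    p = 1.0 - p_none
    s_self = 1.0 - i_self
    return p * s_self + i_self
```

with two fully infected partners this reduces to the two-infected encounter probability 7/16, while half-infected partners yield a strictly smaller pressure, matching the observation that the result stays below the fully-infected value.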
here is a third example: consider the interactions between three people with partial infection. notice that p_2 and p_3 have an infection level . > . ; this is because the people were already one percent infected. an example involving four people is illustrated with the help of figure : consider the interactions between four individuals with partial infection. note from the previous examples that if we simply identify the participant with the highest component of s, then that will be the one that indicates the highest contribution. in this case that is p , and so s_iii should be highest, and it is, at . . this number is very close to p_3 = 1 − (3/4)^3 ≈ 0.58. it is a higher value because all participants contribute a significant level of infection. now we compute the new partial infections for our four participants: these factors can be seen to be in descending order. if n_p people participate in an encounter but the number of partially infected people n is less than n_p, then we assume one fully susceptible individual with i = 0 and s = 1 will meet with the n partially infected individuals, and thus s = 1 is used in the formula. the calculation follows this ten step procedure (pseudo-code): identify s_max and that participant with the highest value of s_v; prepare the products (s_max i_v_j g_j) and sum them to get the infection level .
use it to update the infection levels of all n_p participants: i_{v+1} = p s_v + i_v.

references:
predicting the structure of covert networks using genetic programming, cognitive work analysis and social network analysis
genetic programming of the stochastic interpolation framework: convection-diffusion equation
genetic programming visitation scheduling in lockdown with partial infection model that leverages information from covid- testing
genetic programming solution of the convection-diffusion equation
differential susceptibility epidemic models
genetic programming: on the programming of computers by means of natural selection
genetic programming ii: automatic discovery of reusable programs
organización internacional para las migraciones panamá: cuarentena total en suelo panameño (total quarantine on panamanian soil)
evolving modules in genetic programming by subtree encapsulation
a phased lift of control: a practical strategy to achieve herd immunity against covid- at the country level. biorxiv
the origin of quarantine

appendix a: pseudo-code for the monte carlo routine. the following pseudo-code with q = produces the numbers of figure :
do m, times (iterate over infected)
ninfection[m] = (counts the encounters for this many infected)
end iteration index m
do , times (use a big number for good accuracy)
do k, q times (typically q is between

key: cord- -tgka pl authors: tovo, anna; menzel, peter; krogh, anders; lagomarsino, marco cosentino; suweis, samir title: taxonomic classification method for metagenomics based on core protein families with core-kaiju date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: tgka pl
characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. however, determining microbiome diversity implies the classification of taxa composition within the sampled community, which is often done via the assignment of individual reads to taxa by comparison to reference databases.
although computational methods aimed at identifying the microbe(s) taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. in this study, we first apply and compare different bioinformatics methods based on s ribosomal rna gene and whole genome shotgun sequencing for taxonomic classification to three small mock communities of bacteria, of which the compositions are known. we show that none of these methods can infer both the true number of taxa and their abundances. we thus propose a novel approach, named core-kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to s, but based on emergent statistics of core protein domain families. we thus test the proposed method on the three small mock communities and also on medium- and highly complex mock community datasets taken from the critical assessment of metagenome interpretation challenge. we show that core-kaiju reliably predicts both the number of taxa and the abundance of the analysed mock bacterial communities. finally we apply our method on human gut samples, showing how core-kaiju may give a more accurate ecological characterization and a fresh view on real microbiomes. modern high-throughput genome sequencing techniques revolutionized ecological studies of microbial communities at an unprecedented range of taxa and scales ( , , , , ) . it is now possible to massively sequence genomic dna directly from incredibly diverse environmental samples ( , ) and gain novel insights about the structure and metabolic functions of microbial communities.
* correspondence should be addressed to dr. suweis. email: suweis@pd.infn.it
one major biological question is the inference of the composition of a microbial community, that is, the relative abundances of the sampled organisms.
in particular, the impact of microbial diversity and composition on the maintenance of human health is increasingly recognized ( , , , ) . indeed, several studies suggest that the disruption of the normal microbial community structure, known as dysbiosis, is associated with diseases ranging from localized gastroenterologic disorders ( ) to neurologic illnesses ( ) . however, it is impossible to define dysbiosis without first establishing what normal microbial community structure means within the healthy human microbiome. to this purpose, the human microbiome project has analysed the largest cohort and set of distinct, clinically relevant body habitats ( ) , characterizing the ecology of healthy human-associated microbial communities. however, there are several critical aspects. the study of the structure, function and diversity of the human microbiome has revealed that even healthy individuals differ remarkably in the species they contain and their abundances. much of this diversity remains unexplained, although diet, environment, host genetics and early microbial exposure have all been implicated. characterizing a microbial community implies the classification of the species/genera composition within the sampled community, which in turn requires the assignment of sequencing reads to taxa, usually by comparison to a reference database. although computational methods aimed at identifying the microbe(s) taxa have an increasingly long history within bioinformatics ( , , ) , it is well known that inference based on s ribosomal rna (rrna) or shotgun sequencing varies widely ( ) . moreover, even if data are obtained via the same experimental protocol, the usage of different computational methods or algorithm variants may lead to different results in the taxonomic classification. the two main experimental approaches for analyzing microbiomes are based on s rrna gene amplicon sequencing and whole genome shotgun sequencing (metagenomics). sequencing of amplicons from a region of the s rrna gene is a common approach used to characterize microbiomes ( , ) and many analysis tools are available (see materials
Sequencing of amplicons from a region of the 16S rRNA gene is a common approach used to characterize microbiomes ( , ), and many analysis tools are available (see Materials and Methods section). [© The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/ . /uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.] Besides the biases in the experimental protocol, a major issue with 16S amplicon sequencing is the variation in copy number of the 16S gene between different taxa. Therefore, abundances inferred from read counts of the amplicons should be properly corrected by taking into account the copy numbers of the different genera detected in the sample ( , , ). However, the average number of 16S rRNA copies is known only for a restricted selection of bacterial taxa. As a consequence, different algorithms have been proposed to infer from the data the copy number of those taxa for which this information is not available ( , ). In contrast, whole genome shotgun sequencing of all the DNA present in a sample can inform about diversity and abundance as well as about the metabolic functions of the species in the community ( ). The accuracy of shotgun metagenomics species classification methods varies widely ( ). In particular, these methods can typically produce a large number of false-positive predictions, depending on the sequence comparison algorithm used and on its parameters. For example, in k-mer based methods such as Kraken ( ) and Kraken 2 ( ), the choice of k determines the sensitivity and precision of the classification: sensitivity increases and precision decreases with decreasing values of k, and vice versa.
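The 16S copy-number correction mentioned above can be sketched as follows. This is a minimal illustration with made-up genus names, read counts and copy numbers, not data from the study:

```python
# Sketch of 16S copy-number correction for amplicon read counts.
# All numbers and genus names are illustrative, not from the study.
raw_counts = {"Escherichia": 7000, "Lactobacillus": 5000, "Bacteroides": 6000}
# Average 16S rRNA gene copies per genome (hypothetical values).
copy_number = {"Escherichia": 7, "Lactobacillus": 5, "Bacteroides": 6}

# Divide read counts by copy number, then renormalize to relative abundances.
corrected = {g: raw_counts[g] / copy_number[g] for g in raw_counts}
total = sum(corrected.values())
rel_abundance = {g: corrected[g] / total for g in corrected}

print(rel_abundance)  # each genus contributes 1000/3000 = 1/3 here
```

Without the correction, Escherichia (7 copies per genome in this toy example) would appear more abundant than it really is; after dividing by copy number all three genera come out equally abundant.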
As we will show, false-positive predictions often need to be corrected heuristically by removing all taxa with abundance below a given arbitrary threshold (see Materials and Methods section for an overview of the different taxonomy classification algorithms). We highlight that the protocols for 16S amplicon and shotgun methods are different and that each has its own batch effects. Importantly, while shotgun taxonomic analysis gives classification results at the species level, 16S taxonomic profilers most often have to stop at the genus level. In the end, however, both aim at answering the same question: "what are the relative abundances of taxa in the sample?" It is therefore not methodologically wrong to compare their answers on the same community. To do so, one can aggregate lower-level counts (e.g. species) towards higher levels (e.g. genus), as has been done in many benchmark studies before (see, e.g., ( , , , )). In fact, several studies have compared taxa inferred from 16S amplicon and shotgun sequencing data, with samples ranging from human microbiomes to water and soil. Logares and collaborators ( ) studied communities of marine planktonic bacteria and found that shotgun approaches had an advantage over amplicons, as they gave more truthful estimates of community richness and evenness by avoiding PCR biases, and provided additional functional information. Chan et al. ( ) analyzed thermophilic bacteria in hot spring water and found that amplicon and shotgun sequencing allowed for comparable phylum detection, but that shotgun sequencing failed to detect three phyla. In another study ( ), 16S rRNA and shotgun methods were compared in classifying bacterial communities sampled from freshwater. The taxonomic composition of each 16S rRNA gene library was generally similar to that of its corresponding metagenome at the phylum level.
At the genus level, however, there was a large amount of variation between the 16S rRNA sequences and the metagenomic contigs, the latter offering roughly ten-fold higher resolution and sensitivity for genus diversity. More recently, Jovel et al. ( ) compared bacterial communities from different microbiomes (human, mouse) and also from mock communities. They found that shotgun metagenomics offered a greater potential for the identification of strains, which nevertheless still remained unsatisfactory. It also allowed increased taxonomic and functional resolution, as well as the discovery of new genomes and genes. While shotgun metagenomics has certain advantages over amplicon sequencing, its higher price point is still prohibitive for many applications. Amplicon sequencing therefore remains the established, cost-effective go-to tool for assessing the taxonomic composition of microbial communities. In fact, the use of the 16S rRNA gene as a universal marker throughout the entire bacterial kingdom has made it easy to collect sequence information from a wide distribution of taxa, which is yet unmatched by whole genome databases. Several curated databases exist to date, with SILVA ( , ), Greengenes ( , ) and the Ribosomal Database Project (RDP) ( ) being the most prominent. Additionally, NCBI provides a curated collection of 16S reference sequences in its Targeted Loci project (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/). When benchmarking protocols for taxonomic classification on real samples of complex microbiomes, the "ground truth" of the contained taxa and their relative abundances is not known (see ( )). Therefore, the use of mock communities or simulated datasets remains the basis for a robust comparative evaluation of a method's prediction accuracy.
In the first part of this work we apply three widely used taxonomic classifiers for metagenomics, Kaiju ( ), Kraken ( ) and MetaPhlAn ( ), and two common methods for analyzing 16S amplicon sequencing data, DADA2 ( ) and QIIME ( ), to three small mock communities of bacteria whose exact composition we know ( ). We show that 16S rRNA data allow the number of taxa to be detected efficiently, but not their abundances, while shotgun metagenomics methods such as Kaiju and Kraken give a reliable estimate of the most abundant genera, but the nature of their algorithms makes them predict a very large number of false-positive taxa. The central contribution of this work is thus to develop a method that overcomes the above limitations. In particular, we propose an updated version of Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method, similar to 16S rRNA, but based on core protein domain families ( , , , ) from the Pfam database ( ). Our criterion for choosing the set of marker domain families is the existence of a set of core families that are typically present in at most one or very few copies per genome, but together uniquely cover all bacterial species in the Pfam database with an overall quite short sequence. Using the presence of these core Pfams (mostly related to ribosomal proteins) as a filter criterion allows the correct number of taxa in the sample to be detected. We tested our approach in a protocol called "core-kaiju" and show that it has a higher accuracy than other classification methods not only on the three small mock communities, but also on the intermediate- and highly-biodiverse mock communities designed for the 1st Critical Assessment of Metagenome Interpretation (CAMI) challenge ( ). In fact, we will show how in all these cases core-kaiju overcomes, for the most part, the problem of false-positive genera and accurately predicts the abundances of the different detected taxa.
We finally apply our novel pipeline to classify microbial genera in the human gut from the Human Microbiome Project (HMP) ( ) dataset, showing how core-kaiju may allow for a more accurate biodiversity characterization of real microbial communities, thus laying the basis for more solid dysbiosis analyses in microbiomes.

Taxonomic classification: amplicon versus whole genome sequencing

Many computational tools are available for the analysis of both amplicon and shotgun sequencing data ( , , , , , , ). One of the differences among the various software packages for 16S rRNA analysis is how the next-generation sequencing error rate per nucleotide is taken into account when associating each sampled 16S sequence read to a taxon. Indeed, errors along the nucleotide sequence could lead to inaccurate taxon identification and, consequently, to misleading diversity statistics. The traditional approach to overcome this problem is to cluster amplicon sequences into so-called operational taxonomic units (OTUs), based on an arbitrary shared similarity threshold, usually set equal to % for classification at the genus level. Of course, these approaches lead to a reduction of the phylogenetic resolution, since gene sequences below the fixed threshold cannot be distinguished from one another. That is why it may sometimes be preferable to work with exact amplicon sequence variants (ASVs), i.e. sequences recovered from a high-throughput marker gene analysis after the removal of spurious sequences generated during PCR amplification and/or sequencing. The next step in these approaches is to compare the filtered sequences with reference libraries such as those cited above. In this work, we chose to conduct the analyses with the following two open-source platforms: DADA2 ( ) and QIIME ( ).
DADA2 is an R package optimized to process large datasets (from tens of millions to billions of reads) of amplicon sequencing data, with the aim of inferring the ASVs from one or more samples. Once spurious sequences have been removed and the true 16S rRNA gene sequences recovered, DADA2 allows comparison with the SILVA, Greengenes and RDP libraries; we performed the analyses for all three choices. QIIME is another widely used bioinformatics platform for the exploration and analysis of microbial data, which allows a choice between different methods for the sequence quality control step. For our comparisons, we performed this step using Deblur ( ), a novel sub-operational-taxonomic-unit approach which exploits information on error profiles to recover error-free 16S rRNA sequences from samples. As shown in ( ), where different amplicon sequencing methods were tested on both simulated and real data and the results compared with those obtained with metagenomic pipelines, the whole genome approach turned out to outperform the others in terms of number of identified strains, taxonomic and functional resolution, and reliability of the estimated microbial relative abundance distributions in samples. Similar comparisons have been performed, with analogous results, in ( , , , ) (see ( ) for a comprehensive summary of studies comparing different sequencing approaches and bioinformatics platforms). Standard, widespread taxonomic classification algorithms for metagenomics (e.g. Kraken ( ) and Kraken 2 ( )) extract all contained k-mers (all the possible strings of length k contained in the whole metagenome) from the sequencing reads and compare them with an index built from a genome database. However, the choice of the length k strongly influences the classification: when k is too large, it is easy to miss correspondences in the reference database, whereas when k is too small, reads may be wrongly classified.
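The k-mer matching underlying tools like Kraken, and the role of k, can be illustrated with a toy example. Real tools use compact indexed databases and a taxonomic tree; the sequences and genome names here are invented. With a short k the read shares k-mers with both toy genomes, including spurious partial matches with the wrong one; with a long k the match is unique to the correct genome, but a single sequencing error in the read would destroy it:

```python
# Minimal illustration of k-mer based read matching and the role of k.
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

reference = {
    "genome_A": "ACGTACGTGGCCTTAA",  # the read's true origin
    "genome_B": "TTGGACGTCCGTACAA",  # shares short motifs only
}
read = "ACGTACGT"

for k in (4, 8):
    index = {name: kmers(seq, k) for name, seq in reference.items()}
    hits = {name: len(kmers(read, k) & idx) for name, idx in index.items()}
    # k=4: genome_A -> 4 shared k-mers, genome_B -> 3 (spurious partial matches)
    # k=8: genome_A -> 1, genome_B -> 0 (specific, but fragile to mismatches)
    print(k, hits)
```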
Recently, a novel approach has been proposed for the classification of shotgun data, based on sequence comparison to a reference database comprising protein sequences, which are much more conserved than nucleotide sequences ( ). Kaiju indexes the reference database using the Burrows-Wheeler transform (BWT), and translated sequencing reads are searched in the BWT using maximum exact matches, optionally allowing for a certain number of mismatches via a greedy heuristic approach. It has been shown ( ) that Kaiju is able to classify more reads in real metagenomes than nucleotide-based k-mer methods. Previous conclusions about the composition and structure of microbial communities in the human microbiome may therefore be strongly biased by earlier metagenomic analyses that missed up to % of the reconstructed species (i.e. most of the species they found were not present in the gene catalog). We therefore chose to work with Kaiju (with the MEM option ( )) for our taxonomic analysis. Although it turned out to give better estimates of sample biodiversity composition than amplicon sequencing techniques, we found that it generally overestimates the number of genera actually present in our community (see Results section) by two orders of magnitude, i.e. there is a long tail of low-abundance false-positive taxa. To overcome this, we implemented a new release of the program, core-kaiju, which contains an additional preliminary step in which read sequences are first mapped against a new protein reference library we created, containing the amino-acid sequences of the proteomes' core Pfams (see following section). We also compared standard Kaiju and core-kaiju results with those obtained via Kraken 2 and via another widely used program for shotgun data analysis, MetaPhlAn ( , ). After downloading the Pfam database (version .
), we selected only bacterial proteomes and tabulated the data into an F × P matrix, where each column represents a different proteome and each row a different protein domain family. Our database consisted of P = bacterial proteomes and F = protein families. In each matrix entry (f, p) we stored the number of times family f occurs in the proteins of proteome p, n_f,p. Summing over column p gives the proteome size, i.e. the total number of family occurrences it comprises, which we denote L_p. Similarly, summing over row f gives the family abundance, i.e. the number of times family f appears in the Pfam database, which we call A_f. Figure shows the frequency histogram of the proteome sizes (left panel) and of the family abundances (right panel). Our primary goal was to find the so-called core families ( ), i.e. the protein domain families which are present in the overwhelming majority of bacterial proteomes but occur only a few times in each of them ( , ). In order to analyze the occurrences of Pfams in proteomes, we converted the original F × P matrix into a binary one, recording whether each Pfam is present or not in each proteome. In the left panel of Figure we show the histogram of the family occurrences, which displays the typical U-shape already observed in the literature ( , , , ): a huge number of families are present in only a few proteomes (first peak of the histogram), while a second, smaller peak occurs at large values, meaning that a fraction of domains occur in almost all proteomes. In the right panel, we plot the number of rare Pfams (having abundance less than or equal to four in each proteome) versus the percentage of proteomes in which they have been found. We then selected the Pfams found in more than % of the proteomes and such that max_p n_f,p = (see zoom panel of Figure ).
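Assuming the selection criteria amount to a prevalence cutoff plus a single-copy condition, the core-family filter described above can be sketched on a toy occurrence matrix; the cutoff value (0.9) and the matrix entries are illustrative, not the paper's:

```python
import numpy as np

# Toy occurrence matrix n[f, p]: rows = Pfam families, columns = proteomes.
# Entries count how many times family f occurs in proteome p (made-up data).
n = np.array([
    [1, 1, 1, 1],   # present once in every proteome -> core candidate
    [1, 0, 1, 1],   # present in only 75% of proteomes -> excluded
    [3, 2, 4, 1],   # ubiquitous but multi-copy -> excluded
    [0, 0, 1, 0],   # rare family -> excluded
])

prevalence = (n > 0).mean(axis=1)    # fraction of proteomes containing f
single_copy = n.max(axis=1) == 1     # never more than one copy per proteome

# Hypothetical prevalence cutoff; the paper's exact cutoff is not reproduced here.
core = np.where((prevalence > 0.9) & single_copy)[0]
print(core)  # -> [0]
```

Only the first family survives both filters: it is found in every proteome and never in more than one copy.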
Since we wish to have at least one representative core Pfam for each proteome in the database, we checked whether these selected core families 'cover' all bacteria. Unfortunately, none of them turned out to be present in proteomes and , corresponding to Actinospica robiniae DSM and Streptomyces sp. NRRL B- , respectively. We therefore looked for the most prevalent Pfam(s) present in those proteomes. We found that Pfam PF , occurring in % of the proteomes, was present in both Actinospica robiniae and Streptomyces, and we therefore added it to our core-Pfam list. Eventually, in order to minimize the number of Pfams to work with (and the related computational cost), we retained in our final core-Pfam list only the minimum number of domains through which we were able to cover the whole list of proteomes in the database. In particular, the selected core protein domains for bacterial proteomes are the ten Pfams PF , PF , PF , PF , PF , PF , PF , PF and PF (see Table ), namely: ribosomal protein L (PF ); ribosomal protein L (PF ); the NusB family, involved in the regulation of rRNA biosynthesis by transcriptional antitermination (PF ); ribosomal protein L (PF ); ribosomal protein S , which interacts with 16S rRNA (PF ); the MraW methylase family of SAM-dependent methyltransferases (PF ); ribosomal protein L , C-terminal domain (PF ); a domain of unknown function (DUF ) (PF ); EF-P (elongation factor P), a translation factor required for efficient peptide bond synthesis on 70S ribosomes (PF ); and ribosomal protein S L /mitochondrial S L (PF ).

Principal coordinate analysis. In order to explore whether the sequences of the core Pfam protein domains are correlated with taxonomy, we did the following. First, we downloaded from the UniProt database ( ) the amino-acid sequence of each Pfam across the different proteomes (see Supporting Information for details).
Their averaged (over proteomes) sequence lengths L turned out to be highly peaked around specific values ranging from L = to L = (see Supporting Information, Figure S , for the corresponding frequency histograms). Second, for each family we computed the Damerau-Levenshtein (DL) distance between all of its corresponding sequences. The DL distance measures the edit distance between two strings as the minimum number of allowed operations needed to modify one string to match the other. Such operations include insertions, deletions/substitutions of single characters and transpositions of two adjacent characters, which are common errors introduced during DNA replication. This analogy makes the DL distance a suitable metric for the variation between protein sequences. For simplicity, and to obtain a more immediate insight, we conducted the analysis only for the sequence points corresponding to the five most abundant phyla, i.e. Proteobacteria, Firmicutes, Actinobacteria, Bacteroidetes and Cyanobacteria. After computing the DL distance matrices between all the amino-acid sequences of each Pfam across proteomes, we performed multidimensional scaling (MDS), or principal coordinate analysis (PCoA), on each DL distance matrix. This step allows us to reduce the dimensionality of the space describing the distances between all pairs of core Pfams of the different taxa and to visualize it in a two-dimensional space. In the last two columns of Table we report the percentage of the variance explained by the first two principal coordinates for the ten core families: the first ranges from . to . % and the second from . to . %. We then plotted the sequence points in the new principal coordinate space, colouring them by phylum. In general, we observed a two-case scenario.
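A minimal implementation of the Damerau-Levenshtein distance used above (in its restricted, optimal-string-alignment variant), shown on invented amino-acid fragments:

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    minimum number of insertions, deletions, substitutions and transpositions
    of adjacent characters needed to turn string a into string b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

# Toy fragments: one adjacent transposition (VL -> LV) plus one substitution.
print(dl_distance("MKVLAT", "MKLVAS"))  # -> 2
```

A full PCoA would then embed the resulting pairwise distance matrix, e.g. via classical multidimensional scaling.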
For some families, such as PF (see Figure , left panel), Actinobacteria and Proteobacteria sequences are each grouped in one or two highly visible clusters, whereas the other three phyla do not form well-distinguished structures, their sequence points lying close to one another, especially for Cyanobacteria and Firmicutes. For other families, such as PF (see Figure , right panel), all five phyla turn out to be clustered, suggesting a higher correlation between taxonomy and amino-acid sequences (see Supporting Information, Figure S , for the plots of the other core families). These results suggest that some core families (e.g. ribosomal ones) are phylum dependent, while others are not directly correlated with taxa. We started by testing shotgun versus 16S taxonomic pipelines on three small artificial bacterial communities generated by Jovel et al. ( ), whose raw data are publicly available (Sequence Read Archive (SRA) portal of NCBI, accession number SRP ). These mock populations contain DNA from eleven species belonging to seven genera: Salmonella enterica, Streptococcus pyogenes, Escherichia coli, Lactobacillus helveticus, Lactobacillus delbrueckii, Lactobacillus plantarum, Clostridium sordellii, Bacteroides thetaiotaomicron, Bacteroides vulgatus, Bifidobacterium breve and Bifidobacterium animalis. For the taxonomic analysis at the genus level through 16S amplicon sequencing, we evaluated the performance of the DADA2 ( ) and QIIME ( ) pipelines. In particular, as shown in ( ), QIIME produced more reliable results in terms of relative bacterial abundances for all three mock communities when compared with Mothur ( ), another widely used 16S pipeline, and with MiSeq Reporter v , software developed by Illumina to analyze MiSeq instrument output data. As for the shotgun libraries, we tested standard Kaiju ( ), Kraken 2 ( ), the improved version of Kraken ( ), and MetaPhlAn 2 ( ), the improved version of MetaPhlAn ( ).
The latter relies on unique clade-specific marker genes and has been shown to have higher precision and speed than other programs ( ). Finally, we tested core-kaiju on these mock communities and compared its performance with the above taxonomic classification methods.

Table . For each core family (Pfam ID, first column) we report the percentage of proteomes in which it appears (prevalence, second column), the maximum number of times it occurs in one proteome (maximal occurrence, third column), the total number of times it is found among the proteomes in the Pfam database (total occurrence, fourth column) and the percentage of variance explained by the first two coordinates (PCo1 and PCo2, last two columns) when MDS is performed on sequences belonging to the five most abundant phyla (see Figure ).

After defining the core Pfams, we created two protein databases for Kaiju: the first contains only the protein sequences of the core families, whereas the second is the standard Kaiju database based on the bacterial subset of the NCBI nr database. The protocol then follows these steps:
1. Classify the reads with Kaiju using the database with the core protein domains.
2. Classify the reads with Kaiju using the nr database to get the preliminary relative abundances for each genus.
3. Discard from the list of genera detected in step 2 those having an absolute abundance of twenty reads or fewer in the list obtained in step 1. This threshold represents our confidence level in the sequencing pipeline (see below).
4. Re-normalize the abundances of the genera retained in step 3.
We evaluated the performance of both the shotgun and 16S pipelines for the taxonomic classification of the three mock communities. In the top panels of Figure we show the true relative genus abundance composition of the three small mock communities versus the compositions predicted by the different tested taxonomic pipelines.
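Our reading of the filtering in steps 3-4 of the protocol above can be sketched as follows; the genus names, read counts and exact bookkeeping are illustrative, and only the twenty-read absolute threshold is taken from the text:

```python
# Sketch of the core-kaiju filtering step, assuming per-genus read counts
# are already available from the two Kaiju runs (made-up names and counts).
core_counts = {"Salmonella": 410, "Escherichia": 12, "Bacteroides": 230}  # core-domain DB run
nr_counts = {"Salmonella": 52000, "Escherichia": 18000,
             "Bacteroides": 31000, "Ralstonia": 900}                      # full nr DB run

THRESHOLD = 20  # absolute read-count confidence level on the core-domain run

# Keep a genus only if the core-domain run supports it with > THRESHOLD reads.
kept = {g: c for g, c in nr_counts.items()
        if core_counts.get(g, 0) > THRESHOLD}

# Re-normalize the surviving nr-based abundances to relative abundances.
total = sum(kept.values())
rel = {g: c / total for g, c in kept.items()}
print(rel)
```

In this toy run, Ralstonia (absent from the core-domain hits) and Escherichia (only 12 core-domain reads) are discarded as unsupported, and the abundances of the two surviving genera are renormalized.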
We then applied the core-kaiju pipeline to detect the biodiversity composition of the same three mock communities. In the bottom panels of Figure we plot the linear fit performed on the relative abundances predicted by core-kaiju versus the theoretical ones, known a priori. As we can see, in all three cases the predicted community composition was satisfactorily captured by our method, with an R² value higher than . . Our goal was to quantitatively compare the performance of the different methods in terms of both biodiversity and relative abundances.

Figure . Comparison between theoretical and predicted relative abundances in small mock communities. Top panels: predicted relative abundance composition of the three small mock communities via different taxonomic classification methods. Bottom panels: red points represent the relative abundances predicted at the genus level by core-kaiju on the three mock communities versus the true ones, known a priori. The green line is the linear fit performed on the obtained points, which, in the best scenario, should coincide with the quadrant bisector (dotted red line). In all three cases the predicted community composition was satisfactorily captured by our method, with R² values of . , . and . , respectively.

As for the first, we chose to measure it via the F1 score applied at the genus level. More precisely, we define the recall of a given taxonomic classification method as the number of truly-positive detected genera (present in a community and correctly detected by the method), T_P, over the sum of T_P and F_N, the number of false-negative genera (present in a community but missed by the classification). In turn, we define the precision as the ratio between T_P and the sum of T_P and F_P, the number of false-positive genera (not present in a community but incorrectly detected as present). Finally, the F1 biodiversity score is twice the ratio between the product of recall and precision and their sum, i.e.
F1 = 2 · precision · recall / (precision + recall) = 2 T_P / (2 T_P + F_N + F_P). The F1 score values obtained via the different methods for the three analysed mock communities are presented in Table . While F1 describes the overall accuracy in detecting the correct number of genera in the sample, R² gives the correlation between the taxa abundances measured by the pipeline and the real composition of the microbial sample. Finally, we also report the number of genera each method predicts, Ĝ. Table summarizes the results of the analysis, together with the R² values obtained for the linear fit between true and predicted relative abundances. As we can see, both core-kaiju and MetaPhlAn gave a good estimate of the number of genera in the communities (which is equal to seven), whereas all 16S methods slightly overestimated it. Both standard Kaiju and Kraken, finally, predicted a number of genera much higher than the true one. Moreover, the fits of the abundances predicted by standard Kaiju and core-kaiju displayed a higher determination coefficient than all other pipelines, with the exception of Kraken 2, which gave comparable values. However, if we focus on the F1 score, we can notice that core-kaiju outperformed all the other methods in terms of precision and recall. In particular, since the pipeline led to zero false-positive genera and only one false-negative genus (E. coli, in all three communities), the resulting precision and recall were 1 and 6/7 ≈ 0.86 for all the sampled mocks. With core-kaiju, we were therefore able to produce a reliable estimate of both the number of genera in the communities and their relative abundances. As stated in the Introduction and observed above, metagenomic classification methods such as Kaiju often give a high number of false-positive predictions. In principle, one could set an arbitrary threshold on the detected relative abundances, for example . % or %, to filter out low-abundance taxa that are likely false positives.
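The precision, recall and F1 definitions can be made concrete with a small sketch mirroring the scenario reported above for core-kaiju (zero false positives, one false negative out of seven genera); the genus sets are illustrative:

```python
# Precision, recall and F1 at the genus level, as defined above.
# true_genera / predicted_genera are illustrative sets, not the study's data.
true_genera = {"Salmonella", "Streptococcus", "Escherichia", "Lactobacillus",
               "Clostridium", "Bacteroides", "Bifidobacterium"}
predicted_genera = {"Salmonella", "Streptococcus", "Lactobacillus",
                    "Clostridium", "Bacteroides", "Bifidobacterium"}  # misses E. coli

tp = len(true_genera & predicted_genera)   # truly-positive genera
fp = len(predicted_genera - true_genera)   # false positives
fn = len(true_genera - predicted_genera)   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # == 2*tp / (2*tp + fp + fn)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # -> 1.0 0.857 0.923
```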
However, different choices of threshold typically lead to very different results. The top panels of Figure show the empirical taxa abundance distribution of the detected genera, together with the distributions obtained if one considers only genera accounting for more than . %, . % and % of the total number of sample reads, respectively. Moreover, looking at the empirical pattern, one can notice the main gap between genera covering a fraction of less than ·10^− of the total number of reads (black points) and those covering a fraction higher than ·10^− (green points), the latter corresponding to the genera actually present in the artificial community. One could therefore hope that, whenever such a gap is detected in the taxa abundance distribution, it corresponds to the gap between false-positive and truly present taxa. However, as will become clear in the following section, this is not the case, and it is not possible to set a relative threshold for the shotgun methods that works for all the mock communities.

Table . F1 score, R² values and number of predicted genera. For all three analysed mock communities, we report the F1 score (twice the ratio between the product of recall and precision and their sum), the R² value of the linear fit performed between estimated and true abundances, and the number of predicted genera, Ĝ, for the various taxonomic methods. The true number of genera is G = 7 for each community.

We tested and compared standard Kaiju, Kraken 2 and core-kaiju also on the medium- and high-complexity mock bacterial communities from the 1st CAMI challenge ( ), in terms of biodiversity (recall, precision, F1 score, Ĝ) and abundance composition (R² of the linear fit). In Table we show the results for samples and of the high-complexity dataset (see Supporting Information for the results of the other samples). As we can see, core-kaiju strongly outperformed the other methods in terms of precision.
Indeed, it only slightly overestimated the true number of genera, by around taxa in sample and taxa in sample (see Table ), which is two orders of magnitude fewer than the other methods (which predicted > taxa). On the other hand, as also shown in the bottom panels of Figure , when using a relative threshold of % in standard Kaiju (or Kraken 2) to reduce the number of false-positive taxa, as suggested by the previous analysis on the small mock communities, the number of predicted taxa is in this case around , therefore strongly underestimating the real biodiversity of the samples. As for the recall, the performance of core-kaiju (values around %) stands between standard Kaiju (values around %) and Kraken 2 (values around %). The combination of recall and precision led to an F1 score of around %, much higher than for the other two pipelines ( %). Finally, as shown in Figure , core-kaiju also gave a very good estimate of the microbial composition, with an R² for the fit between theoretical and predicted relative abundances above . , a value comparable to standard Kaiju and much higher than the one obtained with Kraken 2 ( . ). In the Supporting Information we present the results for the other high-complexity samples, as well as the analyses performed on the medium-complexity challenge dataset and the sensitivity of the classification to the absolute threshold. We finally applied the core-kaiju taxonomic classification method to an empirical dataset. We analysed a cohort of healthy human fecal samples from the study ( ) (metagenomic sequencing data are publicly available at the NCBI SRA under accession number SRP ). We applied standard Kaiju and found, on average over the samples, bacterial genera. A similar overestimation of the number of taxa found by Kaiju would also be obtained with Kraken 2, highlighting the above-mentioned problem of setting the correct threshold in order to obtain a realistic estimate of the sample biodiversity.
the right panel of figure shows the empirical taxa abundance distribution of one individual (sample id: srr ). as we can see, in this case the only apparent gap occurs between relative abundances of less than − and those above . , with only one genus above it. it therefore seems quite unrealistic that all the taxa but one should be considered false positives. the same plot shows the vertical lines corresponding to relative-abundance thresholds of . %, . % and %, above which we have , and taxa, respectively. in contrast, with core-kaiju we did not need to tune a relative threshold. instead, by removing false positives through the (fixed) absolute read-count threshold, we ended up with genera (orange diamonds in figure ), which is compatible with previous estimates. in fact, the available amplicon-sequencing datasets from stool samples of healthy participants of the human microbiome project ( ) suggest that there are on average different bacterial genera per sample (based on samples with at least > k reads per sample using % otu clustering).

[figure caption: the red triangle corresponds to the unique false-negative genus (e. coli) undetected with the newly proposed method. dashed lines represent relative-abundance thresholds on standard kaiju output of . %, . % and %, respectively, which would have led to biodiversity estimates of , and genera, respectively. imposing an absolute abundance threshold of twenty reads on standard kaiju output directly would instead lead to an overestimation of genera. bottom panels: the same analyses have been performed on the cami high-complexity sample . again, green diamonds represent the out of genera present in the community and correctly detected by our pipeline. in this case, in addition to the remaining false-negative genera (red triangles), we also have false-positive genera, here represented by gray triangles. setting a threshold on the relative abundance of reads produced by standard kaiju gives a number of genera of for the . %, for the . % and for the % threshold, respectively. left and right panels represent, respectively, log-log absolute frequency and cumulative patterns of the taxa abundances in the mock communities.]

however, in terms of taxa composition, core-kaiju predicted abundances are different from those obtained using s classification methods ( ). an important source of errors in the performance of any algorithm working on shotgun data is the high level of plasticity of bacterial genomes, due to widespread horizontal transfer ( , , , , , ). indeed, most highly abundant gene families are shared and exchanged across genera, making them both a confounding factor and a computational burden for algorithms attempting to extract species presence and abundance information. thus, while having access to the sequences from the whole metagenome is very useful for functional characterization, restriction to a smaller set of families may be a very good idea when the goal is to identify the taxa present and their abundances. to summarize, we have presented a novel method for the taxonomic classification of microbial communities which exploits the respective advantages of both whole-genome and s rrna pipelines. indeed, while the former approaches are recognised to better estimate the relative taxa composition of samples, the latter are much more reliable in predicting the true biodiversity of a community, since the comparison between taxa-specific hyper-variable regions of the bacterial s ribosomal gene and comprehensive reference databases generally makes it possible to avoid the phenomenon of false-positive taxa detection. indeed, the identification of a threshold in shotgun methods to remove most of the false positives is a critical problem, because in general the true taxa composition is not known, and thus setting the wrong threshold may lead to a huge over- (or under-) estimation of the sample biodiversity, as shown in this work.

[table caption: table . performance comparison on cami high-complexity samples and . in the first columns we report the values of the precision, the recall, the f score, the r value of the linear fit performed between estimated and true abundances, and the number of predicted genera Ĝ with core-kaiju, standard kaiju and kraken . the true number of genera is g = for each sample. in the last column we also report the number of genera one would predict with standard kaiju and kraken by setting a relative threshold of %, i.e., by considering as false positives all those genera having a relative abundance of less than in the sample. we denote this quantity by Ĝ%.]

[figure caption: figure . linear fit between theoretical and predicted relative abundances with core-kaiju. red points represent the relative abundances predicted at the genus level by core-kaiju on samples and from the cami highly-complex dataset versus the ground-truth abundances, known a priori. the green line is the linear fit performed on such values which, in the case of perfect matching between data and core-kaiju output, should coincide with the quadrant bisector (dotted red line). in both cases, the predicted community composition was satisfactorily captured by our method, with a correlation with the real taxa abundances of r = . and r = . for samples and , respectively.]

inspired by the role of the s gene as a taxonomic fingerprint and by the knowledge that proteins are more conserved than dna sequences, we proposed an updated version of kaiju, an open-source program for the taxonomic classification of whole-genome high-throughput sequencing reads, in which sample metagenomic dna sequences are first converted into amino-acid sequences and then compared to microbial protein reference databases.
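the contrast drawn here between a tunable relative-abundance threshold and a fixed absolute read-count cutoff can be sketched as follows (taxon names and read counts are invented for illustration):

```python
def filter_relative(read_counts, threshold):
    """keep taxa whose share of all reads exceeds `threshold` (a fraction)."""
    total = sum(read_counts.values())
    return {t: n for t, n in read_counts.items() if n / total > threshold}

def filter_absolute(read_counts, min_reads):
    """keep taxa supported by more than `min_reads` reads (the core-kaiju style cutoff)."""
    return {t: n for t, n in read_counts.items() if n > min_reads}

# invented example: one dominant genus plus a tail of low-count assignments
counts = {"genus_a": 5000, "genus_b": 800, "genus_c": 45, "genus_d": 12, "genus_e": 3}
kept_relative = filter_relative(counts, 0.01)   # 1% relative cutoff
kept_absolute = filter_absolute(counts, 20)     # fixed 20-read cutoff
```

note how the relative cutoff scales with sequencing depth, so the same fraction can discard genuinely present low-abundance taxa in deep samples, while the absolute cutoff does not.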
we identified a class of ten domains, here denoted core pfams, which, analogously to the s rrna gene, on the one hand are present in the overwhelming majority of proteomes, therefore covering the whole domain of known bacteria, and on the other hand occur just a few times in each of them, thus allowing for the creation of a novel reference database in which a fast search can be performed between sample reads and pfam amino-acid sequences. tested against mock microbial communities of different levels of complexity, generated in other studies ( , ) and available online, the proposed updated version of kaiju, core-kaiju, outperformed popular s rrna and shotgun methods for taxonomic classification in the estimation of both the total biodiversity and the taxa relative abundance distribution. in fact, by fixing an absolute threshold with core-kaiju (by only considering abundances greater than twenty reads), we are able to correctly classify the biodiversity in all samples of different size and complexity, while keeping a very good performance in the prediction of taxa abundances. we highlight that other technologies exist beyond metagenomics or s amplicons on a miseq (an integrated instrument performing clonal amplification and sequencing), for example pacbio ( ). earl and collaborators ( ) used a cami dataset to test the accuracy of this method, and it is therefore possible to indirectly compare core-kaiju with pacbio through their results. also in this case we found that our method gives a slightly higher r score for the genera abundance composition, confirming the competitiveness of core-kaiju even against a long-read technology such as pacbio. however, a deeper comparison with these methods goes beyond the scope of this work because, although they might perform better than miseq next-generation sequencing approaches, they are quite rare and available only at a much higher price.
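the selection criterion for the core pfams, near-universal presence across proteomes combined with a low copy number within each proteome, can be sketched as follows (the thresholds and the toy occurrence data are illustrative assumptions, not the values used in the paper):

```python
def select_core_pfams(occurrence, n_proteomes, min_coverage=0.9, max_mean_copies=4.0):
    """select domains that (i) appear in at least `min_coverage` of all proteomes
    and (ii) occur only a few times per proteome on average. `occurrence` maps a
    pfam id to the list of per-proteome copy numbers (zeros omitted)."""
    core = []
    for pfam, copies in occurrence.items():
        coverage = len(copies) / n_proteomes
        mean_copies = sum(copies) / len(copies)
        if coverage >= min_coverage and mean_copies <= max_mean_copies:
            core.append(pfam)
    return core

# toy data over 10 proteomes: a ubiquitous low-copy domain, a ubiquitous
# high-copy (promiscuous) family, and a rare domain
occ = {
    "pf_core": [1, 1, 2, 1, 1, 1, 2, 1, 1, 1],
    "pf_promiscuous": [40, 55, 38, 61, 47, 52, 44, 50, 39, 58],
    "pf_rare": [1, 1],
}
core = select_core_pfams(occ, n_proteomes=10)
```

the two-sided criterion mirrors the s rrna rationale: the promiscuous, heavily exchanged families that confound shotgun classifiers are excluded precisely because of their high copy number.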
our promising results pave the way for the application of the newly proposed pipeline in the field of microbiota-host interactions, a rich and open research field which has recently attracted the attention of the scientific world due to the hypothesised connection between the human microbiome and healthy/diseased states ( , ). having a trustworthy tool for the detection of microbial biodiversity, as measured by the number of genera and their abundances, could have a fundamental impact on our knowledge of human microbial communities and could therefore lay the foundations for the identification of the main ecological properties modulating the healthy or ill status of an individual, which, in turn, could be of great help in preventing and treating diseases on the basis of the observed patterns.

[figure caption: estimates from a reference cohort of stool microbiomes ( ) from healthy hmp participants ( s v -v region, > k reads per sample, % otu clustering) report an average number of genera per sample of (max = , min = ) ( ). setting a threshold on the relative abundance of reads produced by standard kaiju gives a number of genera of for the . %, for the . % and for the % threshold, respectively. in contrast, considering as false positives all genera with twenty reads or fewer in standard kaiju output, we end up with genera. orange diamonds in the plot correspond to the genera detected with core-kaiju, a number compatible with the reported estimates. left and right panels represent log-log absolute frequency and cumulative patterns, respectively.]

all data and codes used for this study are available online or upon request to the authors. raw data for the three in-silico mock communities ( ) are publicly available at the sequence read archive (sra) portal of ncbi under accession number srp . metagenomic sequencing data of the healthy human fecal samples from the study ( ) are publicly available at the ncbi sra under accession number srp .
cami medium and high complexity datasets are available at https://data.cami-challenge.org/participate upon request.

this work was supported by the stars grant unipd react to s.s. mcl, s.s., and a.k. acknowledge the cariparo foundation visiting program.

conflict of interest: none declared.

references (titles only):
- the human microbiome project
- the human microbiome project: a community resource for the healthy human microbiome
- tara oceans studies plankton at planetary scale
- viral to metazoan marine plankton nucleotide sequences from the tara oceans expedition. scientific data
- emergent simplicity in microbial community assembly
- the application of ecological theory toward an understanding of the human microbiome
- universality of human microbial dynamics
- community ecology as a framework for human microbiome research
- the integrative human microbiome project
- the human intestinal microbiome in health and disease
- the role of microbiome in central nervous system disorders
- structure, function and diversity of the healthy human microbiome
- shotgun sequencing of the human genome
- microbial community profiling for human microbiome projects: tools, techniques, and challenges
- phylophlan is a new method for improved phylogenetic and taxonomic placement of microbes
- large-scale differences in microbial biodiversity discovery between s amplicon and shotgun sequencing
- predictive functional profiling of microbial communities using s rrna marker gene sequences
- evaluation of general s ribosomal rna gene pcr primers for classical and next-generation sequencing-based diversity studies
- incorporating s gene copy number information improves estimates of microbial diversity and abundance
- quantitative microbiome profiling links gut community variation to microbial load
- copyrighter: a rapid tool for improving the accuracy of microbial community profiles through lineage-specific gene copy number correction
- microbiology: metagenomics
- evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities
- kraken: ultrafast metagenomic sequence classification using exact alignments
- improved metagenomic analysis with kraken . genome biology
- characterization of the gut microbiome using s or shotgun metagenomics
- fast and sensitive taxonomic classification for metagenomics with kaiju
- metagenomic s rdna illumina tags are a powerful alternative to amplicon sequencing to explore diversity and structure of microbial communities
- diversity of thermophiles in a malaysian hot spring determined using s rrna and shotgun metagenome sequencing
- strengths and limitations of s rrna gene amplicon sequencing in revealing temporal microbial community dynamics
- the silva ribosomal rna gene database project: improved data processing and web-based tools
- the silva and all-species living tree project (ltp) taxonomic frameworks
- greengenes, a chimera-checked s rrna gene database and workbench compatible with arb
- an improved greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea
- ribosomal database project: data and tools for high throughput rrna analysis
- metaphlan for enhanced metagenomic taxonomic profiling
- dada : high-resolution sample inference from illumina amplicon data
- reproducible, interactive, scalable and extensible microbiome data science using qiime
- joint scaling laws in functional and evolutionary categories in prokaryotic genomes
- cross-species gene-family fluctuations reveal the dynamics of horizontal transfers
- family-specific scaling laws in bacterial genomes
- statistics of shared components in complex component systems
- the pfam protein families database in
- critical assessment of metagenome interpretation: a benchmark of metagenomics software
- metagenomic microbial community profiling using unique clade-specific marker genes
- deblur rapidly resolves single-nucleotide community sequence patterns
- analysis of the intestinal microbiota using solid s rrna gene sequencing and solid shotgun sequencing
- estimating the size of the bacterial pan-genome
- zipf and heaps laws from dependency structures in component systems
- universal distribution of component frequencies in biological and technological systems
- a neutral theory of genome evolution and the frequency distribution of genes
- gene frequency distributions reject a neutral model of genome evolution
- uniprot: a hub for protein information
- introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities
- inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric crohns disease
- the phylogenetic forest and the quest for the elusive tree of life. cold spring harbor symposia on quantitative biology
- search for a 'tree of life' in the thicket of the phylogenetic forest
- the tree and net components of prokaryote evolution
- genome-wide comparative analysis of phylogenetic trees: the prokaryotic forest of life
- genomic fluidity: an integrative view of gene diversity within microbial populations
- pacbio sequencing and its applications. genomics
- species-level bacterial community profiling of the healthy sinonasal microbiome using pacific biosciences sequencing of full-length s rrna genes. microbiome
- the gut microbiome in health and in disease
- rakoff-nahoum s. the evolution of the host microbiome as an ecosystem on a leash

key: cord- - n pojb
authors: zullo, federico
title: some numerical observations about the covid- epidemic in italy
date: - -
journal: nan
doi: nan
cord_uid: n pojb

we give some numerical observations on the total number of infected by the sars-cov- in italy. the analysis is based on a tanh formula involving two parameters. a polynomial correlation between the parameters gives an upper bound for the time of the peak of new infected. a numerical indicator of the temporal variability of the upper bound is introduced. the result and the possibility to extend the analysis to other countries are discussed in the conclusions.
the coronavirus disease (covid- ) has been recognized in italy starting from the st of january , even if there is some evidence that the first cases appeared some time earlier [ ] . the spread of the disease started from the northern regions of italy, lombardy and veneto, on the st of february : after about twenty positive cases, two small areas were put in quarantine but, since then, the number of infected has increased exponentially in these two districts. from the localities situated farthest south in lombardy, the disease spread to the closest regions, emilia-romagna above all, with the provinces of piacenza, parma, modena and rimini, but soon after also to other provinces of lombardy (e.g., the province of bergamo). on the th of march a lockdown for the entire region of lombardy (with a population of about million inhabitants) and fourteen other provinces (population of about six million inhabitants) was imposed by the central government of italy, and then extended, on the th of march, to the entire country. up to the rd of march, the total number of infected is , of which in the lombardy region. since the start of the epidemic in china, a number of studies have appeared in the mathematical community on this subject: the description of the spatial or temporal diffusion of the infected in given regions [ ] , [ ] - [ ] , the transmission dynamics of the infection [ ] , the economic and financial consequences of the epidemic [ ] , and the effect of atmospheric indicators on the spread of the virus [ ] are only a fraction of the topics under investigation in these days. one of the simplest non-linear deterministic continuous (in time) models of epidemiology is the sir model, in which the overall population is divided into three disjoint classes: s is the number of susceptible individuals, i the number of infectious individuals and r the number of recovered individuals.
despite its non-linearity, the dynamics of the model are fairly uncomplicated and manageable from an analytical point of view, and display very interesting properties, such as the existence of an epidemic threshold (see e.g. [ ] ), making it a very reasonable model. in section ( ) the sir model is introduced and briefly discussed; in section ( ) we analyze the data on the cumulative number of infected in italy on the basis of two simple hypotheses. an upper bound for the timing of the peak of the number of new infected is obtained. in the conclusions we comment on the results and look for possible extensions. the sir model describes the evolution of the individuals in the susceptible, infectious and recovered classes with the following differential equations:

\[ \frac{ds}{dt} = -r\,\frac{s\,i}{n}, \qquad \frac{di}{dt} = r\,\frac{s\,i}{n} - a\,i, \qquad \frac{dr}{dt} = a\,i. \]

the total population n = s + r + i is a conserved quantity from the dynamical point of view, meaning that there are only two independent variables in the set of equations ( ) . the characteristics of this model are well-known, and the interested reader can look, for example, at the discussions in the classical books of braun [ ] and murray [ ] . here we will make only a few observations, relevant for the next sections. some authors do not include the denominator n on the right-hand side of ( ), since it is a constant and can be absorbed by a re-definition of the parameter r. however, we will keep it: in this way the scaling property of this model is evident: if the initial conditions (s , i , r ) are scaled by a common constant factor α (and so the total population is scaled by a factor α), the solution is scaled by the same factor. some temporal properties of this model, like the time corresponding to a maximum in i (the time of the peak of the infected), do not depend on the scaling. this property is very useful, since the actual number of infected or susceptible (and hence of recovered) individuals is in general not known.
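the equations and the scaling property can be checked numerically; a minimal forward-euler sketch (the parameter values r, a and the initial conditions are illustrative only):

```python
def sir(s0, i0, r0, r_rate, a_rate, days, dt=0.01):
    """forward-euler integration of ds/dt = -r*s*i/n, di/dt = r*s*i/n - a*i,
    dr/dt = a*i; returns the final (s, i, r)."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # conserved total population
    for _ in range(int(days / dt)):
        new_inf = r_rate * s * i / n * dt
        new_rec = a_rate * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return s, i, r

# illustrative parameters only; doubling the initial conditions doubles the
# whole solution, which is the scale invariance exploited in the analysis
out1 = sir(10_000, 10, 0, r_rate=0.3, a_rate=0.1, days=100)
out2 = sir(20_000, 20, 0, r_rate=0.3, a_rate=0.1, days=100)
```

the euler step is homogeneous of degree one in (s, i, r), so scaling the initial state by any factor α scales the entire trajectory by α while leaving the timing of the peak in i unchanged.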
the reasonable assumption that the same fraction (with respect to the total) of infected, susceptible and recovered individuals is known gives the possibility, in this case, of comparing the measured data with the properties that are scale-independent. the solution of the system ( ) can be written in terms of just one variable: if r(t) is known, i(t) can be obtained from the third equation and s(t) from the first one, or from the constraint n = s(t) + i(t) + r(t). if the epidemic is not severe (the number r(t) can be considered small compared to the overall population), an explicit formula for the number of recovered can be obtained in terms of the hyperbolic tangent function. the function reads

\[ r(t) = \alpha\left(\tanh(\beta t - c) + \tanh(c)\right), \]

where we used the initial value r(0) = 0, and the parameters (α, β, c) can be made explicit in terms of the parameters appearing in ( ) . the previous is one example of the so-called s-shaped epidemiological curve (with a "peaked" derivative, the function sech² ) that universally describes an infectious disease. the number of infected can be obtained by derivation, i.e., i(t) = (1/a) dr/dt = (αβ/a) sech²(βt − c). when considering the cumulative number of infected, r + i, the contribution of the sech² term is negligible on the tails, whereas it is more pronounced in correspondence of the maximum of sech², but it is in any case small if the value of the parameter β is less than one. in this case, the value of r + i is well approximated by a tanh formula like ( ), with a different value of c. this is what we will assume in the next section. it is clear that the description made in the previous section is very basic. however, it has the advantage of being manageable and of incorporating the main properties of the sir model. it is not by chance that the first application of the sir model (to the bombay plague of ) by kermack and mckendrick [ ] used precisely the tanh formula above.
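the tanh curve and its peaked derivative can be written down and checked numerically; a small sketch (with illustrative parameter values, not fitted to the italian data) verifying that r(0) = 0 and that the daily-new-cases curve αβ sech²(βt − c) peaks at t = c/β:

```python
import math

def r_cum(t, alpha, beta, c):
    """cumulative curve r(t) = alpha*(tanh(beta*t - c) + tanh(c)), so r(0) = 0."""
    return alpha * (math.tanh(beta * t - c) + math.tanh(c))

def new_cases(t, alpha, beta, c):
    """its derivative, alpha*beta*sech(beta*t - c)**2: the 'peaked' epidemic curve."""
    return alpha * beta / math.cosh(beta * t - c) ** 2

# illustrative parameters only
alpha, beta, c = 50_000.0, 0.16, 4.0
t_peak = c / beta                            # the derivative is maximal here
final_size = alpha * (1 + math.tanh(c))      # the saturation value of r(t)
```

the s-shape, its single inflection point at t = c/β, and the saturation level α(1 + tanh c) are the three features the data analysis below relies on.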
in the following we will base our analysis on two hypotheses: (i) we assume that the cumulative number of infected is described by a tanh model. this assumption is independent of the underlying dynamical model considered, but may be justified on the basis of some of them (e.g., the sir model, as shown in the previous section). (ii) we assume that, whatever the underlying model describing the evolution of the number of infected, this model is scale invariant, in the sense specified in the previous section. the second hypothesis is fundamental, since we are going to look at scale-independent quantities: even in the case that the measured numbers of infected and recovered individuals are different from the actual values, it is possible to estimate these quantities. the cumulative total number of infected considered in the next lines is that of the entire italian territory. there are at least two reasons that suggested not taking regional or local data: the first one is that the epidemic started to spread across three different regions (lombardy, veneto and emilia-romagna), and there may not be a correspondence between the locality where a certain fraction of inhabitants reside and the region where this fraction was infected. this is also true at a national level, but the fraction is assumed to be smaller. the second reason is that a non-negligible number of workers and students moved, just before the lockdown, from the regions in the north of italy to their regions of origin in the center and south of italy. the possibility that a non-negligible flow of infected people passed from the north to other regions should be taken into consideration. by taking the entire national set of data we bypass the above issues. the data can be taken, for example, from the italian protezione civile [ ] or from the who [ ]. the cumulative total number of infected will be indicated by f_n , with f_ corresponding to the number of infected on the st of february .
the subscript n stands for the number of days from the start of the epidemic. these data will be compared with the continuous formula

\[ f(t) = \alpha\left(\tanh(\beta t - c) + \tanh(c)\right). \]

the value of β will be taken to be constrained by the equation ( ). the function f ( ) then depends on two parameters, α and c. when necessary, to stress the dependence on these parameters, we will denote the function by f_{α,c}(t). the cumulative final number of infected expected from formula ( ) is given by f_∞ = α(1 + tanh(c)). it is possible to estimate the parameters α and c by minimizing the difference between the actual and predicted numbers of cases, i.e., by minimizing

\[ s_n = \sum_{k=1}^{n} \left(f_k - f_{\alpha,c}(k)\right)^2. \]

in order to have a reasonable minimum number of data points, we start the analysis by taking n ≥ . the values of the parameters minimizing the sum s_n are reported in table ( ). at the st day of the epidemic the final number of infected is estimated to be , but this is only a lower bound, since the values of α_n and c_n are increasing. a plot of f_{α,c}(t) and of the cumulative number of infected is reported in figure ( ) . a fundamental observation is that the function s_n actually has a basin of depressed values, shown in detail in figure ( ) for a given value of n. this basin of minima seems to indicate that there is a function α(c), giving a family of tanh curves that fit the data comparably well. clearly, by considering a number n of values of α_n and c_n to fit the coefficients a_k, k = ,..., , of the cubic

\[ \alpha(c) = \sum_{k=0}^{3} a_k\, c^k, \]

we will obtain a set of values {a_{k,n}}. by fitting all the data available (i.e., by taking n = ), we get the values for the coefficients a_k given in equation ( ) . it is possible to get more terms in the sum ( ), but the cubic term is sufficient to obtain a formula accurate enough for what follows. the plot of the fit is given in figure ( ), together with the values of the residuals α_n − Σ_k a_k c_n^k, where the values a_k are those given in equation ( ) .
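the minimization of s_n can be sketched with a brute-force least-squares search; in the sketch below the constraint on β is replaced by a fixed assumed value, and the "observed" series is synthetic (generated from known parameters plus noise), standing in for the official cumulative counts:

```python
import numpy as np

BETA = 0.16  # the paper fixes beta through a separate constraint; value assumed here

def f(t, alpha, c):
    """cumulative infected f(t) = alpha*(tanh(BETA*t - c) + tanh(c)), beta held fixed."""
    return alpha * (np.tanh(BETA * t - c) + np.tanh(c))

def s_n(F, alpha, c):
    """sum of squared differences between the observed series F and the model."""
    days = np.arange(1, len(F) + 1)
    return np.sum((F - f(days, alpha, c)) ** 2)

# synthetic 'observed' series: known parameters (60000, 4.0) plus 1% noise
rng = np.random.default_rng(0)
days = np.arange(1, 40)
F = f(days, 60_000.0, 4.0) * (1 + rng.normal(0, 0.01, size=days.size))

# brute-force minimization of s_n over a grid of (alpha, c) pairs
alphas = np.linspace(30_000, 90_000, 121)
cs = np.linspace(2.0, 6.0, 161)
sse = np.array([[s_n(F, a, c) for c in cs] for a in alphas])
ia, ic = np.unravel_index(np.argmin(sse), sse.shape)
alpha_hat, c_hat = alphas[ia], cs[ic]
```

plotting `sse` over the (α, c) grid reproduces the elongated "basin" of nearly-equivalent minima discussed above: many (α, c) pairs along a curve fit the early data comparably well, which is exactly why the α(c) trend is informative.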
a comparison between the curve α(c) and the basin of minima for s_n is plotted in figure ( ): the red curve is the function ( ), with the black dots giving the actual values of (c_n , α_n ) in table ( ) . the function α(c) denotes a trend in the data that may be useful. if in the next days the values of the infection continue to rise, it is reasonable to expect that the values of α and c will remain closely constrained by the same curve. clearly the model used here is rough, but it can give at least an idea of the future trend of the data. we are tacitly assuming that there will be no other clusters of infection around italy in the next days: this point will be discussed later. now we consider the function f in ( ) as a function of t and c alone, since the value of α is constrained by the curve ( ) . the plot of the derivative of this function (with respect to t) gives the time of the peak of infections. the plot is reported in figure ( ): we notice that the maximum of the derivative of the cumulative number of infected increases up to c ∼ . and then decreases as c increases. this gives an upper bound for the peak of the number of new infected (the point where the second derivative of f(t) ( ) is zero), given by days after the first infection. in the next days (at the moment of writing we are exactly at the th of march, exactly at days) the description above can be tested. the above analysis, despite using a rough function for the total number of infected, is able to give an upper bound for the time of the peak of new infected (around the rd of march) thanks to the observation that the values of α_n are, in a certain sense, not independent of the values of c_n and are well described by a polynomial interpolation. the hypothesis of scale invariance of the underlying model (which, we repeat, is not necessarily the sir model) is fundamental for the accuracy of the result.
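the way the upper bound is extracted, namely constraining α by the cubic trend α(c) and then locating the value of c that maximizes the peak of the derivative αβ sech²(βt − c) (whose position is t = c/β), can be sketched as follows (β and the coefficients a_k are invented for illustration, since the fitted values are not reproduced here):

```python
import numpy as np

BETA = 0.16                               # assumed fixed value of beta
A = [500.0, 3000.0, 400.0, -80.0]         # illustrative cubic coefficients a_0..a_3

def alpha_of_c(c):
    """cubic trend alpha(c) = sum_k a_k c^k fitted to the daily (c_n, alpha_n) pairs."""
    return sum(a * c ** k for k, a in enumerate(A))

def peak_height(c):
    """maximum over t of df/dt for f_{alpha(c),c}: equals alpha(c)*beta, at t = c/beta."""
    return alpha_of_c(c) * BETA

cs = np.linspace(0.5, 8.0, 751)
c_star = cs[np.argmax(peak_height(cs))]   # peak height rises, then falls, with c
t_upper = c_star / BETA                   # upper bound for the day of the peak
```

with a negative cubic coefficient the peak height has an interior maximum in c, and the corresponding t = c*/β is the latest peak day compatible with the α(c) trend.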
another underlying assumption is that the restrictive measures will be kept and observed in the coming days, and that there will be no other clusters in the south of italy (in the language of the sir model, the values of s are below the epidemic threshold, see e.g. [ ] ). in the unfortunate case that other clusters do appear, it is possible to replace the tanh curve with a combination of such functions: if there are two clusters of comparable magnitude, then we will have

\[ f(t) = \alpha_1\left(\tanh(\beta_1 t - c_1) + \tanh(c_1)\right) + \alpha_2\left(\tanh(\beta_2 t - c_2) + \tanh(c_2)\right). \]

in a forthcoming paper other sets of data will be analyzed, and the above analysis will be updated if necessary.

references (titles only):
- coronavirus and oil price crash
- differential equations and their applications: an introduction to applied mathematics
- early phylogenetic estimate of the effective reproduction number of sars-cov-
- analysis and forecast of covid- spreading in china
- weifeng lv: high temperature and high humidity reduce the transmission of covid-
- early dynamics of transmission and control of covid- : a mathematical modelling study
- a contribution to the mathematical theory of epidemics
- data analysis for the covid- early dynamics in northern italy
- data analysis for the covid- early dynamics in northern italy. the effect of first restrictive measures
- modelling and predicting the spatio-temporal spread of coronavirus disease (covid- ) in italy

key: cord- -z zwdxrk
authors: hittner, j. b.; fasina, f. o.; hoogesteijn, a. l.; piccinini, r.; kempaiah, p.; smith, s. d.; rivas, a. l.
title: early and massive testing saves lives: covid- related infections and deaths in the united states during march of
date: - -
journal: nan
doi: . / . . .
cord_uid: z zwdxrk

to optimize epidemiologic interventions, predictors of mortality should be identified. the us covid- epidemic data, reported up to march , were analyzed using kernel regularized least squares regression.
six potential predictors of mortality were investigated: (i) the number of diagnostic tests performed in testing week i; (ii) the proportion of all tests conducted during week i of testing; (iii) the cumulative number of (test-positive) cases through - - ; (iv) the number of tests performed/million citizens; (v) the cumulative number of citizens tested; and (vi) the apparent prevalence rate, defined as the number of cases/million citizens. two metrics estimated mortality: the number of deaths and the number of deaths/million citizens. while both expressions of mortality were predicted by the case count and the apparent prevalence rate, the number of deaths/million citizens was ≈ . times better predicted by the apparent prevalence rate than by the number of cases. in eighteen states, early testing/million citizens/population density was inversely associated with the cumulative mortality reported by march , . the findings support the hypothesis that early and massive testing saves lives. other factors -e.g., population density− may also influence outcomes. to optimize national and local policies, the creation and dissemination of high-resolution geo-referenced epidemic data is recommended.

to control a pandemic associated with a substantial mortality −such as covid- −, the who recommends massive testing [ ] . in spite of its relevance, the power of testing-related variables to predict mortality has not yet been empirically investigated for this disease. to predict and identify when and where mortality is likely to occur, at least three types of metrics may be considered, which focus on: (i) cases (counts); (ii) disease prevalence in a specific geographic location and/or time; and (iii) the demographic density of infected locations [ ]. however, assessing the actual prevalence of a disease characterized by a substantial number of asymptomatic infections -such as covid- − is not possible unless % of the population is tested with a highly sensitive test, repeatedly [ , ] . consequently, we use the term apparent prevalence to describe the ratio of test-positive cases to all tested individuals. if expressed per million residents, the apparent prevalence can compare different geographical units, e.g., each and every state of the us. unfortunately, conducting comprehensive studies that investigate numerous states requires a protracted research program.
to rapidly provide policy-makers with usable information, here a quasi-real time assessment was designed, which captures both nationwide and state-specific dimensions. analyzing the epidemic data reported in all states of the usa, during march of (the month when testing started), we investigated whether testing-related variables -including massive and early testing− predict mortality. six variables were assessed as possible predictors of fatalities: (i) the number of the first week of testing; (iii) the cumulative number of (test-positive) cases through - - , (iv) the number of tests performed/million citizens; (v) the cumulative number of citizens tested; and (vi) the apparent prevalence rate, defined as the number of cases/million citizens. to . cc-by-nc-nd . international license it is made available under a is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. (which was not certified by peer review) the copyright holder for this preprint this version posted may , . . https://doi.org/ . / . . . doi: medrxiv preprint examine the predictive ability of these variables, we modeled the data using a nonparametric machine learning approach known as kernel regularized least squares (krls) regression [ ] . to implement the procedure we used the krls r software package [ ] . krls is appropriate when linear regression assumptions −such as linearity and additivity− are not met and the precise functional association between the predictors and criterion is unknown. because there is no prior knowledge on the use of these composite variables, no pre- established method or criterion was chosen to analyze the data. instead, recognition of patterns observed after the data were collected was adopted. 
when distinct patterns were observed -such as l-shaped data distributions [ ]−, thresholds were selected to match the upper limit of a linearly distributed data segment, so that the intersection of two orthogonal lines would identify three groups of data. a public source was used to collect the overall us and state-specific data on the covid- pandemic, which was complemented with state-specific population data [ , ] . all analyses included data from each state of the us (supplemental table ) . the six predictors accounted for . % of the variance in the number of deaths and . % of the variance in deaths/million cases (supplemental tables a, b) . of the six predictors, two were statistically significant: the cumulative number of confirmed cases and the apparent prevalence rate. these two variables were comparable predictors of the mortality count. however, for predicting deaths per million citizens, the apparent prevalence rate was a . times stronger predictor than the number of confirmed cases (supplemental table b) . in addition, the number of tests administered during week one of testing/million citizens/population density distinguished three groups of states when the number of deaths/million citizens was the outcome variable ( fig. a) . two of these groups exhibited statistically significantly different medians (p< . , mann-whitney test, fig. b) . whether cases or fatalities are considered, the findings indicate that reporting covid- data as counts is not as informative as reporting metrics that consider two or more interacting quantities, such as the apparent prevalence rate and the number of deaths/million citizens.
while isolated metrics -e.g., counts− ignore dynamics as well as geographical factors (including population density), composite metrics integrate numerous dimensions that facilitate geographically-specific interventions [ ] . although the krls regression method is a powerful and flexible approach to modeling predictive associations, to rapidly generate results it was used here to provide only a snapshot-like assessment. if shorter time intervals were used, the krls approach could capture epidemic dynamics. as evidenced by our nonparametric regression results, the variables analyzed offer a combinatorial template that highlights the importance of investigating metrics consisting of interacting quantities. for example, a recombination of those variables (the number of tests performed in week i/million citizens/population density) empirically demonstrates that massive and early testing may save lives (figs. a and b) . such a finding is likely to also be influenced by several factors, including, but not limited to, (i) availability of diagnostic kits, equipment, reagents, and trained personnel, (ii) availability of hospital beds and/or intensive care units, and (iii) local and regional demographic and geographical interactions. for example, regions with a higher population density (more abundant and closer contacts among infected and susceptible citizens) tend to be associated with a higher connectivity (more highways, ports and/or airports), which fosters epidemic spread [ ] . while composite metrics could address pandemics as a group of local and regional interacting processes, the covid- related information currently found in the press as well as in national and international governmental agencies tends to lack point-based (high-resolution), geo-referenced information. while surface-based data are usually provided (e.g., state-related data), this type of data is an aggregate of geographical points and lines. 
wk i: number of tests performed in the first days of testing.
total tested: total number of people tested. 
wk i / all tests: tests wk i / total tested, i.e., the proportion of all tests that were conducted during the first week of testing, expressed as a percentage. 
tested/mill: number of tests performed per million inhabitants. 
cases: cumulative number of confirmed (test-positive) infections. 
cases/mill inh: the apparent prevalence, calculated by dividing the number of cases by the population (expressed in million inhabitants). 
outcomes (cumulative values through - - ): 
mortality count: number of deaths. 
deaths/mill: number of deaths per million citizens. 
key: cord- -gr zq o authors: ghosh, subhas kumar; ghosh, sachchit; narumanchi, sai shanmukha title: a study on the effectiveness of lock-down measures to control the spread of covid- date: - - journal: nan doi: nan sha: doc_id: cord_uid: gr zq o the ongoing pandemic of coronavirus disease - (covid- ) is caused by severe acute respiratory syndrome coronavirus (sars-cov- ). this pathogenic virus is able to spread asymptomatically during its incubation stage through a vulnerable population. given the state of healthcare, policymakers were urged to contain the spread of infection, minimize stress on the health systems and ensure public safety. the most effective tool at their disposal was to close non-essential businesses and issue a stay-home order. in this paper we consider techniques to measure the effectiveness of stringency measures adopted by governments across the world. analyzing the effectiveness of control measures like lock-down allows us to understand whether the decisions made were optimal and resulted in a reduction of the burden on the healthcare system. specifically, we consider using a synthetic control to construct alternative scenarios and understand what would have been the effect on health if less stringent measures were adopted.
we present analysis for the state of new york, united states, italy and the indian capital city delhi and show how lock-down measures has helped and what the counterfactual scenarios would have been in comparison to the current state of affairs. we show that in the state of new york the number of deaths could have been times higher, and in italy, the number of deaths could have been times higher by th of june, . in december , an outbreak occurred in wuhan, china involving a zoonotic coronavirus, similar to the sars coronavirus and mers coronavirus . the virus has been named severe acute respiratory syndrome coronavirus (sars-cov- ), and the disease caused by the virus has been named the coronavirus disease . since then the ongoing pandemic has infected more than million people and has caused more than thousand deaths worldwide. since the initial outbreak, several different studies have tried to estimate the number of infections that stem from a single infected patient in order to predict the potential for transmission of the covid- virus. in most cases, it was seen that r > , implying exponential growth through infection of a vulnerable population. original estimates placed mortality rates for individuals at high risk at . % with those suffering from cardiovascular or kidney disease having even greater susceptibility . the sars-cov- virus has no available treatment as the pathways for proliferation and pathogenesis are still unclear . current treatments are based on those effective on strains of the previous sars coronavirus and mers coronavirus. the sars-cov- virus is able to replicate rapidly during the asymptomatic phase and affect the lungs and respiratory tract, resulting in pneumonia, hypoxia, and acute respiratory distress . infected patients are directly dependent on external ventilation in most cases. 
with the increasing pressure on the health systems due to reliance on intensive care units or non-invasive ventilation, health strategies were required to be implemented. the concern was to ensure that the number of infected patients did not exceed the health system's ability to cope with it. it also focused on increasing the capacities of available health systems at the time. under the conditions at the time, with a highly pathogenic sars-cov- that is able to spread asymptomatically during its incubation stage through a vulnerable population, policymakers were urged to contain the spread of the infection, minimize stress on the health systems and ensure public safety. this was done by issuing orders for widespread lock-down and implementing social distancing measures. all non-essential businesses and services were shut down until further notice. taking measures to reduce stress on the health sector and diminishing the number of infected patients is important to end the pandemic, and understanding the effectiveness of a lock-down enables the distinction of good safety measures from bad ones. analyzing the effectiveness of control measures like lock-down allows us to understand whether the decisions made were optimal and resulted in a reduction of the burden on the healthcare system, and broke the chain of transmission, preventing its spread and reducing the reproductive rate of the virus. any optimal policy considers a trade-off between the benefit associated with lock-down and the cost of reduced aggregate output. aggregate output decreases as a function of the stringency of the policy, the commitment from the government to maintain the level of stringency and the adherence of the general population. aggregate output decreases through a lower supply of labor, lower consumption and hence through lower investment, which results from investors' expectation of a lower marginal product of capital.
on the other hand, the benefit associated with lock-down can be seen through the number of lives potentially saved and in curbing the pandemic early so that economic activity can be restarted early. our objective in this work is to understand the benefits obtained from stringency measures adopted by governments across the world in terms of their health benefits. in the remainder of this section we describe our contribution and related works. subsequently, in section we provide a brief overview and the mathematical underpinning of the tools that we use and describe our data-driven methodology, and in section we present our results in three different geographic setups. finally, in section we present some concluding remarks. in this work we consider stringency measures adopted by governments across the world and provide a counterfactual assessment of the benefit from those measures in terms of health benefits. in order to estimate the counterfactual metric (say, the number of deaths), we use a geographic location as a treatment unit (say, italy) and a set of other geographic locations as a donor group (say, brazil and the united states). we take a data-driven approach to construct a synthetic control - using pre-intervention period data (say, from early february, to march , ) of the donor units and their linear combination, such that the squared error between the estimated synthetic control and the treatment unit is minimized by the choice of weight parameters in this pre-intervention time period. the synthetic control can then be extrapolated to estimate the metric. we use multi-dimensional robust synthetic control (m-rsc) as a tool as described in . however, there are a few difficulties in applying the tool as it is. firstly, different governments adopted different levels of stringency measures and there were different levels of compliance and commitment. secondly, there were no 'pure' donor groups as stringency measures were nearly ubiquitous.
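The weight-fitting step described above (minimize the squared error between a linear combination of donor trajectories and the treatment unit over the pre-intervention period, then extrapolate) can be sketched as follows. This is a simplified ordinary-least-squares version, not the full m-RSC algorithm, which additionally denoises the donor data and handles multiple metrics:

```python
import numpy as np

def fit_synthetic_control(donors_pre, treated_pre):
    """Learn donor weights w minimizing ||donors_pre.T @ w - treated_pre||^2.
    donors_pre: (n_donors, T0) pre-intervention donor trajectories,
    treated_pre: (T0,) pre-intervention trajectory of the treatment unit."""
    w, *_ = np.linalg.lstsq(donors_pre.T, treated_pre, rcond=None)
    return w

def extrapolate(donors_full, w):
    """Counterfactual trajectory of the treatment unit over the full period."""
    return donors_full.T @ w
```

If the treatment unit really is (approximately) a fixed combination of the donors, the weights learned pre-intervention recover its counterfactual path after the intervention date.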
so we have used various secondary sources of data to score the level of stringency measures , and level of compliance. this allowed us to determine donor groups relative to a choice of treatment unit. finally, we present our results in three different geographic setup -namely in the state of new york, italy and indian city of delhi, and analyze them. in , authors have shown that the reproductive rate of the sars-cov- had significantly decreased after government intervention. they show that the spread of disease was confined if measures were brought into effect early. in , authors use the differential timing of the introduction of stringency measures and changes in google searches for unemployment claims to establish a framework to estimate how each stringency measure contributes to unemployment. authors show that early intervention efforts in the form of non-essential business closure have contributed to less than . percent of unemployment claims. another facet to measure the success of the lock-down is to observe its effects on the health systems. late intervention in the case of italy led to the flooding of hospitals and icus due to exponential spread. however, the national lock-down was effective in reducing the proliferation and decreased the stress on the national health system as observed by authors in . in their paper , authors extend the sir model to include auxiliary state variables in the form of hospital capacity, contact with an infected person, etc. they use a system dynamics model of the outbreak to simulate various lock-down scenarios with recommendations for optimal strategy. in our work we consider the possible outcome if such strategies were not adopted, and present counterfactual scenarios. international travel has also been impacted as a result of efforts to reduce the spread of the coronavirus disease. 
on the basis of reported cases, models built by show a significant decrease in the number of infections compared to predictions if no travel bans were adopted as an option. their modeling results indicate that travel restrictions must be combined with reductions in transmission within the community to curb the spread. in this section we present three examples of the application of m-rsc to derive counterfactual estimates of the possible number of deaths under changed conditions, like delaying the stringency measures or starting them at an earlier date. we consider three different units of treatment, namely: the state of new york, italy and the indian capital city delhi. in some sense these places have also been termed regional epicenters of the epidemic. new york has the highest number of confirmed cases in the united states. the first case in new york was reported on st march, and new york went into a stricter lock-down on march nd, . we estimate the counterfactual considering this as the date of intervention. we select the donor group from among the other us states using the methods described above. this includes new jersey, california, illinois, and florida among other states. by the counterfactual estimate, the number of deaths in new york could have been times higher, and the number of confirmed cases could have been times higher. italy was put under lock-down between th march, - th may, . we considered most european countries to model the donor group and selected based on the criteria defined above. based on our simulation, we observe that the lock-down measures have been largely successful in italy. without such measures, the number of confirmed cases could have been times higher and the number of deaths could have been times higher by th of june, . india had one of the strictest stay-home orders across the country in the first phase of the lock-down between march - april ( days), where an entire population of . billion people was put under restricted movement.
overall the lock-down had multiple phases; the second phase was from th of april to rd of may , and the third phase was from th of may to th of may, . we present the counterfactual for each of these dates. however, in this case we consider both the daily number of confirmed cases as well as the cumulative number of confirmed cases for phase three of the lock-down. we limit the donor group to all other states of india. figure shows that the counterfactual converges closely with the actual data by the third phase of the lock-down. it should be noted that there are a few discrepancies in reporting. first, there is a weekly seasonality -possibly due to a lesser number of reports over the weekends. second, there is a revised, higher number of reports on certain dates (high peak). a linear trend-line fit shows two change points in the growth of the number of cases -indicating that the exponential phase came at a much later date and the growth of the epidemic was under effective control in the earlier stages. since the stay-home order was applicable across all states and adherence was almost uniform, the trajectories of the actual and counterfactual remain nearly the same. in this work we use multi-dimensional robust synthetic control to understand the effects of stringency measures on the covid- pandemic. we construct a synthetic version of a location using a convex combination of other geographic locations in the donor pool that most closely resembled the treatment unit in the pre-intervention period, using the stringency index and adherence score (using mobility information). results have been compared for the state of new york, italy and delhi, india, comparing the actual metric to the counterfactual predicted by the algorithm. in order to assess the robustness of the predictor we have computed the mape and mdape measures and have shown their convergence to less than a % absolute error rate. in the future we would like to include additional predictors like testing data and virus strain information as they become available.
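The MAPE and MdAPE error statistics used above can be computed as follows (a minimal sketch; the series in the usage example are hypothetical, not the paper's data):

```python
import statistics

def ape(actual, forecast):
    # absolute percentage error for each (actual, forecast) pair
    return [abs(a - f) / abs(a) * 100.0 for a, f in zip(actual, forecast)]

def mape(actual, forecast):
    # mean absolute percentage error
    errs = ape(actual, forecast)
    return sum(errs) / len(errs)

def mdape(actual, forecast):
    # median absolute percentage error; robust to a few large misses,
    # so MAPE >> MdAPE signals a right-skewed error distribution
    return statistics.median(ape(actual, forecast))
```

For example, actuals of 100, 200, 400 against forecasts of 110, 180, 400 give percentage errors of 10, 10 and 0, hence a MAPE of 20/3 and a MdAPE of 10.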
another direction of this study is to include parametric epidemic models like sir-f , and compare with m-rsc. as stated above, our objective is to study the effects of government response at an aggregate level in terms of lives saved, and in limiting the number of cases that require hospitalization. such interventions can effectively be studied at a comparative level: in other words, if we have data for the evolution of aggregate outcomes, e.g. the number of confirmed cases and deaths, when a policy is applied in a group under study versus when the same policy is not applied in a control group. however, government policies were applied at different levels across a geographic region, and we do not have a mechanism to conduct a randomized trial. hence, we consider using the synthetic control method [ ] [ ] [ ] . in a synthetic control setup, where observational data is available for different groups, we can construct a synthetic or virtual control group by combining measurements from alternatives (or donors). in the following, we provide a brief overview of m-rsc. suppose that observations from $N$ different geographically distinct groups or units are indexed by $i \in [N]$ over $T$ time periods (days) indexed by $j \in [T]$. let $k \in [K]$ index the metrics of interest (e.g. number of confirmed cases, number of deceased, number of tests conducted, etc.). by $M_{ijk}$ we denote the ground-truth measurement of interest, and by $X_{ijk}$ an observation of this measurement with some noise. let $1 \le T_0 \le T$ be the time instance at which our group of interest experiences an intervention, namely a government response to control the spread (e.g. stay-home order, school or business closure, or mass vaccination). without loss of generality we consider unit $i = 1$ (say, new york) and metric $k = 1$ (say, number of deaths) as our unit and metric of interest respectively. our objective now is to estimate the trajectory of the metric of interest $k = 1$ for unit $i = 1$ if no government response to control the spread had occurred.
in order to do that we will use the trajectories associated with the donor units ($2 \le i \le N$) and metrics $k \in [K]$. in the following we make two assumptions: (1) for all $2 \le i \le N$, $k \in [K]$ and $j \in [T]$, we have $X_{ijk} = M_{ijk} + \varepsilon_{ijk}$ where $\varepsilon_{ijk}$ is the observational noise, and (2) the same model is obeyed by $i = 1$ in the pre-intervention period, i.e. for all $j \in [T_0]$ and $k \in [K]$ we have $X_{1jk} = M_{1jk} + \varepsilon_{1jk}$. as described by the authors in , in the following we also assume that for unit $i = 1$ we only observe the measurement $X_{1jk}$ for the pre-intervention period, i.e. for all $j \in [T_0]$ and $k \in [K]$. our objective is to compute a counterfactual sequence of observations $M_{1jk}$ for the time period $j \in [T]$ and $k \in [K]$, and specifically for $T_0 \le j \le T$ and $k = 1$, using the synthetic version of unit $i = 1$. define $M = [M_{ijk}] \in \mathbb{R}^{N \times T \times K}$. $M$ is assumed to have a few well-behaved properties as required by the algorithm, namely, (1) $M$ must be approximately low-rank and (2) every element $M_{ijk}$ shall have a boundedness property (for details see ) . to check whether our model assumption holds in practice, we consider n = , t = , k = , with countries as units. we consider the number of confirmed cases and the number of deceased as two metrics over days between january , and june , . for assumption (1) to hold, the data matrix corresponding to the number of confirmed cases, the number of deceased and their combination should be approximated by a low-rank matrix. figure shows the spectrum of the top singular values (sorted in descending order) for each matrix. the plots clearly support the implication that most of the spectrum is concentrated within the top principal components. the same conclusion holds true when units are states of the united states, and when we consider only countries in the european union. let $Z \in \mathbb{R}^{(N-1) \times T \times K}$ correspond to the donor units, and $X \in \mathbb{R}^{1 \times T \times K}$ correspond to the unit under intervention. we obtain $\hat{M}$ from $Z$ after applying a hard singular value thresholding.
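The hard singular value thresholding step mentioned above (keep the top singular values of the donor matrix, zero out the rest) can be sketched with numpy; this is the generic denoising operation, with the threshold rank left as a parameter:

```python
import numpy as np

def hsvt(Z, rank):
    """Hard singular value thresholding: keep the top `rank` singular
    values of Z, zero out the rest, and return the low-rank estimate."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s[rank:] = 0.0
    return (U * s) @ Vt  # columns of U scaled by retained singular values
```

A matrix that is exactly low-rank is recovered unchanged when the threshold matches its true rank, while thresholding below the true rank projects onto the dominant components, which is what discards observational noise when the signal is approximately low-rank.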
subsequently, weights are learned using linear regression. as described in section , we use m-rsc to construct a synthetic control for the treatment unit using data from multiple control units, or donor group, using pre-intervention period data. the synthetic control is then used for estimating the counterfactual in the post-intervention period. in our setup, the intervention date is typically the date when a stay-home order or lock-down was declared for the treatment unit. however, government policy may have been applied over time with different levels of stringency measures. to understand this we use stringency and policy indices data from oxcgrt , which records the strictness of policies that restrict people's behavior and includes different measures -e.g. school and workplace closure, cancellation of public events, restrictions on gathering size, etc. figure shows the plot of the stringency index, with mobility data. it can be observed that the level of lock-down varies over time and geographic region. we use this information in two different ways. first, we choose the maximum level of the index, the first increased level of the restriction index, and days after the maximum level of the index as various intervention dates and compare their effects. second, we combine this information with mobility data to select the control groups for a treatment unit as discussed below. in order to understand the effect of stringency on a treatment unit for a metric, we need to select a donor group where the level of stringency was different or adherence to stringency was different. since there was a degree of stringency and adherence to such measures at different levels, under any possible choice of donor group we acknowledge that we will be underestimating the counterfactual -i.e. what it would have been without any stringency measures. to estimate the degree of adherence to lock-down measures, we use mobility data from apple, google and facebook.
apple mobility data provide a relative volume of directions requests per region, sub-region or city compared to a baseline volume -i.e. the percentage change over time from the baseline, including weekly seasonality. facebook data provide the relative percentage of the population that is staying in the same place and also the percentage of the population that moved from one region to another. in the facebook data, to quantify how much people move around, a measure is derived by counting the number of level- bing tiles (which are approximately meters by meters in area at the equator) they are seen in within a day. assuming $U_{d,r}$ is the set of eligible users in region $r$ on day $d$, and $\mathrm{tiles}(u)$ is the number of tiles visited by a given user $u \in U_{d,r}$, the total number of tiles visited for that region is given by $\mathrm{totaltiles}(U_{d,r}) = \sum_{u \in U_{d,r}} \min(\mathrm{tiles}(u),\ )$. the change-in-movement measure is then the difference between a baseline and the value on day $d$ of the average $\mathrm{totaltiles}(U_{d,r})$. similarly, the stay-put metric is calculated as the percentage of eligible people who are observed in only a single level- bing tile during the course of a day, on average, compared to a baseline. finally, the google community mobility report provides the percent change in visits to places like grocery stores and parks within a geographic area from a baseline. it can be seen from figure that the adherence and stringency levels do not correspond. for example, in sweden, with an increasing level of government measures between march -april, there has not been any significant change in the proportion of people staying put or moving between regions. similarly, in other places, it can be observed that while government measures remain at the same level over april, the number of people staying put at one place starts declining. selecting donor group: we combine metrics that allow the spread of the virus and similarly combine those that reduce the possibility of spread, taking the average to define a single adherence score.
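The tile-based mobility measures described above can be sketched in a few lines. The clipping constant in the tile sum is not recoverable from the text, so it is left as an explicit parameter here, and the function names are ours rather than Facebook's:

```python
def total_tiles(tiles_per_user, cap):
    # sum of per-user daily tile counts, each clipped at `cap`
    # (the actual clipping constant is an assumption left as a parameter)
    return sum(min(t, cap) for t in tiles_per_user)

def stay_put_percentage(tiles_per_user):
    # share of eligible users seen in only a single tile all day
    stayers = sum(1 for t in tiles_per_user if t == 1)
    return 100.0 * stayers / len(tiles_per_user)

def change_in_movement(day_tiles, baseline_avg, cap):
    # difference between the day's average clipped tile count and a baseline
    avg = total_tiles(day_tiles, cap) / len(day_tiles)
    return avg - baseline_avg
```

For instance, four users seen in 1, 1, 3 and 10 tiles with a cap of 5 contribute a clipped total of 10 tiles, and two of the four (50%) count as staying put.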
for unit $i \in [N]$, any $j \in [N]$ with $j \ne i$ is a donor unit if the adherence score and the stringency index of $j$ are less than those of $i$. figure shows this relation in graphical form for a selected set of countries. as per figure , the united states, brazil and the united kingdom are in the donor group when we compute the counterfactual estimate for italy. statistical performance evaluation: in figure we present the distribution of the mean and median absolute percentage error statistics for the forecasts from the m-rsc algorithm with changing forecast horizon. we consider every monday between march , and june , , both included, to forecast the number of deaths at the end of the day on june , for all states in the united states, and compare with actual data. for every state, the donor group is selected using the method described above. it can be seen from figure that both the mean and median absolute percentage error statistics are larger when a larger forecast horizon is considered, and that is to be expected. both mape and mdape converge to less than percent when the horizon is about a week. mape being much larger than mdape clearly indicates a right skew in the predicted values. 
daily update from jhu: we use this data to derive the metrics for units, i.e. the number of confirmed cases and the number of deceased for each day and geographic location. 
facebook movement range maps: the relative percentage of the population that is staying put and the percentage of the population that moved from one region to another. we use this data in the selection of donor units. 
apple mobility trends: relative volume of directions requests per region, sub-region or city compared to a baseline volume, categorized by driving, walking or public transport. we use this data in the selection of donor units. 
oxcgrt: strictness of policies that restrict people's behavior; measures are combined to provide a score between and , where is most stringent. we use this data in the selection of donor units.
google community mobility report: percent change in visits to places like grocery stores and parks within a geographic area. we use this data in the selection of donor units. this exactly corresponds to the places where exponential growth of the pandemic can be observed by march , . the counterfactual predictions for june , made on march , for these few places are several times higher, as expected, while stricter stringency measures were being implemented in the united states around those dates. 
the reproductive number of covid- is higher compared to sars coronavirus 
covid- r : magic number or conundrum? 
estimating excess -year mortality associated with the covid- pandemic according to underlying conditions and age: a population-based cohort study 
potential treatments for covid- ; a narrative literature review, archives of academic emergency medicine 
respiratory management in severe acute respiratory syndrome coronavirus infection 
synthetic control methods for comparative case studies: estimating the effect of california's tobacco control program 
robust synthetic control 
mrsc: multi-dimensional robust synthetic control 
estimating the impact of physical distancing measures in containing covid- : an empirical analysis 
the impact of shutdown policies on unemployment during a pandemic 
the effects of containment measures in the italian outbreak of covid- 
lockdown, one, two, none, or smart. modeling containing covid- infection. a conceptual model 
the effect of travel restrictions on the spread of the novel coronavirus (covid- ) outbreak 
variation in government responses to covid- 
an interactive web-based dashboard to track covid- in real time 
data for good: new tools to help health researchers track and combat covid- 
mobility trends reports 
helping public health officials combat covid- 
mathematical modeling of infectious disease dynamics 
we use the following five sources of data as described in table . all data and code used for this work is made available here: https://github.com/subhaskghosh/lockdown-paper . the authors have no competing interests. 
key: cord- -o fyjqss authors: bonasera, a.; zheng, h. title: chaos, percolation and the coronavirus spread: a two-step model. date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: o fyjqss we discuss a two-step model for the rise and decay of the covid- . the first stage is well described by the same equation as for turbulent flows and chaotic maps: a small number of infected $d$ grows exponentially to a saturation value $d_\infty$. the typical growth time is given by $\tau = 1/\lambda$, where $\lambda$ is the lyapunov exponent. after a time $t_{crit}$ determined by social distancing and/or other measures, the spread decreases exponentially as for nuclear decays and non-chaotic maps. a few countries, like china, s. korea and italy, are in this second stage, while others, including the usa, are near the end of the growth stage. the model predicts , (± , ) casualties for the lombardy region (italy) at the end of the spreading, around may , . without the quarantine, the casualties would have been more than , , hundred days after the start of the epidemics. the data from the us states are of very poor quality because of an extremely late response to the epidemics, resulting unfortunately in a large number of casualties, more than , on may , . s.
korea, notwithstanding the high population density ( /km²) and the closeness to china, responded best to the epidemics, with deceased as of may , . chaotic models have been successfully applied to a large variety of phenomena in physics, economics, medicine and other fields [ ] [ ] [ ] [ ] [ ] [ ] . in recent papers [ , ] a model based on turbulent flows and chaotic maps has been applied to the spread of covid- [ ] . the model has successfully predicted the rise and saturation of the spreading in terms of probabilities, i.e. the number of infected (or deceased) persons divided by the total number of tests performed. a dependence of the number of cases on the population density has also been suggested [ ] , and the different numbers of fatalities recorded in different countries (or regions of the same country) attributed to hospital overcrowding [ ] . in this paper we would like to extend the model to the second stage, i.e. the decrease of the number of events due to quarantine or other measures. the different fitting parameters of the model are due to the different actions, social behaviors, population densities etc. of each country, but there are some features in common, and it is opportune to first have a look at some data available at the beginning of may, . in figure , we plot the number of positives (top panels) and deceased (bottom panels) as a function of time in days from the beginning of the recordings. some data have been shifted along the abscissa to demonstrate the similar behavior. the different countries are indicated in the figure insets. as we can see, all the eu countries display a very similar behavior, including the u.k. notwithstanding brexit. the usa case has been shifted by days, which is the delay in the response to the epidemics resulting in the large number of fatalities. in contrast, s. korea reacted promptly and was able to keep the number of positives and, more importantly, the death rate down.
among the eu countries, germany shows the lowest number of deceased cases, which could be due to different ways of counting (for instance performing autopsies to check for the virus, like in italy). in any case, the analysis in ref. [ ] shows that different regions of italy have lower mortality rates (for instance the veneto region, which borders the lombardy region, the highest hit) comparable to germany. thus, similar to [ ], we can assume that the different overcrowding of health facilities, retirement homes, jails etc. might be the cause for the differences displayed in the figure. to contrast the epidemics, many countries have adopted very strict quarantine measures. social distancing decreases the probability to remain infected, thus we expect that countries with lower population density might have better and faster success. on the other hand, if some country adopts non-effective measures or is too late in the response, the lower population density might mask the problem for some time. thus, in order to better stress the efficacy of the quarantine, we have plotted in figure the number of cases divided by the population density, assuming that it is much easier to perform social distancing if the population density is low. in the figure we see that s. korea and japan, even though their densities are rather high, /km² and /km² respectively, perform best. we should also consider that s. korea (or japan) is 'across the street' from china, the epicenter of the infection [ , ], while the other countries are located across a continent or an ocean, giving further advantages to organize a response which unfortunately turned out to be weak and badly organized. the last points for china reflect an adjustment to the death rate in wuhan, which probably had similar problems as the lombardy region in italy [ ]: we will not be surprised to see future corrections. cc-by-nc-nd
international license; the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. this version was posted may , . one important ingredient is missing in figures ( ) and ( ): the number of tests performed daily. zero tests, zero cases and no problem, but then the hospitals get filled with sick people and we have a pandemic. in order to have realistic information on the time development of the virus, it is better to calculate the total number of cases divided by the total number of tests; this defines the probability to be infected, or the death rate probability due to the virus. we stress that such a probability may be biased, since often the number of tests is small and administered to people who are hospitalized or show strong signs of the virus [ , ]. the values we will derive must be considered as upper limits, but the time evolution should be realistic. not all the countries provide the number of tests performed daily (china) or, alternatively, some provide only the cumulative number of tests (spain). in the latter case we have assumed the number of tests performed daily to be constant. in figure , we plot the probabilities vs time for the same countries as in figures ( ) and ( ). as we can see, some cases show a smooth behavior indicating prompt and meaningful data taking. large fluctuations or missing data are also seen, which means that the number of early daily tests was very small. italy shows a decreasing behavior at long times, both for the positive and deceased cases, suggesting that the epidemic is getting under control. s. korea and japan display a similar behavior but with much lower values.
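the probability just defined, cumulative positives divided by cumulative tests, can be computed from daily counts in a few lines. this is a minimal illustration; the function name and the toy numbers are ours, not the paper's:

```python
import numpy as np

def infection_probability(daily_cases, daily_tests):
    """Cumulative positives divided by cumulative tests: the 'probability
    to be infected' used in place of raw counts, which removes the
    spurious day-to-day dependence on the testing volume."""
    cases = np.cumsum(np.asarray(daily_cases, dtype=float))
    tests = np.cumsum(np.asarray(daily_tests, dtype=float))
    return cases / tests
```

for example, three days with 1, 2 and 3 positives out of 10 tests per day give probabilities 0.1, 0.15 and 0.2.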
the other countries have not saturated yet or are close to it, and the figure (and ) suggests that the uk, france and spain will overcome italy, while germany performs best among the eu countries analyzed here, most importantly regarding the death rate. we have discussed and applied the first stage of the model in refs. [ , ]. we briefly recall it and write the number of people (or the probability) positive to the virus (or deceased for the same reason) as Π(d), equation ( ). in the equation, d gives the time, in days, from the starting of the epidemic, or the time from the beginning of the tests to isolate the virus. at time d = 0, Π(0) = d0, which is the very small value (or group of people) from which the infection started. in the opposite limit, d → ∞, Π → d∞, the final number of affected people by the virus. equation ( ) has the same form observed in figures ( ) and ( ), but in reality it should be applied not to the number of positives (or deceased) but to their probabilities, i.e. the number of cases divided by the total number of tests. the main reason for this definition is to avoid the spurious time dependence due to the total number of tests, which varies on a daily basis and very often not in a smooth way [ , ]. in figure we have plotted the probabilities for the different countries for which the data is available. it is important to stress that the information on the total number of daily tests is crucial and should be provided, also to avoid suspicion about data handling. if we treat equation ( ) as a probability then we expect it to saturate to d∞ at time t_crit. at later times, if social distancing is having an effect, we expect the probability to decrease and eventually tend to zero. in figure , we see exactly such a behavior for the cases of two italian regions: lombardy and sardinia [ ], https://github.com/pcm-dpc/covid- . for times larger than t_crit the decrease is exponential and can be described as for nuclear decays and non-chaotic maps [ , ]: Π(d) = d∞ e^(−α(d − t_crit)) ( ).
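the two stages can be put together in a short numerical sketch. the exact functional form of the first stage, equation ( ), is not reproduced in this text; below we assume a logistic-type rise, which matches the stated limits Π(0) = d0 and Π → d∞, followed by an exponential decay with rate α after t_crit. all parameter names follow the text, but the values are illustrative, not the paper's fits:

```python
import numpy as np

def pi_growth(d, d0, d_inf, lam):
    """First stage (assumed logistic-type rise): satisfies
    pi(0) = d0 and pi(d) -> d_inf for large d, growth rate lam."""
    return d_inf / (1.0 + (d_inf / d0 - 1.0) * np.exp(-lam * d))

def pi_two_step(d, d0, d_inf, lam, alpha, t_crit):
    """Two-step model: growth until t_crit, then exponential decay
    (equation 2), as for nuclear decays and non-chaotic maps."""
    d = np.asarray(d, dtype=float)
    rise = pi_growth(d, d0, d_inf, lam)
    peak = pi_growth(t_crit, d0, d_inf, lam)   # value reached at t_crit
    decay = peak * np.exp(-alpha * (d - t_crit))
    return np.where(d <= t_crit, rise, decay)
```

a call like `pi_two_step(days, d0=1e-4, d_inf=0.3, lam=0.2, alpha=0.05, t_crit=40.0)` produces the rise-plateau-decay shape described above.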
α and t_crit are fitting parameters. values for the lombardy region are α = . ( . ) d⁻¹ and t_crit = ( ) d for the positives (deceased). we can infer the decay time as τ_d = 1/α = ( ) d, suggesting that roughly τ_d after the maximum the epidemics should be over, i.e. d_max ≈ t_crit + τ_d = ( ) days from february , . from the figure it is quite easy to derive the value of t_crit, given by the maximum. this value differs slightly for the positives and the deceased, as well as for the different regions. thus it is important to have enough data to perform best fits using equations ( ) and ( ). for some of the countries plotted in the figure, only the first stage can be reproduced, and we can obtain the saturation value d∞. the value of t_crit depends on many factors including the population density, the weather temperature, humidity etc., and especially social distancing or any other measure used to contrast the epidemics. if no measures are adopted (herd immunization or natural selection approach), such as for some countries like sweden and, at first, the uk, then we expect the plateaus in figures ( ) and ( ) to last longer, but eventually the process will be described by equations ( ) and ( ). the herd immunization approach might be reasonable if we do not think we are going to get a vaccine soon. however, in such cases we may also expect to be flooded by positive and deceased persons, jeopardizing the health structures and harming the sanitary personnel [ ]. a country like sweden with excellent sanitary structures and low population density ( /km²) may succeed in this task, but the same attempt in the uk ( /km²) was a disaster and quickly abandoned, as can be seen from figures ( - ).
in particular, in figure we see that the uk has the largest probabilities, https://www.who.int/emergencies/diseases/novel-coronavirus- . of course the predictions are valid only if the conditions are not changed, for instance by relaxing the quarantine too soon. if these conditions are modified then we may have an increase of the cases again and a return to the original curve given by equation ( ); such a behavior might be noted in figure ( ) for japan. at the time when the olympics were under discussion, japan interrupted covid- testing, as can be seen from the plateaus in figures ( ) and ( ) lasting approximately days. thus it is important to understand when to relax the measures, and for this reason we have plotted in figure two cases. lombardy is the worst case in italy, with more than , deceased, in contrast to sardinia with about as of may , . in the figures we can see that the probabilities are much lower for sardinia, which could be regarded in some sense as the future of what should eventually happen in lombardy. the population density of sardinia is relatively low, /km², and it is an island away from the mainland. it is in many respects very similar to s. korea, with lower population density. thus, the measures might be relaxed in sardinia following the example of s. korea, after careful instructions to the population and random everyday testing to search for positives to isolate them. this will provide crucial information on the social behavior and on the virus spread. we will show below that the model predicts a small number of positives and deceased for lombardy around or after may ; thus shelter at home might be extended up to that day.
it would be important to send some signals of return to normality to the population after months of sheltering, by organizing for example sporting events in sardinia. the italian national sport, "serie a", might organize - games per day in different sardinian towns, with empty stadiums and broadcast live. other limited but strongly controlled activities could be allowed in less affected regions such as calabria, abruzzo and other southern italian regions discussed in ref. [ ]. releasing all measures for the entire country at the same time might be not too wise. looking at other countries' experiences, we would suggest that quarantine should not be released before the probability for positives is less than % (the maximum of s. korea, figure ). below such a value, the other countries may follow the s. korean approach, but if they are not organized to do that, reopening too soon may be dangerous. the model describes the data very well and might be used for the everyday control of the resurgence of the epidemics. it offers another great advantage: we have described a way to eliminate misleading inputs due to the number of everyday tests. we can proceed in the inverse direction in order to predict the total number of deceased and positive cases. the task that we have now is much easier: it is the prediction of the daily tests for each case. as we have seen from figures ( - ), there were some wrong decisions taken by the different countries at the beginning of the epidemics (apart from s. korea and japan), resulting in a very small number of tests. after - weeks the number of tests per day was increased and eventually became constant. it is this behavior we have to predict in order to extend our model to the total number of cases. in figure ( ) we plot the total number of tests vs time in days from the beginning of the recordings
for lombardy (february , ). we have fitted the data with a power-law function, as indicated in the figure, but any other suitable function f(d) might do as well. as we see from the figure, the italian data is fitted very well, with a small error on the fitting. fits performed for other countries give a power exponent ranging from . (s. korea) to . (uk). this is also an indication of how well organized the response to the pandemic is. in the ideal case we expect the power to be about ; the value for the uk suggests some change of strategy (i.e. from herd immunization to quarantine), and because of such a high value we are not able to make predictions on the total number of tests, say, days after may , . multiplying equation ( ) or ( ) by the predicted number of tests from figure gives the total number of predicted cases, which are compared to the data in figure . we assume a conservative % error in our estimates due to the different fit functions. without social distancing, using equation ( ) gives , (± , ) for the positives and , (± , ) for the deceased days after the beginning of the epidemics in lombardy. if the exponential decay given by equation ( ) is taken into account (due to the quarantine), the values decrease to , (± , ) and , (± , ) respectively: about , saved lives in lombardy alone! there is an important difference between the two stages: if the first stage alone were at play, the epidemics might continue after the days and eventually slow down at longer times. recall that the spanish flu started in and lasted almost months, with an enormous death toll, https://www.washingtonpost.com/graphics/ /local/retropolis/coronavirus-deadliestpandemics/.
because of the second stage, the predicted values are now given by the maxima in figure ; these occur and days respectively after the start of the epidemics recording, i.e. may and , respectively. these values are close to the sum of t_crit and τ_d reported above. if we assume a power law to reproduce the available data for the number of tests, figure , then we can write the total number of cases in the second stage as the product of the power-law fit f(d) and the decaying probability, equation ( ). the fitting parameters m - are reported in figure for lombardy. to find the maximum of equation ( ), we simply equate its derivative to zero ( ). using the empirical relation above connecting d_max and t_crit we get: . this relation is very useful, especially when the data does not show the exponential decrease, since it reduces the number of free parameters entering equation ( ). similar relations can be derived for different parameterizations of the total number of tests. figures - show that the usa was hit hard by the covid- , resulting in different responses from the different states. in this section we will analyze some of these states; more analysis can be found in the supplemental material or is available from the authors. in figure , we plot the probabilities for the state of california (ca) for the period indicated in the inset; compare to figures and . the discontinuities are due to the change in the number of tests performed daily. notice that march , coincides with the quarantine declaration in italy; thus it was not a surprise that the virus spread quickly.
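for illustration, if the cumulative tests follow a pure power law f(d) = m1·d^m2 (our reading of the power-law fit mentioned above; the paper's exact parameterization and fitted m values are not recoverable here) and the second-stage probability decays as d∞·exp(−α(d − t_crit)), then setting the derivative of the product to zero gives d_max = m2/α. a quick numerical check with made-up parameters:

```python
import numpy as np

# Illustrative parameters only, not the paper's fitted values
m1, m2 = 500.0, 2.0          # assumed power-law fit to cumulative tests: f(d) = m1 * d**m2
d_inf, alpha, t_crit = 0.25, 0.04, 35.0

# Total cases in the second stage: predicted tests times decaying probability
d = np.linspace(t_crit, 200.0, 200001)
total_cases = m1 * d**m2 * d_inf * np.exp(-alpha * (d - t_crit))

d_max_numeric = d[np.argmax(total_cases)]
d_max_analytic = m2 / alpha  # from d/dd [d**m2 * exp(-alpha*d)] = 0
```

the numerical maximum agrees with m2/α; note the location of the maximum does not depend on m1, d∞ or t_crit, only on the power exponent and the decay rate.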
fortunately, the san francisco mayor and the california governor placed strict restrictions as early as march , without waiting for better testing, https://www.sfdph.org/dph/alerts/files/healthofficerlocalemergencydeclaration- .pdf. this action saved a large number of lives and kept the ratio deceased/positives very low; compare to figure . we can correct in some cases for the low number of tests: large data takings have a better statistical value, thus we can renormalize the data where the jumps occur to the values at later times. in the right panel we display the result of the renormalization together with the fit using equation ( ). the hardest hit state was new york. in figure we display the probabilities together with the fits using equations ( ) and ( ); compare to figures , . the ratio deceased/positives seems smaller than the lombardy one; however, particular attention should be paid to the counting methods, and some confusion might arise if the data refer to the state of new york (ny), https://coronavirus.jhu.edu/map.html, or to new york city (nyc), https://covidtracking.com/data/state/new-york#historical, the difference being roughly deaths, since most cases are in nyc. the bending down of the curve is evident and we can make a prediction using equation ( ). the resulting fit is displayed in figure ; it follows well the available points, but further confirmation will be given by future data. using the predicted number of tests for ny given in figure , left panel, and the probability fits from equations ( ) and ( )
we have proposed a two-step model for the rise and decay of the epidemics due to the covid- . the model needs some input parameters to predict the time evolution up to the saturation of the probability, as in equation ( ). once the plateau, given by the d∞ parameter, is reached, the probability remains constant for some time, depending on the quarantine measures or other environmental factors. for the italian case, the first test was published on february and equation ( ) was fitted on march , before the quarantine was announced, i.e. march [ ]. the plateau was reached around march , as predicted by the model. these dates suggest that the quarantine was not effective in reducing the maximum probability and the time when this was reached. the quarantine became effective roughly days after saturation. thus we can estimate that it takes about weeks before the quarantine has an effect and the probabilities start decreasing; this is the value of t_crit entering equation ( ). we can suppose that if the quarantine had been announced, say, days earlier, then the exponential decrease, equation ( ), would have intercepted the rise, equation ( ), earlier, resulting in smaller probabilities. this is what happened in s. korea and japan, and it would explain the differences among countries: the later and the more feeble the quarantine, the higher the probabilities and the longer the time to return to (quasi) normality. from these considerations we can estimate the time it takes for other countries even if the probability decrease is not seen yet. after reaching the top of the probability, see figure , it took roughly days for italy to see the decrease. if the data shows the decreasing part, then a fit using equation ( ) is performed; otherwise we use the parameters found for italy.
in figure we plot the predicted total number of positives; in the same figure we plot the same quantities for the deceased. countries which did not provide the number of daily tests (spain, china) were not analyzed, including the uk because of the large increase in testing, especially at later times, in coincidence with their prime minister's hospitalization. sweden decided to follow a different path, not imposing the quarantine (herd immunization or natural selection), a choice that could be justified under the assumption that the vaccine will not be available soon enough. in figure we plot the probabilities for sweden, finland and norway, since they are bordering countries. the probabilities are quite different, especially regarding the death rate. we predict for sweden about , (± . e ) positives and , (± ) deceased respectively on june , . since there is no quarantine, we are not able to estimate t_crit and the decay rate. on the same day, using equation ( ), we predict for the other countries the values , (± . e ) and , (± ) for finland, and , (± . e ) and (± ) for norway. thus we see that herd immunization takes a heavy toll, not justified by the larger swedish population (a factor of with respect to the other countries considered), and it will be very difficult to explain this choice to the relatives of the victims and their lawyers. we do not have any explanation for the difference in the number of deceased for norway and finland, since the number of positives is practically the same. authorities of those countries should investigate this difference further. one feature worth noticing from figure is the time delay and the slow spread of the covid- ; this could be due to the extremely cold weather in the winter and early spring for these countries.
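the gaussian fit of cases vs temperature discussed in the next paragraphs can be done in log space, where a gaussian becomes a parabola. this is a sketch with synthetic data, not the paper's fitting procedure or its fitted values:

```python
import numpy as np

def fit_gaussian_log(temps, cases_per_million):
    """Fit y = A * exp(-(T - mu)**2 / (2 * sigma**2)) by fitting a
    parabola to log(y); returns (A, mu, sigma)."""
    a, b, c = np.polyfit(temps, np.log(cases_per_million), 2)
    sigma = np.sqrt(-1.0 / (2.0 * a))   # requires a < 0, i.e. a peaked fit
    mu = -b / (2.0 * a)                 # temperature at the maximum
    A = np.exp(c - b**2 / (4.0 * a))    # height of the maximum
    return A, mu, sigma
```

with state-averaged data as input, `mu` estimates the temperature at which the cases-per-million curve peaks and `sigma` its width; outliers such as heavily touristic states should be inspected before fitting.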
there is some hope that the warmer season will help to normalize the situation, as for flu. other reasons might be put forward; for instance, if the virus is somehow adapted to bats, we can naively assume that it will be more deadly for temperatures higher than °c, since below such a value most bats hibernate. the temperature difference might explain the spread delay in countries like france, the uk and germany with respect to italy. of course, other ingredients must be considered, such as people flows from/to infected places, population density etc. no matter what the reasons may be for a temperature dependence of the spread, it is clear that some systems perform better if it is not too hot or too cold. we can test these hypotheses using the us states data, since they cover a wide range of temperatures in the spring season. in figure , we display the results obtained using the data on may , . different states' values were averaged if their temperatures differed by about °c, in order to have better statistics. [figure : probabilities as a function of time for the countries indicated in the inset. sweden is adopting the natural selection option, resulting in higher probabilities compared to nearby countries. different starting data depend on which day the complete information needed for the plot was released.] gaussian fits give °c at the
if we take this result at face value, it predicts about people per million inhabitants to be positive to the virus in the summer with c average temperatures. even if this value seems small, it is a seed d to restart the epidemic. we can already see this in the figure from the large increases over the gaussian fit corresponding to high population density states and large touristic flows. it might suggest that low temperatures in hospitals may decrease the virus aggressive spreading, keeping in mind that a vaccine is the only definitive solution. until then we can only aggressively test and isolate positives similarly to the s. korean approach to the pandemic. in conclusion, in this paper we have discussed the predictive power of a two-step model based on chaos theory. a comparison among different countries suggests that it would be safe to release the quarantine when the probability for positive is lower than %, the maximum value for s. korea. this implies that, if the quarantine is dismissed, then the same measures, as for the koreans, should be followed by the other countries: careful testing, backtracking and isolation of positives. herd immunization or natural selection is very difficult to justify from the data available so far, especially since we are dealing with thousands of human lives no matter the age or other nonsense. deterministic chaos fluid mechanics mediterr. conf. control autom. -conf. proceedings, med' [ ] a. bonasera, g. bonasera [ ] povh, b., rith, k., scholz, c., zetsche, f., rodejohann, w., particles and nuclei, , springer; isbn - - - - . key: cord- -qarz o z authors: ansumali, santosh; prakash, meher k title: a very flat peak: exponential growth phase of covid- is mostly followed by a prolonged linear growth phase, not an immediate saturation date: - - journal: nan doi: . / . . . 
sha: doc_id: cord_uid: qarz o z when actively taking measures to control an epidemic, an important indicator of success is crossing the "peak" of daily new infections. the peak is a positive sign which marks the end of the exponential phase of infection spread and a transition into a phase that is manageable. most countries or provinces with similar but independent growth trajectories had taken drastic measures for containing the covid- pandemic and are eagerly waiting to cross the peak. however, the data after many weeks of strict measures suggest that most provinces instead enter a phase where the infections are in a linear growth. while the transition out of an exponential phase is relieving, the roughly constant numbers of daily new infections differ widely, ranging from around in singapore to around just in lombardy (italy), and in spain. the daily new infection rate of a region seems to depend heavily on the time point in the exponential evolution when the restrictive measures were adopted, rather than on the population of the region. it is not easy to pinpoint the critical source of these persistent infections. we attempt to interpret this data using a simple model of newer infections mediated by asymptomatic patients, which underscores the importance of actively identifying any potential leakages in the quarantine. given the novelty of the virus, it is hard to predict too far into the future and one needs to be observant to see if a plan b is needed as a second round of interventions. so far, the peak achieved by most countries with the first round of intervention is extremely flat. sars spread in around countries, infecting around , individuals globally [ ]. quarantine strategies were implemented by several governments, and they were effective in reducing the number of newer infections [ ]. the numbers of casualties from other epidemics such as swine flu in and mers in were also similarly contained.
however, the infections caused by the novel coronavirus (covid- ) continue to increase. the world health organization has declared covid- a pandemic, the first one in the 21st century [ ]. since so far there is no known treatment or vaccine, after weighing the damage to lives versus that to the economy, most governments across the world have implemented non-pharmaceutical interventions such as strict measures of social distancing, or even lockdowns and curfews, guided by the historic response to the . for an emerging pandemic such as covid- , a first natural scientific impulse is to model it via standard epidemiological models [ ] [ ] such as the susceptible-infected-recovered (sir) model, to predict how rapidly the infections can spread without an intervention or how quickly a lockdown program may be planned [ ]. governments and public health modelers are interested in understanding the effectiveness of various strategies [ ] [ ] [ ] [ ], starting from a complete lockdown or a reduced social contact [ ], and a subsequent release of restrictions in a phased manner. many such models have already been developed for covid- , guided by past intuitions about how the epidemic spread declines with changes in season or with active containment strategies, and by the success of china, which after months of lockdown has consistently reported no new covid- infection cases for more than weeks. in the time of a pandemic, an important question asked on a daily basis by the public and policy makers is when the pandemic is going to 'cross the peak'. crossing the peak signifies that one may expect fewer cases of infection compared to the previous day. it sends positive signals of pandemic containment to the people as well as to the economy and other aspects of social life. it also indicates that the time for lowering the guard is not too far.
as such, a few weeks after these strict measures, and noting the reported success of china, governments of various provinces and countries are waiting for the new daily infections to cross over the peak. because of the drastic measures, the number of daily new cases is no longer increasing in many places. however, it is worrying that they do not have clear signatures of a downward trend either. in this context, we perform a detailed analysis of the nature of this peak, and whether it has been achieved. as it turns out, most provinces and countries that implemented containment are no longer in an exponential growth phase, but rather enter a new, and possibly unexpected, linear growth phase which we discuss here. in the early stage of infection spread, each infected individual becomes a vector for transmitting the infection. thus, the rate of increase of infections can be captured by a simple model, di/dt = r·(i − r)·(p − i), eq. ( ), where i is the number of infected individuals, r is the number of recovered individuals, p is the total population and the rate of transmission is r. specifically, with the current variant of covid- , where the median recovery time is - days, the exponential increase in i is always much faster than the slow growth of r. one can see from the data in the growth phase of the pandemic spread in any country that typically r < - % of i, and i << p. in principle, one can also consider detailed models such as the sir model, or even detailed agent-based models assuming a more realistic social contact structure. however, the simple phenomenological model di/dt = a·i, eq. ( ), does capture the growth of i. the number of infections in a well-diffused society, community or province would thus grow as i(t) = i0·exp(at), where i0 is the number of infections at t = 0. however, the transmission across countries or less-frequented provinces occurs largely through a jump-diffusion process, with only occasional jumps over these boundaries and a diffusion within the region.
as a consequence, the trajectory of a country with two epicenters can be thought of as that of two independent, weakly interacting subsystems, which leads to the emergence of a multi-exponential i(t) = i1·exp(a1·t) + i2·exp(a2·t), with i1 and i2 the numbers of infections in the two decoupled regions at t = 0, and a1, a2 the rates in the two regions, which may or may not be the same depending on mobility in the cities and on any other restrictions imposed by the local governors. for administrative reasons, one may be interested in following the trajectory of the world, or of a specific country. but depending upon the lag between these multiple hotspots, the exponential nature of the growth gets masked. thus, for detailed studies, one needs to unmask this data by decoupling these multiple exponentials and focusing on the individual provinces, which are possibly separated from other provinces by travel restrictions. after decoupling, it is clear that the different regions show very similar exponential growth curves. decoupled data show a shift from exponential to linear growth. we first illustrate the qualitative change in the spread of infection using the data on the number of infected cases in lombardy [ ] . as figure shows, about a week after the lockdown, the growth in lombardy transitioned from exponential to linear. in figure , we study the growth in south korea, singapore, saudi arabia, switzerland, spain [ ] , germany and two of its states (bayern, baden) and another italian province, venice. it is apparent that the later part of the data from all these countries shows a clearly linear trend. even the number of deaths recorded on a daily basis shows the same linear trend. clearly a transition out of the exponential regime is a relief. this transition happens around days from the day of the restrictive measure, possibly coinciding with the distribution of the incubation time.
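the masking effect described above can be illustrated with a small sketch (hypothetical seeds and rates): each region alone has an exactly straight log-curve, while the aggregated national curve does not:

```python
import numpy as np

t = np.arange(60, dtype=float)

# two hypothetical decoupled epicenters with different seeds and rates
region1 = 5.0 * np.exp(0.25 * t)   # early, faster epicenter
region2 = 1.0 * np.exp(0.10 * t)   # later, slower epicenter
national = region1 + region2       # what a country-level curve reports

def log_curvature(series):
    """second difference of log(series); identically zero for a pure exponential."""
    return np.diff(np.log(series), n=2)
```

a nonzero log-curvature of the aggregate is exactly the signature that motivates decoupling the provinces before fitting.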
however, the slope of these curves, which indicates the number of new daily infections, is very different for the different regions. south korea's response by contact tracing and extensive testing has been widely praised. although the number of daily new cases has crossed a peak, there were still an average of cases every day from the th of march till the rd of april. while new cases may be a manageable number in terms of resource allocation, several other countries or provinces are in a linear growth regime with much larger numbers of daily new cases: switzerland ( ), italy (≈ ), germany (≈ ), spain (≈ ), despite the containment measures. to date, other than china, which continues to report nearly zero new infected cases every day for the past few weeks, all other countries are either in an exponential phase or in a linear growth phase. the daily new infected cases for the provinces and countries we analyzed in figures , are given in table . the dependence of the daily new cases on various factors, such as the extent of testing, the population of the region, etc., was studied in figure . a strong correlation was observed with the number of cumulative infections in the region, and with the number of daily infections at the time when the containment measures were taken, which seems to suggest that covid- infections at this point are held in a pause. looking back, had the quarantine or lockdown decision been taken later, the average number of daily new cases would have been significantly higher. of course, the same message applies for the future, before relaxing the social restrictions that have helped contain the spread of covid- . there is now enough evidence that one main difference of covid- has been the high rate of transmission by asymptomatic individuals. in an attempt to model the observed transition from an exponential to a linear growth phase, we resort to simple rate equations by including a, the number of asymptomatic patients.
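the correlation analysis mentioned above can be reproduced with a plain pearson coefficient; the per-region values below are hypothetical placeholders, only to show the computation:

```python
import numpy as np

def pearson_r(x, y):
    """pearson correlation coefficient between two equal-length samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))

# hypothetical per-region values: cumulative infections at containment vs
# the later linear-regime slope (daily new cases) of each region
cumulative_at_lockdown = [1200, 5400, 300, 9800, 2500]
linear_slope = [95, 410, 30, 760, 200]
rho = pearson_r(cumulative_at_lockdown, linear_slope)
```

a rho close to one, as in this toy data, is the kind of signature that would suggest the daily-case plateau simply "freezes" the state reached at lockdown.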
l is the infection rate via asymptomatic individuals, d reflects the natural rate of reduction in the number of asymptomatic patients post-incubation period, b is the rate at which asymptomatic individuals transmit to other individuals, r is the rate at which, by performing tests, one reduces a by moving them to quarantine. c is the rate of increase due to non-human sources such as aerosols or contact surfaces. we perform the simple analysis with µ = 0, assuming the recovery rate is much slower than the rate of infection. a median hospitalization time of weeks from the data does support this assumption, as in the exponential phase the increase in the number of infections over these weeks is much higher. however, the following analysis should remain valid even if µ ≠ 0. in figure , we illustrate a simulation of these differential equations to show the transition to the linear regime, the nature of the peak, the constant rate of daily fluctuations and the reduction in the daily infection rate with r. in eq. a, the dependence of di/dt on i is the main reason for the exponential growth. a quarantine of infected individuals removes this dependence, setting a = 0, and turns the behavior into one of linear growth. the number of asymptomatic individuals will, in principle, reduce to zero after a strict implementation of quarantine, followed by the decay of infection in the individuals. however, assuming our model is realistic, a sustained increase in infections appears to be possible only through a leakage in the quarantine program. the lowest rate of spread of i will be in the steady state da/dt = 0, where g' is the growth in a due to a leakage from i. since the measures are already in place, this will be a weak coupling. as long as a ≠ 0, the cumulative number of infected individuals will continue to increase until most or all of the susceptible individuals are infected (eq.
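the paper's exact rate equations are not reproduced in this extract; the following is a plausible minimal reconstruction consistent with the parameters described (l, b, d, a testing rate r, and a weak leakage g from i into a), shown only to illustrate the exponential-to-linear transition; all numerical values are assumptions:

```python
import numpy as np

# hypothetical parameters; the paper's numerical values are not in this text
l, b, d, r_test, g = 0.2, 0.3, 0.1, 0.25, 0.002

def simulate(days, quarantine_day):
    """euler integration of a minimal i/a sketch: before quarantine_day the
    confirmed infected also transmit directly (exponential phase); afterwards
    only the asymptomatic pool a, fed by a weak leakage g*i, drives growth,
    so di/dt approaches the near-constant l*a and the curve turns linear."""
    i, a = 1.0, 1.0
    traj = []
    for day in range(days):
        contact = 0.25 if day < quarantine_day else 0.0  # direct transmission
        di = l * a + contact * i
        da = (b - d - r_test) * a + g * i
        i += di
        a += da
        traj.append(i)
    return np.array(traj)

traj = simulate(200, 30)
daily = np.diff(traj)   # daily new infections
```

before the quarantine the daily increments multiply steadily; afterwards they stay nearly constant over months, the flat-peak behavior discussed in the text.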
a), which is quite undesirable considering the scale of devastation already caused to even the developed countries by roughly in infections. according to this model, and not too far from common sense, a significantly high rate of testing r, and keeping a check on the potential leakage from infected individuals even under quarantine conditions, can reduce the number of asymptomatic individuals. in this work, by studying the covid- infection data from several countries which implemented quarantine, we note that the exponential growth phase ends, but is followed by a linear growth phase. a deviation from an exponential growth phase is a relief to the population, and a sign of success of the containment measures. its significance is that there are no new infections caused by an individual who is understood to be infected. however, via indirect routes or secondary effects there is still a constant rise in the cumulative number of infections, in many places at a very high daily rate. the peak is flat. the unmoderated peak is understood at the population level using established sir models, when most or all of the susceptible individuals develop infections and immunity. in the first week of april, the number of global covid- infections reached million. hospital resources, health care personnel and economies are already overwhelmed by the pandemic, when as little as in are infected in many developed countries. so, at this point in this work we do not attempt to project the dates when a much higher fraction of society is infected, or to evaluate the consequences of such mass-scale infection, which may also come with several other assumptions, such as a reduction of virulence upon spreading, etc. further, the definition of who is susceptible is not yet clear. in the initial months of covid- infection, people over an age of , and those with comorbidities, were considered highly susceptible.
however, although at a much lower rate, one begins to hear about healthy individuals in their s or s succumbing to covid- . instead, we focus on the peak that is achievable by active interventions, such as those seen with sars and mers. a peak, explained in lay language, is the first time the number of new infections is lower than that of the previous day. however, with stochastic fluctuations on a daily basis, this statement needs to be interpreted by observing consistent trends over a few days. the number of daily covid- infections in many places has been roughly constant, at least for to weeks, after containment measures. even if the new infections do eventually decline for any reason, it must be understood that the high number of daily infections presents a strange situation of having chronic and acute severity simultaneously for at least many weeks, if not longer. moreover, if instead of following the trends in the daily fluctuations one fits a sigmoid function to the cumulative infections, it can lead to a confusing interpretation. extrapolating the slowing exponential trend in the early days following the quarantine, the sigmoid will predict a peak. however, this peak will shift when the same fit is repeated days later, with a dominant linear component, as in reality a linear graph does not have a peak. we explore the possibility that the constant rate of new infections is a false signal arising from the constant rate of testing. if that is indeed the case, the screening tests are detecting infections of people who are not just asymptomatic, but asymptomatic and not contagious. if there is a way of distinguishing the latter, it must be clarified, to reduce the global panic levels.
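the shifting-peak artefact of sigmoid fits can be demonstrated with a coarse grid-search fit on a synthetic exponential-then-linear cumulative curve (all numbers hypothetical):

```python
import numpy as np

def logistic(t, K, k, t0):
    """three-parameter logistic: plateau K, steepness k, inflection day t0."""
    return K / (1.0 + np.exp(-k * (t - t0)))

def fit_logistic(cum, t0_grid, k_grid=(0.05, 0.1, 0.15, 0.2, 0.3)):
    """coarse grid-search least-squares logistic fit; returns best (K, k, t0)."""
    t = np.arange(len(cum), dtype=float)
    best, best_sse = None, np.inf
    for K in np.linspace(cum[-1], 4.0 * cum[-1], 25):
        for k in k_grid:
            for t0 in t0_grid:
                sse = float(np.sum((logistic(t, K, k, t0) - cum) ** 2))
                if sse < best_sse:
                    best, best_sse = (K, k, t0), sse
    return best

# hypothetical cumulative curve: exponential for 30 days, then linear
daily = [2.0 * 1.2 ** t for t in range(30)]
daily += [daily[-1]] * 50          # constant daily cases = linear cumulative
cum = np.cumsum(daily)

_, _, t0_short = fit_logistic(cum[:50], np.arange(0, 100))
_, _, t0_long = fit_logistic(cum[:80], np.arange(0, 130))
# the fitted "peak day" t0 keeps moving out as the linear phase lengthens
```

because a line has no inflection, every re-fit on a longer window of linear data pushes the inferred peak day later, which is exactly the confusing interpretation warned about above.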
however, prima facie this possibility can be refuted: although countries such as south korea and singapore at one extreme are performing extensive screening tests ( and tests on average for every detected infection), countries such as switzerland perform tests only when there are significant symptoms or pre-existing vulnerabilities, and of course for health care personnel. thus, new infections arising purely as an artefact of over-testing does not seem like a possibility. so far, other than china, most countries have shown only a transition from an exponential to a linear growth phase. the hope is that eventually the linearity will fade out, with an exponential decay of the numbers of asymptomatic and infected individuals, or at least reduce in intensity as it did in south korea, which showed a shift from constant daily new infections of to daily new infections of (the second linear regime in figure ). the second linear regime from south korea thus presents an interesting case study. south korea had ramped up its testing capacity from about per day in early february to around , per day from february till at least early april. whether the reduced number of daily infections a few weeks after this ramp-up is a consequence of the tests, or of contact tracing that allowed them to test and isolate a large number of asymptomatic individuals, or depends purely on the decay time of the infection in asymptomatic individuals, which has so far been estimated to be around days, needs to be understood with an in-depth analysis of the policies and implementation, which we could not perform even after parsing through the information that is publicly available. it is clear that the availability or implementation of newer resources such as mass immunizations, therapeutic interventions, or even the chance that sars-cov- reduces in lethality due to mutations or seasonal variations, were not considered in our analyses.
several other models have made predictions of the peak of the infection at the population level. if any of these possibilities arise, it is possible to adapt those models to predict the peak for a newer country or region which has still not implemented those interventions. however, the reality today is different and none of these options is at our disposal. we instead focus on how this scenario is evolving, based on the real data rather than assumptions. whether the number of daily new infections continues at the constant but very high daily rate, or declines, should be tracked from the real data rather than predicted from past experiences with other infections. (this article is a medrxiv preprint made available under a cc-by-nc-nd international license; the copyright holder is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity.) if this hope does not get the support of the data, each country, depending upon its current overload of active infections and health care resources, needs to have a "plan b". presently the only two interventions available are testing and isolation. both of these are of course qualitative in nature. how extensively to perform the tests, and how restrictive and privacy-limiting the isolation should be, has been interpreted differently by different countries. in summary, we ask three questions: has the implementation of these strategies been successful? the answer we find is that it has. from an exponential behavior, most countries transitioned to a linear behavior, showing that the infected individuals are not causing new infections. and this happened roughly around days after the implementation of the social distancing or quarantine policies. while this is a moment of triumph, and many may be wondering when is a good time to lower the guard and relax the restrictions, we ask the second question: has the goal of reducing the number of daily infections below a manageable level been achieved?
from the data after to weeks of strict measures by many countries, it has not yet reached this level. this happens because instead of declining, the number of daily cases saturates, and at very high values ( in spain, in germany, etc.). we then raise the third question for an open interpretation, not to be restricted by the limited understanding of the authors: why does this linear regime persist for so many weeks, and would it reduce in intensity naturally, or require a newer intervention such as extensive testing or the prevention of unexpected leakages in the system of isolation and quarantine, through health care workers, essential services or any other means we cannot imagine today? if there are indeed such leakages, they are not the ones that can be predicted by following the overall number of infections of states or countries, as was done in this work, but rather by taking a detailed audit, tracking the need and rigor of implementation in each industry and segment of society. we analyze the data using a simple model that seems to suggest that if the linearity persists, it may be due to leakages in the quarantine system, and can be partly compensated by increasing the rate of testing. theoretically, it is possible that the measures adopted by china were much more stringent compared to other countries, which allowed a reduction in the new infections. however, given the gravity of the situation, we present our observation, analysis and model in all humility, for an open interpretation. the aim of this work is mainly to point to the existence of this new, and at least for us, linear growth phase with a very high number of daily new infections, which brings a very hard-to-manage mix of acute and chronic societal burden. given the gravity of the situation the world is facing, the data of the linear phase needs an open interpretation by all the experts.
with all good intentions, we wish the slope of this linearity to be reduced in a few weeks, so that the linearity shown in this work is not relevant in the longer term. however, the data is not currently in favor of such wishes. thus, to understand if the existing interventions are working, one needs to detach oneself from the notion of a peak followed by a decline, or even from the linear trend. table . the data up to the st of april was used in these analyses (the data from germany and spain is from the rd of april). the legends in the figures indicate the number of days for which the growth continued in a linear regime. south korea and singapore have two distinct linear regimes, and the durations of both are indicated. the slope from the linear regime was compared with several factors to see if any factor could potentially help interpret the linear regime. it is understood that all these factors are not entirely independent, but some are connected. the cumulative infections and the number of daily cases around the time when the transition occurred are the most correlated. the number of tests performed per single detection is poorly correlated, and other variables such as the population of the region do not appear correlated. as much as the linear regime suggests the end of the exponential growth phase, the correlation of the daily cases with the average number of infections at the time of transition seems to suggest that the growth is only maintained in a "pause", frozen at the state where the quarantines were implemented. the equations (eq. ) were simulated to see if the observations can be captured. the simulation was performed with parameters such as a = . , with transmission via asymptomatic individuals much smaller, at l = . .
at this growth stage of the pandemic, since the data from most countries show a recovery rate of around %, we performed the simulation with µ = 0. the asymptomatic (blue) and recorded infections (red), with two different testing rates r = . and . initiated after the lockdown period ( on the axis which represents the days), are shown in panels a and b. as expected, a transition from exponential to linear is observed. panel c shows the daily new infections when r = . . since eq. ( ) is stochastic, we added an incubation period drawn randomly from a beta-distribution with a mean of days. for this choice of parameters, which were chosen to qualitatively emulate the peak observed in south korea, a low level of daily new infections persists. however, by varying r, it was seen that the average number of these daily new cases decreases with r and increases with the time of lockdown (data not shown). table . daily new infected cases, along with other relevant information for each country or province. *estimated from the german national testing as of april . § the infection data from south korea shows two different linear regimes; the first one, which immediately follows the end of the exponential regime, is considered. ¶ singapore also showed two linear regimes, a very short linear regime early on, followed by what appeared to be an exponential phase; however, since the absolute numbers are very low and noisy during the first linear regime, in this work the second linear regime was considered.
doi: medrxiv preprint
summary of probable sars cases with onset of illness from
modelling strategies for controlling sars outbreaks
who director-general's opening remarks at the media briefing on covid- -
real-time tentative assessment of the epidemiological characteristics of novel coronavirus infections in wuhan, china, as at
impact of non-pharmaceutical interventions (npis) to reduce covid- mortality and healthcare demand
age-structured impact of social distancing on the covid- epidemic in india
countries test tactics in 'war' against covid-
the simulations driving the world's response to covid-
the effect of control strategies to reduce social mixing on outcomes of the covid- epidemic in wuhan, china: a modelling study
substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov- )
data from the official repository of protezione civile
an interactive web-based dashboard to track covid- in real time

key: cord- - q ppl
authors: mandal, s.; kumar, m.; sarkar, d.
title: lockdown as a pandemic mitigating policy intervention in india
date: - -
journal: nan
doi: . / . . .
sha:
doc_id: cord_uid: q ppl

abstract. we use publicly available timeline data on the covid- outbreak for nine indian states to calculate an important quantifier of the outbreak, the sought-after r_t, or time-varying reproduction number of the outbreak. this quantity can be measured in several ways, e.g. by application of the stochastic compartmentalised sir (dcm) model, the poissonian likelihood-based (ml) model & the exponential growth rate (egr) model. the third is known as the effective reproduction number of an outbreak. here we use, mostly, the second one. it is known as the instantaneous reproduction number for an outbreak. this number can faithfully tell us the success of lockdown measures inside indian states, as a containment policy for the spread of the covid- viral disease.
this can also, indirectly, yield a notional value of the generation time interval in different states. in doing this work we employ a pan-india serial interval for the outbreak, estimated directly from data from january th to april th, . simultaneously, in conjunction with the serial interval data, our result is derived from incidence data between march th, and june st, , for the said states. we find the lockdown had a marked positive effect on the nature of the time dependent reproduction number in most of the indian states, barring a couple. the possible reasons for such failures have been investigated. global pandemic outbreaks are very common nowadays. india is no exception. the severe acute respiratory syndrome ( ) [ ] , avian influenza ( ) [ ] and swine flu ( ) [ ] , to name a few. there were others that did not touch upon india but were recent events, such as mers ( ) [ ] & evd (ebola virus disease) ( , , & most recently in ) [ ] . none of the above, however, reached the global pandemic scale attained by the novel coronavirus, aka covid- , in a short span of time, starting at the end of the past year [ , , ] . the global community responded to this unprecedented situation with various policy interventions. wearing masks & face shields [ , ] in public, and social distancing norms [ ] were amongst them. the more drastic & perhaps draconian step of lockdown [ ] was taken by governments across the world as a containment policy measure [ ] . we analyse the effect of lockdown on the propagation of the covid- viral disease. the instantaneous version of the basic reproduction number [ ] of the infection is plotted against time to gauge the success [ ] (or lack thereof) [ ] of this policy intervention in nine different states of india. in the following, it is shown that this pervasive containment policy has borne fruit in most of the considered provinces.
the time dependent or instantaneous reproduction number [ ] is an accurate, in-situ description of the virulence of epidemic diseases. the basic reproduction number [ ] gives us the average number of infectee cases per infector from the previous generation, over a given period of time, in a fully susceptible population. various policy implementations and containment measures appreciably reduce the number of contacts, in turn reducing the effective number [ ] of susceptible contacts per potential infector. epidemiologists have devised a time dependent parameter, the effective reproduction number, to assimilate the effect of policy intervention into the basic reproduction number during an ongoing epidemic. this quantity is defined as follows: consider an individual who turns infectious on day t. we denote by r_e(t) the expected number of secondary cases this infectious individual will cause in the future [ ] . the instantaneous reproduction number, on the other hand, compares the number of new infections on day t with the infection pressure (force of infection) [ ] from the days prior to t. it can be interpreted as the average number of secondary cases that each symptomatic individual at time t would infect, if the conditions remained as they were at time t. hence, the stepwise undulations, crests, troughs & spikes of this estimate are termed instantaneous or real-time measures. there are various ways to calculate these effective, instantaneous & other time varying reproduction numbers. they are as follows. stochastic dynamic contact model-based method [ ] : a stochastic susceptible-infected-removed (sir) model is considered in this case, in place of a deterministic one. the stochastic dynamic model has advantages over the standard deterministic one, in that it better accommodates variability and allows for better quantification of the uncertainties of that number as compared to the standard deterministic model.
here, s(t), i(t) & r(t) denote the numbers of susceptible, infectious and recovered individuals at time t respectively, and n = s(t) + i(t) + r(t) is the total population. the infectious period of an infected individual is a random variable t ∼ exp(γ) & the reproduction rate is r(t) ≈ β·e(t) = β/γ, where β & γ are the transmission rate and recovery rate. the mathematical essence of the model is captured by a set of coupled first-order differential equations, whose deterministic (average) counterparts, which the original denotes with ∼, read ds/dt = -β·s·i/n, di/dt = β·s·i/n - γ·i, dr/dt = γ·i. we set s(0) equal to the population of the region, r(0) = 0, i(0) equal to to times the average number of confirmed cases from day to day , and γ to the inverse of the mean infectious period, obtained from the parametrization of the serial interval distribution collected directly from data, described in section ( ) . the main difficulty with this time varying reproduction number is that it assumes a constant transmissibility, whereas transmissibility may vary, & often peaks, during the generation time interval and just before the onset of symptoms [ ] . this model also cannot accommodate various disease traits like asymptomaticity (non-detection), or human interferences like isolation measures or migration etc. hence we do not look at this method any further here. poissonian likelihood-based (ml) method [ ] : here it is assumed that the total number of secondary infectees infected by a single primary infector follows a poisson distribution. the number of individuals infected on (discrete) date t is usually replaced by the number of daily incidences reported on the same date t. also, the generation time interval is suitably replaced by the corresponding serial interval for all practical purposes. let n_t be the number of reported incidences on day t. it is assumed that the serial interval has a maximum of k days and that the number of new cases generated by an infected individual follows a poisson distribution with parameter [ ] r.
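a minimal sketch of the deterministic counterpart of the sir model just described (euler integration; the population size and rates are hypothetical, not the paper's estimates):

```python
def sir(n, i0, beta, gamma, days, dt=0.1):
    """euler integration of the deterministic sir counterpart:
    ds/dt = -beta*s*i/n, di/dt = beta*s*i/n - gamma*i, dr/dt = gamma*i."""
    s, i, r = float(n - i0), float(i0), 0.0
    for _ in range(int(days / dt)):
        new_inf = beta * s * i / n * dt
        new_rec = gamma * i * dt
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
    return s, i, r

# hypothetical parameters: r0 = beta/gamma = 2 in a population of one million
s_end, i_end, r_end = sir(1_000_000, 10, beta=0.4, gamma=0.2, days=365)
```

with r0 = 2, the classic final-size relation predicts that roughly 80% of the population is eventually infected, so s_end settles near a fifth of the population while the epidemic itself dies out.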
the probability that the serial interval of an individual is j days is w_j, which can be estimated from the empirical distribution of the serial interval, or by setting up a discretized gamma prior on it. note that only the nonnegative values of the serial interval are used here. thus, the likelihood function reduces to a thinned poisson distribution, n_t ∼ poisson(r_i(t) · λ_t), with the infection pressure λ_t = Σ_{j=1..k} w_j n_{t-j}. (this is a medrxiv preprint, which was not certified by peer review; the copyright holder is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. this version was posted in june .) the instantaneous reproduction number can then be estimated by maximising this likelihood function, giving r_i(t) = n_t / λ_t. exponential growth rate-based (egr) method [ ] : in the early days of the epidemic the number of infected cases rises exponentially. the growth rate (malthusian coefficient) r can be estimated by a nonlinear least-squares fit to the daily incidence curve. if the probability density function of the serial interval of the outbreak is denoted by f_λ(t), then the effective reproduction number is given by the euler-lotka (type) equation r_e = 1 / ∫ e^(-rt) f_λ(t) dt. in case we have a nonparametric serial interval distribution, we can define our effective reproduction number as r_e = n / Σ_{i=1..n} e^(-r·λ_i), where λ_i are the observed serial intervals. using the publicly available data on github [ ] , we create a contact list of infector-infectee pairs in the pan-india context, between th january- and th april- . the data is then fitted with a log-normal / gamma distribution to parametrize the values of the mean and standard deviation. from the available data on github [ ] , the daily confirmed case incidences were collected for nine states, for the duration th march- to st june- . applying the poissonian ml method, the instantaneous r_i(t) was plotted for each one of them, as given in figures two to ten.
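the poissonian ml point estimate r_i(t) = n_t / λ_t described above can be sketched in a few lines (the incidence series and serial-interval weights below are toy values):

```python
import numpy as np

def instantaneous_r(incidence, w):
    """mle of the instantaneous reproduction number under the poisson model
    n_t ~ poisson(r_t * lambda_t), with infection pressure
    lambda_t = sum_j w_j * n_{t-j}; the likelihood peaks at r_t = n_t / lambda_t.
    w[j-1] is the probability that the serial interval equals j days."""
    n = np.asarray(incidence, float)
    r = np.full(len(n), np.nan)
    for t in range(1, len(n)):
        lam = sum(w[j - 1] * n[t - j] for j in range(1, min(t, len(w)) + 1))
        if lam > 0:
            r[t] = n[t] / lam
    return r
```

sanity checks: a flat incidence series with weights summing to one gives r = 1, and a series doubling daily with a one-day serial interval gives r = 2.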
it has been seen from the above that in all the states except gujarat & karnataka, the lockdown as a containment measure has been quite successful. in india the lockdown started from th march- . after to days from its commencement, seven provinces of india have shown a significant downtrend of the instantaneous reproduction number. this lag between cause & effect corresponds to the serial time or the generation time interval, which varies from state to state. these two states, however, show an opposite trend. after the passage of about one generation time interval, the instantaneous reproduction number peaks sharply. this might correspond to migration [ ] at the beginning of the lockdown. in karnataka, however, the value of the instantaneous reproduction number fluctuates moderately. this is perhaps due to clustering [ ] or inadequate testing policies [ ] , which may be true for both of the states.
in what follows next are two province-specific case studies. to be doubly sure that our time dependent reproduction numbers [ ] are calculated correctly over time, we fit the daily incidence graphs of the two provinces which showed a contrarian nature in reproduction numbers during the initiation phase of the lockdown. the time dependent reproduction numbers are used to fit the daily incidences. the result is given in fig. ( ) . we show that the lockdown in india was fairly successful, barring a couple of places, due to migration or superspreading etc. we note here that a similar study, with a bigger scope, has been reported elsewhere [ ] . but it assumes a parametric serial interval, which is different from ours. we have deduced our own serial interval (cf. sec. ) by scraping the pan-india raw data and by building our own line list & contact list. hence our result is presumed to be significantly different from theirs and more representative of the actual scenarios [ ] . the effect of the partial lifting of the lockdown (unlock) is also seen in the results, in terms of an increment in r_i(t). acknowledgement. sm wishes to thank k. bhattacharya, k. samanta for useful discussions. he also thanks i. mukhopadhyay for useful help with references. the analysis was performed in the r [ ] statistical programming language environment.
public health interventions and sars spread
avian influenza virus (h n ): a threat to human health
the influenza a (h n ) pdm outbreak in india
middle east respiratory syndrome coronavirus (mers-cov): a review
ebola virus disease epidemic in west africa: lessons learned and issues arising from west african countries
incubation period of novel coronavirus ( -ncov) infections among travellers from wuhan, china
severe acute respiratory syndrome coronavirus : biology and therapeutic options
epidemiology, causes, clinical manifestation and diagnosis, prevention and control of coronavirus disease (covid- ) during the early outbreak period: a scoping review
should we all be wearing face masks? here's why experts are so conflicted. newspaper
what we know about face shields and coronavirus. newspaper
lockdown guidelines: govt's standard operating procedure for social distancing at workplace. newspaper
etbfsi: corona impact - modi announces janata curfew - urges citizens to stay at home on sunday. newspaper
guidelines for demarcation of containment zones to control covid- . newspaper
notes on r
toi: how effective has india's lockdown been in controlling covid. newspaper
why india's lockdown has been a spectacular failure. newspaper
teunis: different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
how will country-based mitigation measures influence the course of the covid- epidemic?
how generation intervals shape the relationship between growth rates and reproductive numbers
force of infection is key to understanding the epidemiology of plasmodium falciparum malaria in papua new guinean children
estimation of the time-varying reproduction number of covid- outbreak in china
improved inference of time-varying reproduction numbers during infectious disease outbreaks
git hub
[ ] pti: , people stranded in uttarakhand to return to gujarat in buses. newspaper
ballari's jindal steel plant emerges covid- cluster.
newspaper even as covid- spikes in rural karnataka, state tests less. newspaper
boëlle: the r package: a toolbox to estimate reproduction numbers for epidemic outbreaks
lockdown effect on covid- spread in india: national data masking state-level trends
how data became one of the most powerful tools to fight an epidemic. newspaper
r: a language and environment for statistical computing. vienna, austria: r foundation for statistical computing
key: cord- -stqj ue authors: prakash, meher k; kaushal, shaurya; bhattacharya, soumyadeep; chandran, akshay; kumar, aloke; ansumali, santosh title: a minimal and adaptive prediction strategy for critical resource planning in a pandemic date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: stqj ue current epidemiological models can in principle model the temporal evolution of a pandemic. however, any such model will rely on parameters that are unknown, which in practice are estimated using stochastic and poorly measured quantities. as a result, an early prediction of the long-term evolution of a pandemic will quickly lose relevance, while a late model will be too late to be useful for disaster management. unless a model is designed to be adaptive, it is bound either to lose relevance over time, or lose trust and thus not have a second chance for retraining. we propose a strategy for estimating the number of infections and the number of deaths that does away with time-series modeling, and instead makes use of a 'phase portrait approach'. we demonstrate that, with this approach, there is a universality to the evolution of the disease across countries that can then be used to make reliable predictions. these same models can also be used to plan the requirements for critical resources during the pandemic. the approach is designed for simplicity of interpretation, and adaptivity over time.
using our model, we predict the number of infections and deaths in italy and new york state, based on an adaptive algorithm which uses early available data, and show that our predictions closely match the actual outcomes. we also carry out a similar exercise for india, where in addition to projecting the number of infections and deaths, we also project the expected range of critical resource requirements for hospitalizations in a location. the world health organization (who) has declared covid- , a disease caused by the novel coronavirus (sars-cov- ), a pandemic. covid- is causing infections and deaths globally on a scale that has not been seen in this century. the virus made a zoonotic transition into humans and then continued person-to-person transfer. analyses of the spanish flu suggest a need for timely action from governments. with no vaccines or treatment options, prevention via social distancing and city- or nation-wide lockdowns has become the catchword for containing the spread of infections. these restrictions on the movement of people of course have adverse effects on the economies. there have been several efforts to model the spread of the pandemic [ ] [ ] [ ] to understand the gravity of the situation that one is facing, as well as to suggest travel restrictions and to guide policy measures on the extent of testing or the implementation of the lockdowns. further, the unexpected surge in patients requiring hospitalizations and intensive care is leading to the collapse of health care systems globally. gearing up for providing health care supplies, including critical equipment such as ventilators, requires an accurate estimate of the number of infections. but the key information that is required for this modelling, which is the number of infected people at any time, is in itself marred by uncertainties. there has been an acute shortage of the testing kits required for extensive screening.
as a result, although symptomatic patients may have been traced or tested, a significant fraction of the global infections are believed to have occurred through asymptomatic patients. although a pandemic such as covid- appears to be a once-in-a-century event, the st century has already seen a few others which drew very close: sars in , swine flu in , mers in . thus, having tools to quickly plan for the critical resource requirements is important. the nature of transmissible diseases such as covid- is that the number of infected people grows exponentially. the leaders of many countries made a sudden switch from complacency to active mitigation policies, as if they were caught by surprise while expecting linear trends. the nature of the numerics associated with the evolution of the long-term pandemic trajectory is that the errors in estimation also increase exponentially, no matter how detailed the model is. an early prediction of a pandemic that forecasts the evolution for many months will accrue errors very quickly, and a model that is developed much later in the evolution will only be relevant for post facto analyses, and not for policy decisions or disaster management. therefore, models should be designed to be adaptive, such that their projections are constantly corrected. they should also be simple enough to be interpretable by the various stakeholders (with differing abilities to understand the detailed models) involved in the management of the pandemic. further, due to the differences in the allowed freedom of movement or the rigour of policy implementation in various states or provinces of different countries, the predictions should be recalculated for each region. the focus of this work is to set up a framework that can be useful for estimating the need for critical resources such as hospital beds and ventilators, the need for which changes with time and with the region (province, state or district) of interest.
inter alia, we also make predictions based on early data for italy and new york state, update them, and demonstrate good agreement between the subsequent evolution of the pandemic and our predictions. our approach can be summarized as follows: the covid- data from most countries suggest that, especially in the growing phase of the pandemic, the number of active cases and the number of hospitalizations are both proportional to the total number of infections: approximately around - % and - %, respectively. this conclusion is arrived at by eliminating "time" as a variable, and focusing only on the above-cited ratios. moreover, it is quite easy to update our estimates on a weekly basis. a simple law that can capture the spread of infections is di/dt = α i (p − i − r)/p, where α is the rate of transmission, i is the number of infected people, r is the number of people who recover from infection and p is the total population. further details can be added by noting that i is the sum of symptomatic and asymptomatic patients, and that α differs for the two groups and across communities depending on their social contact structure or policies of various degrees of isolation. however, this is not done here. when the infections are in a growing phase, the data suggest that i ≫ r, which in turn suggests that i can be used as a proxy for the number of active infections. coming to the focus of this work, we suggest that the number of active infections can be the basis for critical resource planning during the evolution of the pandemic. also, regardless of the scale of the pandemic, it remains manageable only when p ≫ i. this allows one to approximate eq. as di/dt = r i, where r is the effective rate of the spread of infection. the parameter r depends on the infection under consideration. during the early days of the pandemic, when detailed biological understanding of the disease spread is still elusive, it can be a challenge to estimate the parameter r accurately.
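The approximation just described, exponential growth di/dt ≈ r·i while p ≫ i, can be checked with a small numerical sketch. The logistic-type form of the full law and all parameter values below are illustrative assumptions, not fitted quantities.

```python
# Euler integration contrasting a logistic-type growth law,
# di/dt = r * i * (p - i) / p, with its exponential approximation
# di/dt = r * i, valid while p >> i. All values are hypothetical.

def simulate(r, p, days, i0=1.0, full=True):
    i = i0
    for _ in range(days):
        i += r * i * ((p - i) / p if full else 1.0)  # dt = 1 day
    return i

P = 10_000_000                           # hypothetical population
r = 0.25                                 # hypothetical daily growth rate
early_full = simulate(r, P, 30)
early_exp = simulate(r, P, 30, full=False)
# while infections are a tiny fraction of P, the two laws agree closely:
print(round(early_exp / early_full, 3))  # → 1.0
```

Only once infections become a sizeable fraction of the population do the two laws diverge, which is why the exponential form is adequate for early-phase resource planning.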
one approach is to estimate r by fitting the time series of i(t); however, i is an integer-valued variable, and further, its values would be small at the outset. moreover, an analysis of the raw time series data from several countries reveals that there is no universal value for r across countries. this is illustrated in figure: a. even the cumulative value of i does not reveal any universal trend (see figure: b). therefore it is necessary to adopt a different approach. our approach is to eliminate time as a variable. by doing so, we found three universal trends, one partly expected and the others not so immediately apparent: 1. the infection rate is proportional to the number of current infections (figure: a). in principle this is expected from eq . in practice this finding is interesting because the rates from all countries are comparable, especially during the early days of the spread of infection before any policy changes are implemented. this also underscores the fact that the biology of the disease is similar from one country to the other, without further mutations in the virus or differences in immunity levels of the different populations. this commonality across the data from different countries has been previously highlighted by many. 2. interestingly, there is also another universal pattern across the countries, where the rate of deaths caused by the infection is proportional to the cumulative number of deaths (figure: b). further, the fluctuations in these data are smaller than those in figure: a, a feature that we will exploit later. it is not immediately apparent why the rate of deaths should be proportional to the cumulative deaths. one possible explanation is that the number of deaths depends directly on the number of infections. 3. the cumulative number of deaths at any time is directly related to the number of overall infections (figure: a). considering the wide variation across countries in the duration of hospitalization and critical care, it is not apparent why this should be so.
however, the similarity in the data from across the countries is evident. as noted earlier, it is not easy to scale up the number of test kits and to arrange tests in large populations. thus the official count of the number of infected individuals at any given time is far from precise. an alternative way of estimating the number of infections is via monitoring of the number of deaths. the key idea is that the fraction of infected people who may die are most likely to be in hospitals, and thus form a fraction of the infections reported by the state. while this fraction might depend on the country and the quality of testing, it is most likely that an infected person who may die due to other health complications will be in hospital. to do this, we incorporate a scaling factor to account for the extensiveness of the testing. it is believed that, to date, the most extensive screening tests have been performed by south korea, detecting most of the symptomatic and asymptomatic infected individuals. thus, using the data from south korea as a reference standard, the deaths versus infections curve has been readjusted as seen in figure: a. figure : using the i-d plot for estimating the number of infections. these plots depict the universal relation between i and d, barring a corrective factor. this uncertainty factor, calculated from the covid- infection data of south korea as a reference, can be used for estimating infections when the number of deaths is known.
β = (corrected infections)/(reported infections) was assumed to be for south korea. at any time in the evolution of the pandemic shown on this i versus d graph, using the known number of deaths in that country, the number of infections was rescaled to match the infections in south korea, and this factor was used as β. the figure illustrates the procedure used for β when deaths are slightly lower than , equal to , and . of a higher median age (for example in italy) are infected compared to individuals of a slightly lower median age (for example in germany). however, remembering that the objective is to plan for the resources needed for the individuals who are currently infected, the lower death rate is likely to predict the number of infections on the safer side as far as resource planning is concerned. the estimates of infections from these two different approaches will be used to set the limits on the number of infected individuals, and the resources required. the preceding subsection addressed the difficulties in measuring i, the number of infections. in this subsection we discuss the estimation of r, the rate at which the infection spreads. a typical way to estimate r is to fit the time series of i(t) by an exponential. as pointed out earlier, when i is small, fitting a time series to an integer-valued (and possibly noisy) variable is problematic. instead, we use the slope of di/dt versus i (figure: a) to estimate r. instead of applying eq exactly, with all the attendant uncertainties in i and r, we use a piecewise strategy. a simple discretization of this in time steps of ∆t gives i(t + ∆t) = i(t) + r_{t,t+∆t} · i(t). it was seen from the covid- infection data from several countries that, as the number of daily new infections fluctuates considerably, the trends in the data can be captured only by following the data over a few days.
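The slope-based estimate of r can be sketched as follows: regress the daily new cases di/dt against i through the origin, rather than fitting an exponential to the time series. The case counts below are synthetic, generated with a known growth rate of 0.2 per day, not real data.

```python
# Estimating r from the slope of di/dt versus i with a least-squares fit
# through the origin, instead of an exponential fit to the time series.
# The case counts are synthetic (noise-free, 20% daily growth).

def slope_through_origin(xs, ys):
    """Least-squares slope of y = r * x: r = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

I = [100 * 1.2 ** t for t in range(15)]  # synthetic infections
dI = [b - a for a, b in zip(I, I[1:])]   # daily new cases
r_hat = slope_through_origin(I[:-1], dI)
print(round(r_hat, 3))  # → 0.2
```

For noisy real counts, the through-origin regression pools information across the whole phase portrait, which is less fragile than fitting an exponential to a short, small-valued time series.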
we work with a ∆t = days, since inferring r from data drawn over a shorter duration, or making a prediction more fine-grained than this, does not seem relevant from a practical point of view. the multiplicative factors r_{t,t+∆t}, estimated week by week using the universal patterns observed in figure: b, are given in supplementary table . we make two predictions for i(t): one using the reported infections from the country, and another using the reported deaths. the scaling factor β is re-calibrated once a week in an adaptive fashion, using the data until that week. in essence, the exponential function of time has been replaced by a geometric series for each week. the results from adaptively predicting the number of covid- related deaths in italy are illustrated in figure: . figure : adaptive predictions for new york state. the adaptive predictions were performed for the data from new york state, just as described in figure: . a lockdown can cause two changes in our model. first, depending on the levels of restrictions on movement, such as social distancing, r_{t,t+∆t} will reduce or even drop to zero. the r_{t,t+∆t} in the weeks following important restrictions on movement in different countries was also used as a lower bound on how fast the new infections can decrease.
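The week-by-week relation i(t + ∆t) = i(t) + r_{t,t+∆t} · i(t) amounts to a geometric series with a piecewise ratio, and can be sketched directly. The weekly factors below are hypothetical, mimicking growth that slows after a lockdown; they are not the values from the supplementary table.

```python
# Week-by-week projection i(t + dt) = i(t) + r * i(t), with dt = one week.
# The weekly multiplicative factors are hypothetical placeholders.

def project(i0, weekly_factors):
    traj = [i0]
    for r in weekly_factors:
        traj.append(traj[-1] * (1.0 + r))  # one geometric step per week
    return traj

weekly_r = [1.0, 0.8, 0.4, 0.1]  # hypothetical, re-estimated each week
print(project(1000, weekly_r))   # → [1000, 2000.0, 3600.0, 5040.0, 5544.0]
```

Because each ratio is re-estimated from the most recent week of data, early mispredictions do not compound over the full horizon, which is the adaptivity the text argues for.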
second, when a country imposes a lockdown with restrictions on international or interstate movement, the different physically separated regions will follow their own independent growth trajectories. the weekly growth factor to be applied to each specific state or province will depend upon the number of infected in the province at the time of the decision, with some regions lagging behind the others by weeks. one thing we learnt from covid- is that each country or province, before it becomes a transmission vector for other places, has a lead of several weeks or months over those places. the lead of course depends on the extent of people travelling between these places. a disease that was once confined to a specific region diffuses locally and jumps to distant locations, almost like a jump-diffusion process. there is a self-replication phenomenon at different scales, wherein the universalities in the patterns between nations are now reflected between the individual states or provinces. this observation can be used by the governments of these provinces to plan for the worst-case scenario. each pandemic is different in its ability to persist, infect, and diffuse through the population. while the medical knowledge and critical equipment required may differ from earlier pandemics, as long as the universalities discussed in this work are seen in the data, one can certainly learn from other countries, states or provinces that experienced the same pandemic a few weeks or months earlier. (table ). table lists the projected number of ventilators required in each state as well as the gross national estimate for a period of four weeks beginning from april th, . in this work we aimed to provide tools for critical resource planning in the wake of a pandemic. we do this by introducing three attributes that are not common in epidemiological modeling: we used the number of deaths as an additional and surrogate marker for the otherwise uncertain number of infections; given that the data are stochastic, and with several unknowns, we focused on making weekly predictions based on a geometric series with a piecewise adaptive ratio. adapting the geometric ratio week by week serves two objectives: (i) smoothing out the daily fluctuations in the data, and (ii) bridging the gap between predictions made using early data and current data. the simplicity of the formalism and its data-aware design permit us to make adaptive predictions for the next few weeks, starting at any time. although covid- is a rare pandemic, the quick, simple and adaptive principles laid out here, embracing the limitations of testing and drawing on the universalities of the progression, should be relevant for any pandemic spreading through person-to-person contact.
a familial cluster of pneumonia associated with the novel coronavirus indicating person-to-person transmission: a study of a family cluster
the effect of public health measures on the influenza pandemic in us cities
public health interventions and epidemic intensity during the influenza pandemic
real-time forecasts of the covid- epidemic in china
early dynamics of transmission and control of covid- : a mathematical modelling study
an interactive web-based dashboard to track covid- in real time
real-time tentative assessment of the epidemiological characteristics of novel coronavirus infections in wuhan, china
impact of international travel and border control measures on the global spread of the novel coronavirus outbreak
the effect of travel restrictions on the spread of the novel coronavirus
on fast multishot epidemic interventions for post lock-down mitigation: implications for simple covid- models
age-structured impact of social distancing on the covid- epidemic in india
substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov )
countries test tactics in war against covid-
the authors declare no competing interests. key: cord- -qcghtkk authors: russo, lucia; anastassopoulou, cleo; tsakris, athanasios; bifulco, gennaro nicola; campana, emilio fortunato; toraldo, gerardo; siettos, constantinos title: tracing day-zero and forecasting the covid- outbreak in lombardy, italy: a compartmental modelling and numerical optimization approach date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: qcghtkk introduction: italy became the second epicenter of the novel coronavirus disease (covid- ) pandemic after china, surpassing by far china's death toll. the disease swept through lombardy, which remained in lockdown for about two months, starting from the th of march. as of that day, the isolation measures taken in lombardy were extended to the entire country.
here, assuming that effectively there was one case "zero" that introduced the virus to the region, we provide estimates for: (a) the day-zero of the outbreak in lombardy, italy; (b) the actual number of asymptomatic infected cases in the total population until march ; (c) the basic (r( )) and the effective reproduction number (r(e)) based on the estimation of the actual number of infected cases. to demonstrate the efficiency of the model and approach, we also provide a tentative forecast two months ahead of time, i.e. until may , the date on which relaxation of the measures commenced, on the basis of the covid- community mobility reports released by google on march . methods: to deal with the uncertainty in the number of the actual asymptomatic infected cases in the total population (volpert et al. ( )), we address a modified compartmental susceptible/ exposed/ infectious asymptomatic/ infected symptomatic/ recovered/ dead (seiird) model with two compartments of infectious persons: one modelling the cases in the population that are asymptomatic or experience very mild symptoms, and another modelling the infected cases with mild to severe symptoms. the parameters of the model corresponding to the recovery period, the time from the onset of symptoms to death, and the time from exposure to the time that an individual starts to be infectious, have been set as reported from clinical studies on covid- . for the estimation of the day-zero of the outbreak in lombardy, as well as of the "effective" per-day transmission rate for which no clinical data are available, we have used the proposed seiird simulator to fit the numbers of new daily cases from february to the th of march. this was accomplished by solving a mixed-integer optimization problem. based on the computed parameters, we also provide an estimation of the basic reproduction number r( ) and the evolution of the effective reproduction number r(e).
to examine the efficiency of the model and approach, we ran the simulator to "forecast" the epidemic two months ahead of time, i.e. from march to may . for this purpose, we considered the reduction in mobility in lombardy as released on march by the google covid- community mobility reports, and the effects of social distancing and of the very strict measures taken by the government on march and march , . results: based on the proposed methodological procedure, we estimated that the expected day-zero was january (min-max range: january to january , interquartile range: january to january ). the actual cumulative number of asymptomatic infected cases in the total population in lombardy on march was of the order of times the confirmed cumulative number of infected cases, while the expected value of the basic reproduction number r( ) was found to be . (min-max range: . - . ). on may , the date on which relaxation of the measures commenced, the effective reproduction number was found to be . (interquartiles: . , . ). the model adequately approximated, two months ahead of time, the evolution of reported cases of infected until may , the day on which phase i of the relaxation of measures was implemented over all of italy. furthermore, the model predicted that until may , around % of the population in lombardy had recovered (interquartile range: ∼ % to ∼ %).
the butterfly effect in chaos theory underscores the sensitive dependence on initial conditions, highlighting the importance of even a small change in the initial state of a nonlinear system. the emergence of a novel coronavirus, sars-cov- , which caused a viral pneumonia outbreak in wuhan, hubei province, china in early december , has evolved into the covid- acute respiratory disease pandemic due to its alarming levels of spread and severity, with more than . million cases and , deaths globally, as of may , ( [ , ] ).
the old continent, seemingly far from the epicenter, became the second-most impacted region after asia pacific, mostly as a result of a dramatic divergence of the epidemic trajectory first in italy, where there have been , total confirmed infected cases and , deaths, and then in spain, where there have been , total confirmed infected cases and , deaths, as of may , ( [ , ] ). the second largest outbreak outside of mainland china officially started on january , , after two chinese visitors staying at a central hotel in rome tested positive for sars-cov- ; the couple remained in isolation and was declared recovered on february [ ] . a -year-old man repatriated back to italy from wuhan, who was admitted to the hospital in codogno, lombardy on february , was the first secondary infection case ("patient "). "patient " was never identified by tracing the first italian citizen's movements and contacts. in less than a week, the explosive increase in the number of cases in several bordering regions and in the two autonomous provinces of trento and bolzano (the northernmost in italy) placed enormous strain on the decentralized health system. following a dramatic spike in deaths from covid- , italy was transformed into a "red zone", and the movement restrictions were expanded to the entire country on the th of march. all public gatherings were cancelled, and school and university closures were extended through at least the next month. in an attempt to assess the dynamics of the outbreak for forecasting purposes, it is important to estimate epidemiological parameters that cannot be computed directly from clinical data, such as the transmission rate (or, as it is otherwise called, the "effective contact rate") of the disease and the basic reproduction number, r . the transmission rate is defined as the product of the probability of transmitting the virus given a contact between a susceptible and an infected individual and the average rate of contacts between susceptibles and infected.
r is defined as the expected number of exposed cases generated by one infected case in a population where all individuals are susceptible [ ] . since the first confirmed covid- case, many mathematical modelling studies have already appeared. the first models mainly focused on the estimation of the basic reproduction number r using dynamic mechanistic mathematical models ( [ ] [ ] [ ] [ ] ), but also simple exponential growth models (see e.g. [ , ] ). compartmental epidemiological models like sir, sird, seir and seird have been proposed to estimate other important epidemiological parameters, such as the transmission rate, and for forecasting purposes (see e.g. [ , ] ). other studies have used metapopulation models, which include data on human mobility between cities and/or regions, to forecast the evolution of the outbreak in other regions/countries far from the original epicenter in china [ , , , ] , including the modelling of the influence of travel restrictions and other control measures in reducing the spread [ ] . among the perplexing problems that mathematical models face when they are used to estimate epidemiological parameters and to forecast the evolution of the outbreak, two stand out: (a) the uncertainty regarding the day-zero of the outbreak, the knowledge of which is crucial for assessing the stage and dynamics of the epidemic, especially during the first growth period, and (b) the uncertainty that characterizes the actual number of asymptomatic infected cases in the total population (see e.g. [ , ] ). at this point we should note that what has been done until now with dynamical epidemiological models is the investigation of several scenarios, including different "days-zero", or fixing the day-zero and running different levels of asymptomatic cases, etc. 
to cope with the above problems, we herein address a methodological framework that provides estimates for the day-zero of the outbreak and the number of asymptomatic cases in the total population in a systematic way. towards this goal, and for our demonstrations, we address a conceptually simple seird model with a total of five compartments, with one of them modelling the asymptomatic infected cases in the population and another one modelling the part of the infected cases that will experience mild to severe symptoms, a significant share of which will be hospitalized, admitted to intensive care units (icus) or die from the disease. the proposed approach is applied to lombardy, the epicenter of the outbreak in italy. furthermore, we provide a two-month-ahead forecast, from march (the day of lockdown of all of italy) to may (the first day of the relaxation of the strict isolation measures). the above tasks were accomplished by the numerical solution of a mixed-integer optimization problem using the publicly available data of daily new cases for the period february -march , and the covid- community mobility reports released by google on march . we address a compartmental seiird model that includes two categories of infected cases, namely the asymptomatic (unknown) cases in the total population and the cases that develop mild to more severe symptoms, a significant share of which are hospitalized or admitted to icus, and a part of which dies. in agreement with other studies and observations, our modelling hypothesis is that the confirmed cases of infected are only a (small) subset of the actual number of asymptomatic infected cases in the total population [ , , ] . regarding the confirmed cases of infected as of february , a study conducted by the chinese cdc, which was based on a total of , cases in china, found that about . % of the cases were mild and could recover at home, . % severe and . % critical [ ] . 
on the basis of the above findings, in our modelling approach it is assumed that the asymptomatic or very mildly symptomatic cases recover from the disease relatively soon and without medical care, while the recovery of the other category of infected lasts, on average, longer than that of the non-confirmed cases, and they may also be hospitalized, admitted to icus or die from the disease. based on the above, let us consider a well-mixed population of size n. the state of the system at time t is described by (see also fig for a schematic) s(t) representing the number of susceptible persons, e(t) the number of exposed, i(t) the number of asymptomatic infected persons in the total population who experience very mild or no symptoms and recover relatively soon without any other complications, i c (t) the number of infected cases who may develop mild to more severe symptoms, a significant part of whom are hospitalized, admitted to icus or die, r(t) the number of asymptomatic cases in the total population that recover, r c (t) the number of recovered cases that come from the compartment i c , and d(t) the reported number of deaths. for our analysis, and for such a short period, we assume that the total number of the population remains constant. based on demographic data, the total population of lombardy is n = m; its surface area is , . km² and the population density is * inhabitants/km². the rate at which a susceptible (s) becomes exposed (e) to the virus is proportional to the density of infectious persons i. the proportionality constant is the "effective" disease transmission rate, say β = c̄p, where c̄ is the average number of contacts per day and p is the probability of infection upon a contact between a susceptible and an infected individual. 
our main assumption here is that only a fraction, say ε, of the actual number of exposed cases e will experience mild to more severe symptoms, denoted as i c (t), and a significant part of them will be hospitalized, admitted to icus or die. thus, we assume that the infected persons that belong to the compartment i c go into quarantine at home or are hospitalized and, thus, for any practical means, do not transmit the disease further. here, it should be noted that a wide testing policy may also result in the identification of asymptomatic cases belonging to the compartment i that would then be assigned to compartment i c . however, as a generally reported rule in italy, tests were conducted only for those who presented for treatment with symptoms like fever and coughing. thus, people who did not seek medical attention were tested very scarcely [ ] [ ] [ ] . thus, for any practical means, the compartment i c reflects the reported confirmed infected cases. a fraction of the i c cases, given by the fatality ratio f = d(t)/(i c (t) + r c (t) + d(t)), dies with a mortality rate δ, the inverse of which is the average time from the onset of symptoms to death, while the remaining part (1 − f) of the i c compartment recovers with a rate γ c , the inverse of which corresponds to the average time from the onset of symptoms to full recovery. we note that while more compartments could conceptually be included, we aimed at keeping a low level of complexity in order to avoid the introduction of more parameters, and thus a model that would suffer from the "curse of dimensionality". at this point we should note that on march , the date of the general lockdown, the number of confirmed infected cases was , , the number of cases in icus was and the number of hospitalized persons was , [ ] . 
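The "emergent" case fatality ratio f defined above can be computed directly from the reported series; a minimal sketch (the function name is illustrative, not from the paper):

```python
def case_fatality_ratio(Ic, Rc, D):
    """Emergent case fatality ratio f = D(t) / (Ic(t) + Rc(t) + D(t)),
    i.e. deaths over all confirmed-compartment outcomes known so far."""
    return D / (Ic + Rc + D)
```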
that is, until march , the number of confirmed cases was approximately equal to the number of hospitalized cases and the cases that were admitted to icus. therefore, until march , any difference between the asymptomatic cases, as represented in our model by the compartment i, and the compartment i c would approximately reflect the level of under-reporting of the actual asymptomatic cases in the total population. we should also note that in the available data of reported cases [ ] there is no distinction between cases that recovered at home and those that recovered in, and were dismissed from, hospitals. thus, in the absence of such information, if one were to consider the hospitalized cases as a separate category, an extra parameter would have to be introduced (the fraction of recovered cases dismissed from hospitals). on the one hand, such a piece of information is not available, and, on the other, such an attempt would add an extra degree of freedom that would need calibration or to be fixed at a certain value; due to the small size of the data and the "curse of dimensionality", this would also introduce unnecessary computational burden and further modelling uncertainty. thus, our discrete mean field compartmental seiird model reads:

s(t) = s(t − 1) − (β/n) s(t − 1) i(t − 1)
e(t) = e(t − 1) + (β/n) s(t − 1) i(t − 1) − σe(t − 1)
i(t) = i(t − 1) + (1 − ε)σe(t − 1) − γi(t − 1)
i c (t) = i c (t − 1) + εσe(t − 1) − (1 − f)γ c i c (t − 1) − fδi c (t − 1)
r(t) = r(t − 1) + γi(t − 1)
r c (t) = r c (t − 1) + (1 − f)γ c i c (t − 1)
d(t) = d(t − 1) + fδi c (t − 1)

the above system is defined in discrete time points t = 1, 2, . . ., with the corresponding initial condition at the very start of the outbreak (day-zero). the term fδi c (t − 1) represents the fraction f of the i c cases that dies with a mortality rate δ, and the term (1 − f)γ c i c (t − 1) represents the complementary part (1 − f) of the i c cases that recovers with a rate γ c . 
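One daily update of the discrete SEIIRD model described above can be sketched in Python. This is a minimal illustration under the stated assumptions (only the asymptomatic compartment I transmits; Ic is effectively quarantined), not the authors' code, and any parameter values passed in would be placeholders rather than the paper's estimates:

```python
def seiird_step(state, beta, sigma, gamma, gamma_c, delta, f, eps, N):
    """One daily step of the discrete SEIIRD model.

    state = (S, E, I, Ic, R, Rc, D); eps is the fraction of exposed that
    enter the confirmed compartment Ic, f the emergent case fatality ratio.
    """
    S, E, I, Ic, R, Rc, D = state
    new_exposed = beta * S * I / N   # S -> E, driven only by asymptomatic I
    new_infectious = sigma * E       # E -> I (fraction 1-eps) or Ic (fraction eps)
    return (
        S - new_exposed,
        E + new_exposed - new_infectious,
        I + (1 - eps) * new_infectious - gamma * I,
        Ic + eps * new_infectious - (1 - f) * gamma_c * Ic - f * delta * Ic,
        R + gamma * I,
        Rc + (1 - f) * gamma_c * Ic,
        D + f * delta * Ic,
    )
```

By construction the seven compartments sum to the constant population size N at every step, matching the constant-population assumption in the text.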
the parameters of the model are:

• σ (d−1), the average per-day "effective" rate at which an exposed person becomes infectious,
• γ (d−1), the average per-day "effective" recovery rate within the group of asymptomatic cases in the total population,
• γ c (d−1), the average per-day "effective" recovery rate within the subset of the i c infected cases that finally recover,
• δ (d−1), the average per-day "effective" mortality rate within the subset of i c infected cases that finally die,
• f, the probability that an i c case will die; here, this is given by the "emergent" case fatality ratio, computed as f = d(t)/(i c (t) + r c (t) + d(t)),
• ε, the fraction of the actual (all) cases of exposed in the total population that enter the compartment i c .

here, we should note the following: as new cases of recovered and dead appear at each time t with a time delay (which is generally unknown, but an estimate can be obtained from clinical studies) with respect to the corresponding infected cases, the above per-day rates are not the actual ones; thus, they are denoted as "effective/apparent" rates. the values of the epidemiological parameters σ, γ, γ c , δ that were fixed in the proposed model were chosen based on clinical studies. in particular, in many studies that use seird models, the parameter σ is set equal to the inverse of the mean incubation period (time from exposure to the development of symptoms) of a virus. however, the incubation period does not generally coincide with the time from exposure to the time that someone starts to be infectious. regarding covid- , it has been suggested that an exposed person can be infectious well before the development of symptoms [ ] . with respect to the incubation period for sars-cov- , a study in china [ ] suggests that it may range from - days, with a median of . days. 
another study in china, using data from , patients with laboratory-confirmed -ncov ard from hospitals in provinces/provincial municipalities, suggested that the median incubation period is days (interquartile range: to ). in our model, as explained above, 1/σ represents the period from exposure to the onset of the contagious period. thus, based on the above clinical studies, for our simulations we have set 1/σ = . regarding the recovery period, in a study that is based on , laboratory-confirmed cases, the who-china joint mission has reported a median time of weeks from onset to clinical recovery for mild cases, and - weeks for severe or critical cases [ ] . based on the above, and on the fact that within the subset of confirmed cases the mild cases are the % [ ] , we have set the recovery rate for the confirmed cases' compartment to γ c = / in order to balance the recovery period with the corresponding characterization of the cases (mild, severe/critical). the average recovery period of the unreported/non-confirmed part of the infected population, which in our assumptions experiences the disease like the flu or a common cold, is set equal to one week [ ] , i.e. we have set γ = 1/7. this choice is also based on reports on the serial interval of covid- . the serial interval of covid- is defined as the time duration between a primary case-patient (infector) having symptoms and a secondary case-patient having again symptoms. for example, it has been reported that the serial interval for covid- is estimated at . - . days [ ] ; for the case of lombardy, the average serial interval has been estimated to be . days [ ] . in our model, the 1/σ = period refers to the period from exposure to the onset of the contagiousness, during which there are obviously no symptoms. thus, the serial interval in our model is days (this is the average number of days in which an infectious person becomes recovered and no longer transmits the disease). importantly, there are studies (see e.g. 
nishiura et al. [ ] ) suggesting that a substantial proportion of secondary transmission may occur prior to illness onset. thus, the period of days that we have taken as the average period during which an infectious person can transmit the disease before he/she recovers reflects exactly this; it refers to the serial interval for the cases that are asymptomatic and for cases with mild symptoms. finally, the median time from the onset of symptoms until death for italy has been reported to be eight days [ ] ; thus, in our model we have set δ = 1/8. we have set f = % for the optimization. for the forecasting (i.e. for the period march to may ), the value of f was not fixed, but was computed dynamically at each day t through the model simulations. the transmission rate β, as it cannot in general be obtained by clinical studies but only by mathematical models, was estimated through the optimization process. regarding day-zero in lombardy, which is also unknown and was estimated by the optimization process, what has been officially reported is just the date on which the first infected person was confirmed to be positive for sars-cov- . that day was february , , which is the starting date of the public data release of confirmed cases. the day-zero of the outbreak, the per-day "effective" transmission rate β, and the ratio ε were computed by the numerical solution of a mixed-integer optimization problem with the aid of genetic algorithms to fit the reported data of daily new cases (see the discussion in [ ] ) from february to march , the day of the lockdown of lombardy. as already mentioned, on march , the number of confirmed infected cases in the population was , , the number of cases in icus was and the number of hospitalized persons was , [ ] . that is, until march the number of mild and severe cases that were hospitalized and admitted to icus was approximately equal to the number of confirmed infected cases. 
thus, for the period of calibration, it is reasonable to assume that for any practical means the number of confirmed cases was approximately the same as the number of those that experienced mild to more severe symptoms and were admitted for medical care. thus, until march , the parameter ε also reflects the level of under-reporting of the asymptomatic cases in the total population. here, for our computations, we have used the genetic algorithm "ga" provided by the global optimization toolbox of matlab [ ] to minimize the following objective function:

min Σ t [ w 1 (Δi(t) − Δi seird (t))² + w 2 (Δr(t) − Δr seird (t))² + w 3 (Δd(t) − Δd seird (t))² ],

where Δx seird (t), (x = i, r, d), are the daily new cases resulting from the seird simulator at time t and Δx(t) are the corresponding reported daily new cases. the weights w 1 , w 2 , w 3 are scalars serving in the general case as weights to the relevant terms for balancing the different scales between the number of infected, recovered cases and deaths. the convergence tolerance was set to .e − , the population size (distributed between the lower and upper bounds) was selected as resulting from min(max( ·nvars, ), ) for mixed-integer problems (here nvars = ), ceil( . ·populationsize) individuals are guaranteed to survive to the next generation, the migration fraction was set to . , while the number of generations that take place between migrations of individuals between subpopulations was set to [ ] . at this point we should note that the above optimization problem may in principle have multiple nearby optimal solutions (mnos). finding and assessing the information contained in mnos (known also as niching) is a particularly challenging problem [ ] . here, we created a grid of initial guesses within the intervals in which the optimal estimates were sought: for the day-zero (t 0 ) we used a step of days within the interval december , until the th of february, i.e. ± days around the th of january; for β we used a step of . within the interval ( . , . ); and for ε we used a step of . within the interval ( . , . ). 
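The objective minimized by the genetic algorithm, a weighted sum of squared errors between the simulator's daily increments and the reported daily new infected, recovered and deaths, can be sketched as follows (the function name and argument layout are illustrative; the paper itself uses MATLAB's `ga`):

```python
import numpy as np

def sse_objective(model_incr, reported_incr, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of squared errors between the daily increments produced
    by the SEIIRD simulator (Delta-I, Delta-R, Delta-D) and the reported
    daily new cases, recovered and deaths."""
    total = 0.0
    for w, model, reported in zip(weights, model_incr, reported_incr):
        diff = np.asarray(model, float) - np.asarray(reported, float)
        total += w * np.sum(diff ** 2)
    return total
```

The three weights allow the infected, recovered and death series, which live on very different scales, to be balanced against each other, as described in the text.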
the numerical optimization procedure was repeated times for each combination of initial guesses. for our computations, we kept the best fitting outcome for each combination of initial guesses. next, in order to reveal structured patterns of distributions vs. uniformly random distributions, we fitted the resulting probability distributions of the optimal values using several functions, including the normal, log-normal, weibull, beta, gamma, burr [ ] , exponential and birnbaum-saunders [ ] distributions, and kept the one resulting in the maximum log-likelihood (see the supporting information for more details). for the computed parameters of the corresponding best distributions, we also provide the corresponding % confidence intervals. such fitting can demonstrate that the obtained values are not uniformly distributed. finally, we have run simulations based on all obtained values to assess the efficiency of the model and the obtained results, as well as the forecasting uncertainty, until may . for our computations, we used the parallel computing toolbox of matlab a [ ] utilizing intel xeon cpu x cores at . ghz. estimation of the basic reproduction number. here, we note that we provide an estimation of the basic reproduction number r based on the estimation of the total number of (asymptomatic) infected cases in the population. thus, it is expected that the estimated r will be larger than the ones reported using just the confirmed number of cases; the latter may underestimate the actual r . initially, when the spread of the epidemic starts, all the population is considered to be susceptible, i.e. s ≈ n. on the basis of this assumption, we computed the basic reproduction number based on the estimates of the epidemiological parameters computed using the data from the st of february to the th of march with the aid of the seiird model given by eqs ( )-( ) as follows. note that there are three infected compartments, namely e, i, i c , and two of them (e, i) determine the outbreak. 
thus, considering the corresponding equations given by eqs ( ), ( ) and ( ), and that at the very first days of the epidemic s ≈ n and d ≈ 0, the jacobian of the system as evaluated at the disease-free state, restricted to the (e, i) subsystem (the compartment i c does not feed back into transmission), reads:

j = | 1 − σ      β     |
    | (1 − ε)σ   1 − γ |

the eigenvalues (that is, the roots of the characteristic polynomial of the jacobian matrix) dictate whether the disease-free equilibrium is stable or not, that is, whether an emerging infectious disease can spread in the population. in particular, the disease-free state is stable, meaning that an infectious disease will not result in an outbreak, if and only if all the norms of the eigenvalues of the jacobian j of the discrete time system are bounded by one. jury's stability criterion [ ] (the analogue of the routh-hurwitz criterion for discrete-time systems) can be used to determine the stability of the linearized discrete-time system by analysis of the coefficients of its characteristic polynomial. the characteristic polynomial of the jacobian matrix reads:

p(λ) = λ² + a 1 λ + a 2 ,

where a 1 = −(2 − σ − γ) and a 2 = (1 − σ)(1 − γ) − β(1 − ε)σ. the necessary conditions for stability read p(1) > 0 and p(−1) > 0; the sufficient condition for stability is |a 2 | < 1. the first inequality, p(1) > 0, results in the necessary condition

β(1 − ε)/γ < 1.

it can be shown that the second necessary condition and the sufficient condition are always satisfied for the range of values of the epidemiological parameters considered here. thus, the necessary condition above is also a sufficient condition for stability. hence, the disease-free state is stable if and only if this condition is satisfied. note that in this necessary and sufficient condition, the fraction (1 − ε)/γ is the average infection time of the compartment i. thus, the above expression reflects the basic reproduction number r , which is qualitatively defined by r = β × infection time. hence, our model results in the following expression for the basic reproduction number:

r = β(1 − ε)/γ.

note that for ε = 0, the above expression simplifies to r = β/γ, the expression for the simple sir model. 
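The stability condition can be double-checked numerically: the spectral radius of the (E, I) linearization crosses one exactly where β(1 − ε)/γ crosses one. A sketch with illustrative parameter values (not the paper's estimates):

```python
import numpy as np

def spectral_radius(beta, sigma, gamma, eps):
    """Spectral radius of the discrete-time (E, I) linearization around the
    disease-free state, with S ~ N (Ic does not feed back into transmission)."""
    J = np.array([[1.0 - sigma, beta],
                  [(1.0 - eps) * sigma, 1.0 - gamma]])
    return max(abs(np.linalg.eigvals(J)))

def r0(beta, gamma, eps):
    """Basic reproduction number from the stability condition: beta*(1-eps)/gamma."""
    return beta * (1.0 - eps) / gamma
```

For parameter values with r0 > 1 the spectral radius exceeds one (an outbreak grows), and for r0 < 1 it stays below one, as the Jury-criterion argument predicts.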
estimation of effective reproduction number. the effective reproduction number r e represents the average number of secondary infections from an infectious individual when the population already contains non-susceptible individuals. for the calculation of r e , we use the next generation matrix approach [ ] . the next generation matrix g with elements g ij is formed by the average number of secondary infections of type i from an infected individual of type j. formally, it is constructed as g = fv −1 , where f contains the transmission rates of the model and v contains the transition rates between the infected compartments. the effective reproduction number r e is the spectral radius, i.e. the dominant eigenvalue, of g. thus,

r e (t) = (β(1 − ε)/γ) · s(t − 1)/(n − d(t − 1) − r c (t − 1) − i c (t − 1)).

as discussed, we used the proposed approach to forecast the evolution of the pandemic in lombardy from march to may , i.e. from the first day of lockdown to the first day of the relaxation of the social isolation. our estimation regarding the reduction of the "effective" transmission rate as of march was based on the combined effects of prevention efforts and behavioral changes. in particular, our estimation was based on (a) the covid- community mobility reports released by google on march [ ] , and (b) an assessment of the synergistic effects of control measures such as the implementation of preventive containment in workplaces, stringent "social distancing", and the ban on social gatherings, as well as the public awareness campaign prompting people to adopt cautious behaviors to reduce the risk of disease transmission (see also [ ] [ ] [ ] [ ] ). the effect of the distribution of contacts at home, at work, when travelling, and during leisure activities can also be assessed. 
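The closed-form expression for r e (t) above translates directly into code; a minimal sketch (parameter values in the test are illustrative; with S = N and no removals it reduces to the basic reproduction number):

```python
def effective_reproduction_number(beta, gamma, eps, S, N, D, Rc, Ic):
    """R_e(t) from the next-generation-matrix result: the basic reproduction
    number beta*(1-eps)/gamma scaled by the susceptible share of the
    still-mixing population (deaths D, confirmed cases Ic and their
    recovered Rc are removed from the denominator)."""
    return beta * (1.0 - eps) / gamma * S / (N - D - Rc - Ic)
```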
for example, based on an analysis of the social contacts and mixing patterns relevant to the spread of infectious diseases that was conducted in various countries, it has been found that for italy, around - % of all physical social contacts during a day are attributed to workplaces, around - % to schools, - % to transportation, % to leisure activities, - % to home, % to other activities (contacts made at locations other than home, work, school, travel, or leisure) and - % to contacts made at multiple other locations during the day, not just at a single location [ ] . on the basis of the google covid- community mobility report released on march [ ] , the average reduction in mobility in lombardy during the period february -march , compared to the period before february , was * % in retail & recreation activities, * % in transit stations, * % in workplaces, while mobility increased in parks by * % and was almost the same in groceries and pharmacies. in the period march to march , mobility was reduced by an average of * % in retail & recreation activities, by * % in transit stations, by * % in workplaces, by * % in parks and by * % in groceries and pharmacies. thus, taking into account the coarse effect of the different activities on physical contact [ ] , the average reduction in mobility was of the order of * % when compared to the period february -march . in fact, on march , based on the release of mobile phone data, the vice-president of lombardy announced that the average mobility in the region (for distances of more than meters) had been reduced by * % with respect to the period before february [ ] . on march , the government announced the implementation of even stricter measures that included the closure of all public and private offices, the closing of all parks, walking only in the vicinity of one's residence and not even in pairs, and the prohibition of mobility to second houses [ , ] . 
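Translating the per-activity mobility reductions into an overall contact reduction is, in effect, a contact-share-weighted average; a minimal sketch (the shares and reductions used in the test are placeholders, since the exact percentages are not recoverable from this text):

```python
def average_contact_reduction(reductions, contact_shares):
    """Contact-weighted average reduction: each activity's mobility drop is
    weighted by the share of daily physical contacts attributed to it."""
    assert abs(sum(contact_shares) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(r * s for r, s in zip(reductions, contact_shares))
```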
according to the google covid- community mobility reports [ ] , from march - until april , activities were reduced by an average of * % in retail & recreation activities, by * % in transit stations, by * % in workplaces, by * % in parks and by * % in groceries and pharmacies. thus, taking into account the coarse effect of the different activities on physical contact [ ] , the average reduction in mobility was of the order of * % when compared to the period of february -march . a further reduction may be attributed to behavioral changes [ ] . for example, it has been shown that social distancing and cautiousness reduce the disease transmission rate by about % [ ] . thus, based on the above, it is reasonable to consider a ( − . ) reduction in the "effective" transmission rate. as discussed, for our computations we ran the numerical optimization procedure times for each combination of initial guesses, based on the daily reported new cases from february to . for all the near-optimal points obtained using the genetic algorithm optimization, the residuals were of the order of * , , . regarding the values of the optimal parameters, we fitted their cumulative probability distributions using several functions, including the normal, log-normal, weibull, beta, gamma, burr, exponential and birnbaum-saunders functions, and kept the one yielding the maximum log-likelihood (see the s file). table summarizes the mean values of the optimal parameters and their interquartiles, for the day-zero, the transmission rate (β) and the fraction (ε) of the actual cases of exposed in the total population that enter the compartment i c , and also the information on the values of the parameters of the best-fitting distributions. note that the optimal values of day-zero were between january -january (interquartile range: january to january ) (see s fig in s file) , the optimal values of β were between . and . (interquartile range: . to . ) (see s fig in s file) , and the optimal values of ε were between . and . (interquartile range: . 
to . ) (see s fig in s file) . the best fit to the distribution of optimal values of the day-zero was obtained using a normal cdf with mean . (i.e. ∼ days before the th of january) ( % ci: . , . ) and variance . ( % ci: . , . ); thus, taking the round value at days, the expected day-zero corresponds to january (interquartile range: january to january ). the best fit to the distribution of the optimal values of β was given by fitting a burr cdf with α = . ( % ci: . , . ), c = . ( % ci: . , . ), k = . ( % ci: . , . ), having a mean value of . (interquartile range: . to . ). finally, the best fit to the distribution of the optimal values of ε was given by fitting a birnbaum-saunders cdf with parameters μ = . ( % ci: . , . ) (scale parameter) and α = . ( % ci: . , . ) (shape parameter), resulting in an expected value of . (interquartile range: . to . ) (see the s file). thus, based on the derived values of the "effective" per-day disease transmission rate, the basic reproduction number r is . (min-max range: . - . ). finally, we ran the simulator for all values of the optimal triplets from the corresponding distributions as found by the solution of the optimization problem, using the data from february until march , then from march to march for validation purposes, and from march to may for forecasting purposes. figs ( )-( ) depict the simulation results based on the optimal estimates, until the th of march. to validate the model with respect to the reported data of confirmed cases from march to march , we have considered a ( − . )( − . ) reduction in the "effective" transmission rate and, as initial conditions, the values resulting from the simulation on march , as described in the methodology.

table . optimal parameter values and interquartiles for day-zero, transmission rate (β) and fraction (ε) of the cases of exposed individuals in the total population that enter the compartment i c . the resulting value of r along with the minimum and maximum value is also given.

thus, the model approximated fairly well the dynamics of the pandemic in the period from february to march . as discussed in the methodology, we also attempted, based on our methodology and modelling approach, to forecast the evolution of the outbreak until may , the first day of the relaxation of the measures. to do so, as described in the methodology, we have considered a ( − . )( − . ) reduction in the "effective" transmission rate starting on march (compared to the period february -march ), the day of the announcement of even stricter measures in the region of lombardy (see the methodology). the result of our forecast is depicted in fig . as shown, the model predicts fairly well the evolution of the epidemic two months ahead of march (the model parameters and the day-zero were estimated using the reported data from february to march ). note that, regarding the confirmed cases, the mean values of i c (t) over all simulations almost coincide with the reported values of the confirmed cases (see top panel of fig ). the reported recovered cases are in the lower part of the model predictions (r c (t)) (see middle panel of fig ). also, the model predicts fairly well the total number of deaths until may (see bottom panel of fig ). however, on may there is a significant difference between the mean value of deaths as obtained from the simulations and the actual number of deaths. this is due to the following reasons. first, the difference that is observed with respect to the reported number of deaths and the model forecasts in the period from march to april can be attributed to facts that the model did not take into account, such as the saturation of icus in that period, which could potentially lead to a larger number of deaths. indeed, on march , new deaths were reported. however, we note that the model was calibrated with the data reported until march . 
until march , the case fatality ratio was of the order of * - % and the number of cases admitted to icus was relatively small. after march , when there were many cases admitted to icus, the death toll increased significantly, almost doubling, thus reaching * - % of the confirmed cases. second, the difference that is observed between the model predictions and the actual number of deaths on may is due to the fact that we did not calibrate the model parameters to take into account the changes in the death and recovery rates. it is expected that when the evolution of the epidemic is abrupt, as it was in lombardy in the period from march until early april, due to the saturation of the health system and icus, the mortality rates would be higher and the recovery periods longer. indeed, during this period it has been reported that the median time from the onset of symptoms until death in lombardy was eight days [ ] , which is a very short period. for example, in other studies it has been reported that the time between symptom onset and death ranged from about weeks to weeks, which is from two to seven times longer than the one reported in lombardy [ , ] . this explains the difference between the simulated and the reported number of deaths on may . here, for our demonstrations, we have used the information about the epidemiological parameters that was available at the early phase of the epidemic. thus, we must underline the very good agreement of the model predictions with the actual number of confirmed infected cases for the entire period march -may . this number, in contrast to the numbers of recovered and deaths, is not shaped/biased by the capacity of icus, and this is the number on which policy makers should focus in order to decide ahead of time on the necessary resources (e.g. the necessary number of beds and icus) needed to keep the fatality ratio as low as possible. 
the model predicts that until may , an average of % of the population in lombardy has already recovered (interquartile range: * % to * %) (see fig. ). finally, in fig. , we report the evolution of the effective reproduction number r e until may . on may , the estimated mean value of r e was . (interquartile range: . to . ), thus marking a critical point for the onset of the post-lockdown period.

the crucial questions about an outbreak are how, when (day-zero), and why it started, and if and when it will end. answers to these important questions would add critical knowledge to our arsenal to combat the pandemic. the tracing of day-zero, in particular, is of utmost importance. it is well known that minor perturbations in the initial conditions of a complex system, such as those of an outbreak, may result in major changes in the observed dynamics. undoubtedly, a high level of uncertainty about day-zero, as well as the uncertainty in the actual number of exposed people in the total population, raises several barriers to our ability to correctly assess the state and dynamics of the outbreak and to forecast its evolution and its end. such pieces of information would lower these barriers and help public health authorities respond fast and efficiently to the emergency. for example, an over- or under-estimation of day-zero would result in an under- or over-estimation of the transmission rate β, and therefore of the basic reproduction number r and, consequently, the effective reproduction number r e . furthermore, the correct estimation of day-zero is important for the assessment of the number of asymptomatic and actual recovered cases in the total population. this in turn will bias the assessment and, ultimately, the design of efficient control policies in real time. this study aimed exactly at shedding more light on this problem.
to achieve this goal, we adopted a conceptually simple compartmental seiird model with two infectious compartments, in order to bridge the gap between the number of asymptomatic cases in the total population and the cases that will experience mild to more severe symptoms. what has been done until now with mathematical epidemiological models is the investigation of several scenarios, e.g., by changing (or assuming fixed) the initial day (day-zero), the level of asymptomatic cases, etc. our work is the first that introduces a methodological framework to estimate the day-zero as well as the level of asymptomatic cases in the total population in a systematic way. following the proposed methodological framework, we found that the day-zero in lombardy was around the middle of january, a period that precedes by one month the date of the first confirmed case in the hardest-hit northern italian region of lombardy. interestingly enough, when we submitted a preprint of this work at medrxiv on march [ ], another study that was submitted at medrxiv on the same day, based on genomic and phylogenetic data analysis, reported the same time period, between the second half of january and early february, as the time when the novel coronavirus sars-cov- entered northern italy [ ]. our analysis further revealed that the actual number of asymptomatic infected cases in the total population in the period until march was around times the number of confirmed infected cases, which until march was also approximately equal to the number of cases that were hospitalized and admitted to icus.
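the seiird structure described above (susceptible and exposed compartments feeding two infectious compartments, which in turn feed recovered and dead) can be sketched as a deterministic system integrated with a simple euler scheme. all parameter values, compartment names, and the population size below are illustrative assumptions for exposition, not the calibrated values of the study:

```python
# minimal deterministic SEIIRD sketch with two infectious compartments:
# S -> E -> (Ia asymptomatic | Is symptomatic) -> R or D.
# all rates below are illustrative placeholders, not calibrated values.

def seiird_step(state, beta, sigma, p_sym, gamma, mu, n, dt=1.0):
    """advance the system one euler step of size dt (days)."""
    s, e, ia, isym, r, d = state
    force = beta * (ia + isym) / n          # force of infection
    ds = -force * s
    de = force * s - sigma * e              # sigma: 1 / incubation period
    dia = (1 - p_sym) * sigma * e - gamma * ia
    dis = p_sym * sigma * e - (gamma + mu) * isym
    dr = gamma * (ia + isym)                # recovery from both compartments
    dd = mu * isym                          # deaths only from symptomatic
    return [x + dt * dx for x, dx in
            zip(state, (ds, de, dia, dis, dr, dd))]

def simulate(days, n=10_000_000, e0=1.0):
    """run the sketch from a single exposed case; returns the trajectory."""
    state = [n - e0, e0, 0.0, 0.0, 0.0, 0.0]
    traj = [state]
    for _ in range(days):
        state = seiird_step(state, beta=0.6, sigma=1 / 3,
                            p_sym=0.1, gamma=1 / 7, mu=0.01, n=n)
        traj.append(state)
    return traj
```

note that the total population is conserved by construction (the compartment derivatives sum to zero), which is a useful sanity check when modifying such a sketch.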
our model and methodological approach assume that there was one effective "zero" infected case that introduced the virus to the region; one could certainly argue that more than one case introduced the virus to the region on the same day; such scenarios can be investigated in a straightforward manner with our proposed methodological approach. furthermore, the proposed approach could be used for the quantification of the uncertainty of the evolving dynamics, taking into account the distributions of the epidemiological parameters reported in clinical studies rather than their expected values. a critical point connected with the above is that, with such a small number of infectious individuals at the initial stage of the simulations, a stochastic or hybrid stochastic model could be more realistic, in which uncertainty could be modelled in the form of realistic perturbations (see, for example, [ ]). furthermore, we did not consider the effect of an ongoing sampling strategy in the total population for the estimation of the level of under-reporting (represented in our model by a dedicated parameter). as mentioned in the methodology, within the first period of the outbreak the number of confirmed cases was approximately equal to the number of hospitalized cases, i.e., there was no sampling strategy. furthermore, as also reported for the later period, tests were conducted only for those who sought medical care and had symptoms like fever and coughing. thus, people who did not seek medical attention were tested very scarcely [ ] [ ] [ ]. we will consider this type of modelling and analysis in future work.
regarding the forecasting in lombardy from march until may (the first day of relaxation of the measures), we have taken into account the very latest facts on the drop in human mobility, as released by google [ ] until april for the region of lombardy; these were shaped by the very strict measures announced on march - , which included the closure of all parks, public and private offices, and the prohibition of any pedestrian activity, even individually [ ]. our modelling approach approximated fairly well the reported number of infected cases in lombardy two months ahead of time. the mean value of the evolution of the compartment that in our model reflects the confirmed cases almost coincides with the reported cases for the entire period from march to may . the differences observed between the reported number of deaths and the simulations, as discussed in the results, should be attributed to the very short times from the onset of symptoms to death that were reported in lombardy at the early phases of the pandemic in the region ( days instead of the to weeks reported in other studies for china), which is linked with the saturation of the icus [ , ]. furthermore, another important factor missing from the data used is that the number of deaths is largely affected by the criteria of death notification. indeed, as it has been reported, the global coronavirus death toll could be % higher than the confirmed one [ ]. however, this fact did not affect the model predictions with respect to the confirmed cases, but rather the rate with which the confirmed cases die; the latter is also dependent on the rate of the outbreak and the relevant capacity of the icus. to this end, we would like to make a final comment with respect to the basic reproduction number r , the significance and meaning of which are very often misinterpreted and misused, thereby leading to erroneous conclusions. here, we found an r of * .
, which is higher than the values reported by many studies in china, and also in italy, and in lombardy in particular. for example, zhao et al. estimated r to range between . ( % ci: . , . ) and . ( % ci: . , . ) in the early phase of the outbreak [ ]. similar estimates were obtained for r by imai et al.: . ( % ci: . , . ) [ ]. regarding italy, d'arienzo and coniglio [ ] used an sir model to fit the reported data in nine italian cities and found that r ranged from . to . . in another study, the authors provided an estimate of the basic reproduction number by analyzing the first , laboratory-confirmed cases; by doing so, they estimated the basic reproduction number at . [ ]. first, we would like to stress that r is not a biological constant for a disease, as it is affected not only by the pathogen but also by many other factors, such as environmental conditions and demographics, as well as, importantly, the social behavior of the population (see, for example, the discussion in [ ]). thus, a value of r found in one part of the world (e.g., in china), or even in a region of the same country, e.g., in tuscany, italy, cannot be generalized as a global biological constant for other parts of the world, or even for other regions of the same country. obviously, the environmental factors and social behavior of the population in lombardy are different from those, for example, prevailing in hubei. second, most of the studies that provide estimates of the basic reproduction number are based solely on the reported cases; thus, the actual number of infected cases in the total population, who may be asymptomatic but transmit the disease, is not considered; this may lead to an underestimation of the basic reproduction number.
moreover, in our approach, as compared to clinical studies, the computation of r comes out of the necessary and sufficient condition for stability derived from the proposed model, whose parameters are computed based on the available reported data, thus with a delay with respect to the first actual case (see also the discussion in [ ]). our conceptually simple model and approach do not aspire to accurately describe the complexity of the emergent dynamics, which in any case is an overwhelmingly difficult, if not impossible, task in the long run, even with the use of detailed agent-based models. we tried to keep the structure of the model as simple as possible in order to be able to model (in a coarse way) the uncertainty in both the "day-zero" and the number of actual asymptomatic cases in the total population using as few parameters as possible. the results of our analysis have indeed shown that the modelling approach succeeded in providing fair predictions of the evolution of the epidemic two months ahead of time. such an early assessment would help authorities to evaluate the measures required to control the epidemic, such as the scale of diagnostic tests that have to be performed and the number of icu beds required. while more complicated models can, in principle, be constructed to take into account more detailed information, such as the number of hospitalized patients and patients in icus, for any practical purpose such an approach would suffer from the "curse of dimensionality", as it would introduce many more parameters that would need calibration based on a relatively small amount of data, especially at the beginning of an outbreak. an attempt to compute the values of some of these additional parameters, which can only be roughly estimated by clinical studies at the early stages of an emerging novel infectious disease, would introduce additional uncertainty, thereby further complicating matters rather than solving the problem.
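as a worked illustration of the point that r emerges from model structure rather than being a standalone biological constant, the closed form below gives the basic reproduction number for a hypothetical model with one asymptomatic and one symptomatic infectious compartment, via the standard next-generation argument (expected secondary cases = transmission rate times mean infectious time, averaged over the symptomatic split). the parameter names and values are assumptions for illustration, not the study's calibrated quantities:

```python
def r0_two_compartment(beta, p_sym, gamma, mu):
    """basic reproduction number for a hypothetical model with an
    asymptomatic compartment (mean infectious time 1/gamma) and a
    symptomatic compartment (mean infectious time 1/(gamma + mu)),
    entered with probabilities (1 - p_sym) and p_sym respectively."""
    return beta * ((1 - p_sym) / gamma + p_sym / (gamma + mu))
```

with p_sym = 0 this reduces to the familiar sir-type expression beta / gamma, which is a convenient consistency check; the epidemic threshold is then r > 1.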
to this end, we hope that our conceptually simple, but pragmatic, modelling approach and methodological framework help to provide improved insights into the currently uncontrolled pandemic and to contribute to the mitigation of some of its severe consequences. author contributions

- coronavirus disease (covid- ). situation report
- coronavirus covid- global cases by
- covid- : preparedness, decentralisation, and the hunt for patient zero
- complexity of the basic reproduction number (r )
- nowcasting and forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study
- transmissibility of -ncov
- estimating the scale of covid- epidemic in the united states: simulations based on air traffic directly from wuhan, china
- data-based analysis, modelling and forecasting of the covid- outbreak
- preliminary estimation of the basic reproduction number of novel coronavirus ( -ncov) in china, from to : a data-driven analysis in the early phase of the outbreak
- covid- and italy: what next?
- breaking down of the healthcare system: mathematical modelling for controlling the novel coronavirus ( -ncov) outbreak in wuhan
- estimating the risk on outbreak spreading of -ncov in china using transportation data
- using predicted imports of -ncov cases to determine locations that may not be identifying all imported cases
- the effect of travel restrictions on the spread of the novel coronavirus (covid- ) outbreak
- coronavirus: scientific insights and societal aspects
- current us coronavirus cases are "just the tip of the iceberg," former usaid director says
- the epidemiological characteristics of an outbreak of novel coronavirus diseases (covid- ) in china
- in one italian town, we showed mass testing could eradicate the coronavirus
- why are we not doing more swabs?
- covid- : the situation in italy
- prevention: how covid- spreads
- early transmission dynamics in wuhan, china, of novel coronavirus infected pneumonia
- report of the who-china joint mission on coronavirus disease
- how will country-based mitigation measures influence the course of the covid- epidemic?
- the early phase of the covid- outbreak in lombardy, italy ( )
- serial interval of novel coronavirus (covid- ) infections
- superiore di sanità: characteristics of covid- patients dying in italy. report based on available data
- avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to ebola
- niching methods and multimodal optimization performance
- cumulative frequency functions
- a new family of life distributions
- inners and stability of dynamic systems
- the construction of next-generation matrices for compartmental epidemic models
- advancing the right to health: the vital role of law, world health organization
- case investigations of infectious diseases occurring in workplaces, united states
- quantifying social distancing arising from pandemic influenza
- nonpharmaceutical measures for pandemic influenza in nonhealthcare settings: social distancing measures, emerging infectious diseases
- social contacts and mixing patterns relevant to the spread of infectious diseases
- % [of the population] still move around milan; soldiers in the streets,
roadblocks in rome
- coronavirus: lombardy ordinance with new restrictions
- modeling the interplay between human behavior and the spread of infectious diseases
- real estimates of mortality following covid- infection
- tracing day-zero and forecasting the covid- outbreak in lombardy, italy: a compartmental modelling and numerical optimization approach
- genomic characterisation and phylogenetic analysis of sars-cov- in italy
- bounded noises in physics, biology, and engineering
- who statement regarding cluster of pneumonia cases in wuhan
- global coronavirus death toll could be % higher than reported | free to read
- early transmission dynamics in wuhan, china, of novel coronavirus infected pneumonia
- assessment of the sars-cov- basic reproduction number, r , based on the early phase of covid- outbreak in italy
- a new framework and software to estimate time-varying reproduction numbers during epidemics

key: cord- - qt vf
authors: chakraborty, amartya; bose, sunanda
title: around the world in days: an exploratory study of impact of covid- on online global news sentiment
date: - -
journal: j comput soc sci
doi: . /s - - -
sha:
doc_id: cord_uid: qt vf

the world is going through an unprecedented crisis due to the covid- outbreak, and people all over the world are forced to stay indoors for safety. in such a situation, the rise and fall of the number of affected cases or deaths has turned into a constant headline in most news channels. consequently, there is a lack of positivity in the world-wide news published in different forms of media. texts based on news articles, movie reviews, tweets, etc. are often analyzed by researchers and mined for determining opinion or sentiment, using supervised and unsupervised methods.
the proposed work takes up the challenge of mining a comprehensive set of online news texts for determining the prevailing sentiment in the context of the ongoing pandemic, along with a statistical analysis of the relation between the actual effect of covid- and online news sentiment. the amount and observed delay of the impact of the ground-truth situation on online news is determined on a global scale, as well as at the country level. the authors conclude that at a global level, the news sentiment has a substantial dependence on the number of new cases or deaths, while the effect varies for different countries and is also dependent on regional socio-political factors.

we are in the midst of a global crisis, owing to the outbreak and spread of the covid- virus, and the substantially damaging influence of this viral infection has forced the world health organization (who) to declare the ongoing situation a pandemic. as per the official statement of who, "covid- is the infectious disease caused by the most recently discovered coronavirus. this new virus and disease were unknown before the outbreak began in wuhan, china, in december . covid- is now a pandemic affecting many countries globally" [ ]. as a precautionary or preventive response to this declared pandemic, countries all over the world have introduced restrictions on mobility and transportation, referred to as lockdowns. consequently, citizens are being asked to stay indoors as a measure of safety from the infection. in this age of a multitude of news channels and popular virtual social frameworks aimed at better connectivity, a massive share of the time spent indoors is undoubtedly invested in engaging with such media. this is corroborated by a recent study [ ], which revealed that there has been about a % increase in news consumption by watching television or on smartphones, due to constant indoor presence.
a primary obsession of people during this pandemic is the changing statistics of affected or deceased people world-wide, and, needless to say, such articles form the crux of the news that the different media channels publish. this virus outbreak has also raised a plethora of other controversial issues, leading to continuing debates and discussions with consequences at both local and global levels. as a whole, it is apparent that only a limited amount of news is delivered on a positive note. the impact of negativity in the news is a long-standing concern and has been addressed from time to time [ , ], but the prevailing situation is predicted to leave a long-lasting and damaging impact on mental health and human psychology as a whole [ ]. meanwhile, the day-to-day statistics of deaths or counts of affected patients due to the pandemic are expected to influence the news sentiment too. the authors have taken up the challenge of determining the news sentiment during a fixed period of study, as well as analyzing the influence of world-wide and country-wise statistics on the news sentiment during the selected duration. the organization of the paper is as follows: the "literature review" section gives a brief description of the studied related works and the motivations drawn for the current work; the details of each data corpus used in the work are provided in the "data description" section; the "data processing" section lists the techniques used for processing the comprehensive data corpora; the experiments and observations are discussed in the "experiment : sentiment analysis", "experiment : statistical analysis", "experiment : n-gram analysis", and "experiment : case studies" sections; finally, the concluding remarks are offered in the "conclusion" section.
the challenge of opinion mining as an application field of data mining is well addressed, and there have been multiple works in this domain with a variety of solutions based on the increasing availability of growing datasets. a vast majority of these works are dedicated to the challenge of sentiment analysis in text collections of different types. similar to challenges in other domains, the task of sentiment analysis can be approached either as a supervised classification problem or with an unsupervised approach for sentiment identification [ ]. the number of works that have addressed the problem of sentiment analysis with a supervised approach exceeds the number that have used unsupervised, exploratory techniques. for a supervised sentiment classification problem, the primary requirement is that the text corpus needs to be labeled, i.e., each text string in the whole data set needs to be annotated as belonging to a particular class: positive, negative, or neutral in this case. a study of the state-of-the-art works reveals that for previously annotated texts, mostly based on twitter data, blog posts, web logs, movie reviews, etc., researchers have used some common machine learning techniques, namely support vector machines, naive bayes [ ] [ ] [ ] [ ] [ ] [ ], or even deep convolutional neural networks [ , ], etc. it is a general observation that such techniques are more efficient in sentiment analysis tasks than unsupervised approaches. also, the overall performance of supervised algorithms in opinion mining challenges is generally lower than that in other domains [ ]. on the other hand, the task of analyzing sentiment is more challenging with the use of unsupervised learning techniques. also, such techniques are often more suited for mining the sentiment from bulky sources of data.
identification of semantic orientation [ ], a comparative study showing the low performance of the sentiwordnet lexicon in sentiment analysis [ ], the development of novel emoji- and linguistic content-based lexicons using an unsupervised approach [ , ], a sentiment polarity detection system using an unsupervised approach on turkish movie reviews [ ], etc. are all interesting research works that use an unsupervised approach. the application of standard lexicons such as sentiwordnet [ ], afinn [ ], etc. in unsupervised sentiment classification is widely studied and evaluated in different works [ ] [ ] [ ]. these lexicon-based techniques are employed in solving interesting problems, such as analyzing the sentiment of the characters in shakespeare's plays [ ], opinion mining from clinical discharge summaries [ ], the development of bias-aware systems [ ], etc. other popular methods for sentiment identification include k-means [ , , ], latent dirichlet allocation (lda) [ , ], etc. in all such cases, it is seen that the inherent simplicity, lack of training, and lower computation requirements of unsupervised approaches make them easier to use on, and learn from, data corpora of substantially large size [ ]. a survey of state-of-the-art research using unsupervised lexicon-based approaches on text data shows that most such works are based on exploratory sentiment analysis and the evaluation of classification techniques on different types of data. however, there is a relatively small amount of research that has worked with news data, and almost all such works are based on financial news and stock price prediction [ ] [ ] [ ] [ ] [ ], etc. similarly, there are only a few works regarding the statistical effect of real-world events on the overall sentiment of global news, mostly related to the financial sector [ ] [ ] [ ], etc.
in this technologically developed era, people are engrossed in the news media, and agenda setting [ ] has a crucial role to play in times of a crisis. researchers have often examined the role played by mass media in determining or setting the agenda in response to a particular incident or event, and how this is rapidly propagated among the audience [ ]. obviously, it entails a number of problems as well as lucrative opportunities for the media agencies, as explored in [ ]. in a related context, the work by kirk et al. [ ] analyzes agenda setting and media policies in response to a disaster. while the proposed work does not focus on these issues, the authors wish to highlight the underlying role of media in maintaining global public sentiment and mental health given the ongoing covid- -related crisis. the news media need to be responsible as well as alert to ensure the proper propagation of awareness and shaping of public sentiment, particularly involving second-level agenda setting [ , ]. given these observations and the ongoing pandemic, the authors were motivated to make the following research contributions:
- the current work determines the general sentiment of news articles during the ongoing pandemic with unsupervised and transfer learning-based approaches;
- this is the only work, to the authors' knowledge, that determines the implications of temporal statistics in a pandemic situation on news sentiment throughout the world during a fixed period of study. the current work statistically determines how, and after what amount of delay, the number of affected patients and the number of deaths due to covid- impact the sentiment in regional and world-wide news;
- the authors also analyze other relevant factors that contribute to the rise or fall of global news sentiment related to particular countries.
the proposed work uses data regarding the daily news articles published online globally, as well as the statistical details of day-to-day cases and deaths due to covid- throughout the world. accordingly, two comprehensive data sets have been used in this work, as described below.

the unlabeled news data described in the previous section have been processed in this part of the work. all of the steps discussed below are performed for each day's data, to generate usable corpora for the experiments.
- data merging: there are files containing news snippets from each day, and these are initially merged to generate a single data repository per day. thereafter, the steps described below are followed for processing.
- removing numbers: the news text contained in the merged corpus for each day of the study is processed using regular expression-based operations.
- removing stop words: a common approach is followed to remove the words that are not useful in the sentiment analysis process but which make up a significant part of any text. examples of such words are: and, for, is, the, to, at, in, etc.
- stemming: as a last step of processing the news articles, stemming is applied to derive the root form of the inflected or derived words in each cleaned string. such derived words are used to propagate different grammatical concepts such as mood, tense, voice, etc. as a simple example, the words working, works, and worked all have the same stemmed form, work.
once all the above steps have been performed, the processed texts for the total duration of the current study are merged into a single file containing over . million distinct news articles. the merged news data corpus consisting of comprehensive, cleaned strings from the previous step is unlabeled in nature, i.e., the news articles are not originally assigned any particular sentiment label.
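the cleaning steps above (number removal, stop-word filtering, stemming) can be sketched as follows. the stop-word set and the crude suffix stemmer are minimal stand-ins for illustration only; a real pipeline would typically use, e.g., nltk's stop-word corpus and its porter stemmer:

```python
import re

# tiny illustrative stop-word set; nltk's stopwords corpus is much larger
STOP_WORDS = {"and", "for", "is", "the", "to", "at", "in", "of", "a", "an"}

def stem(word):
    # crude suffix stripping in the spirit of the porter stemmer;
    # a real pipeline would use nltk.stem.PorterStemmer instead
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_article(text):
    """lowercase, drop digits, tokenize, drop stop words, stem."""
    text = re.sub(r"\d+", " ", text.lower())   # remove numbers
    tokens = re.findall(r"[a-z]+", text)       # keep alphabetic tokens only
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

applied once per day's merged file, this yields the cleaned token lists on which the sentiment scoring operates.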
consequently, machine learning, classification-based sentiment analysis is not directly possible on this data set. for sentiment prediction, the cleaned text articles for each day are scored using two different approaches, namely the afinn lexicon [ ] in an unsupervised learning approach, and a naive bayes [ ]-based transfer learning approach trained on a popular movie reviews dataset [ ]. a lexicon is a comprehensive collection of words, and afinn is one such widely used lexicon, consisting of over words, where each word has a corresponding sentiment score. this polarity score lies between + and − , and every string in the cleaned news text is analyzed by applying the afinn lexicon to generate a corresponding sentiment score. as an example, the string "it was a good memory" is analyzed and scored word by word using afinn, where the scores are , , , , and , respectively, giving a total score of + . evidently, the stop words have no role to play in such analysis, and thus they have been removed during text processing in the previous section. the scores determined using the afinn lexicon are then converted to sentiment categories. for this purpose, all texts with a score less than are labeled negative, those with a score equal to are neutral, and all remaining texts are annotated as positive. a notable observation is that such approaches consider only single-word constructs, or unigrams, for sentiment scoring. this is a prime weakness of such an approach, as it fails to capture the inherent essence of different multi-word constructs in english, and fails to recognize emotions and complexities of the language. in contrast, the trained naive bayes classifier uses its knowledge about sentiment polarity from the aforementioned movie reviews corpus, and correspondingly applies it to assign a sentiment category to each day's news articles.
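a minimal sketch of the unigram scoring and thresholding described above. the tiny in-line lexicon is a stand-in for the full afinn word list (the `afinn` python package wraps the real lexicon); the words and valences shown are assumptions for illustration:

```python
# tiny stand-in for the AFINN lexicon; the real lexicon contains
# thousands of scored words with valences between -5 and +5
MINI_AFINN = {"good": 3, "bad": -3, "death": -2, "recover": 2}

def afinn_score(tokens):
    """sum unigram valences; words outside the lexicon score 0."""
    return sum(MINI_AFINN.get(t, 0) for t in tokens)

def afinn_label(tokens):
    """map the summed score to a sentiment category, as in the text:
    score < 0 -> negative, score == 0 -> neutral, score > 0 -> positive."""
    score = afinn_score(tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

the unigram weakness mentioned above is visible here directly: a phrase like "not good" would still score positively, since each word is scored in isolation.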
unlike afinn, this supervised classification approach considers the complete text at a time and is more sensitive to emotions, inherent figures of speech, and multi-word constructs in the language used. this approach also gives a different view of the studied corpus of news texts, and returns the sentiment category for each news article. in this manner, for every piece of cleaned news text, we now have an overall sentiment score (for afinn) and a sentiment category (for the naive bayes classifier), which is either positive or negative, for that string. the news data corpora for different days do not consist of the same number of text articles, and each news article has a different sentiment category predicted by afinn and the trained naive bayes classifier. therefore, there is a need for normalization before any comparative study of news sentiment on different days is conducted. for this purpose, a negativity index for each day is calculated and is used as an indicator of the overall negative sentiment in news on that day. the index for the ith day is calculated as:

( ) neg_i = (number of articles of negative category) / (total number of news articles)

similarly, indices for positive sentiment (pos_i) and for neutral news articles (neu_i) are determined using the analogous ratios. these index values are calculated for the comprehensive data on news articles for the duration of the study. the overall spread of these sentiment indices, as determined by the analysis using the unigram-based afinn, is shown in fig. , while fig. illustrates the same as analyzed by the naive bayes-based classifier.

[figure: illustration of the significance of the three sentiments in global news during the period of study, determined using the afinn lexicon. news with neutral sentiment has minimum presence, and positive news sentiment seems to be slowly catching up with the negativity.]

[figure: illustration of the significance of the three sentiments in global news during the period of study, determined using naive bayes. news with neutral sentiment has minimum presence, and there is a substantial gap between the positivity and negativity in news sentiment.]

notably, with the use of the latter, substantially large negativity (about %) and low positivity (about %) values are detected, whereas the neutrality decreases by more than % and is deemed almost irrelevant to the study at hand. also, in both cases, it is obvious that any fall in negativity results in an increase in positive sentiment, and vice versa; therefore, news of neutral sentiment plays a negligible role. consequently, a statistical study of the sentiment indices determined in both approaches reveals that negative sentiment has the highest mean, followed by the mean for positive news articles. also, these two sentiments show almost similar deviation during the studied duration, using both scoring techniques. finally, it is evident from both pairs of figures that the overall variation in sentiment patterns is more profound in the detection by the afinn lexicon, in spite of its poorer sentiment detection performance, and it is therefore selected for the experiments in the next section. the most commonly occurring words in the news articles with negative sentiment, for the complete duration of study, are illustrated in fig. .

this is the next set of experiments, where two separate sets of data are utilized, namely:
- the world-wide news-based negativity index values from the previous experiment, determined using the afinn lexicon-based approach, as the variation of sentiment polarity is found to be greater in that case, and
- the number of new cases and the number of deaths per million of the population.
these corpora are analyzed to determine the underlying relation between the variation of news sentiment and the ground reality of cases and deaths due to the covid- pandemic.
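the per-day normalization into negativity, positivity, and neutrality indices can be sketched as follows, assuming a list of predicted labels for one day's articles:

```python
from collections import Counter

def sentiment_indices(labels):
    """normalize one day's predicted labels into neg/pos/neu indices,
    i.e. the fraction of that day's articles in each category.
    assumes `labels` is non-empty."""
    counts = Counter(labels)
    total = len(labels)
    return {s: counts.get(s, 0) / total
            for s in ("negative", "positive", "neutral")}
```

by construction the three indices for a day sum to one, which is what makes days with different article counts directly comparable.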
To statistically determine the link between news negativity and the number of cases or deaths due to the pandemic, the distribution of each of these variables must first be examined. The respective distributions show that all three variables used in this work follow a near-normal (near-Gaussian) [ ] distribution, so it is feasible to determine the statistical relation between them directly. Initially, an attempt is made to visually relate the distributions of features from the two data corpora. In one figure, the number of confirmed COVID-19 cases during the span of the study is represented as a bar plot, with the negativity index values in global news plotted over the same duration as a line plot.

[Word-cloud caption: specific words present in each day's most negative news articles. The relatively large size of words such as death, fatality, case, coronavirus, died, infection, and hospitalized reflects their frequency of occurrence during the period of study.]

It is seen that peaks in news negativity are quite often related to rises in the number of cases, as seen in the variations of both variables over different sets of days; the decreasing step pattern in the number of cases during two intervals of the study is distinctly reflected in the news negativity plot as well. A companion figure gives the number of daily deaths as bar stacks, with the same negativity line plot. Here there is not much similarity in trends between the two series during the first days; some similarity in the data patterns is evident in a middle interval, after which there is no visible similarity. In both of the above cases, however, it is observed that similar patterns in the news occur at a delay of a few days.
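The paper assesses near-normality visually. A crude numeric screen, under the assumption that sample skewness and excess kurtosis near zero indicate a roughly Gaussian shape, could look like this (the function name and any thresholds are illustrative):

```python
from statistics import mean

def skew_and_excess_kurtosis(xs):
    """Sample skewness and excess kurtosis; both are 0 for a perfect
    Gaussian, so small magnitudes suggest a near-normal distribution."""
    m = mean(xs)
    n = len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    skew = sum(((x - m) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - m) / sd) ** 4 for x in xs) / n - 3.0
    return skew, kurt
```

A symmetric sample yields skewness near zero, while a long right tail yields positive skewness; this mirrors what a visual inspection of the histograms would show.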
This can be attributed to the fact that day-to-day statistics are not reported immediately; they generally take at least a day or two to appear and to make an impact on global news sentiment. This observation leads to the need to determine the optimal time window at which the trends in the corpora are most similar. From the previous section, it is observed that the trends in news negativity are broadly affected by variations in the number of cases and the number of deaths, and that this impact is visible at a delay of a few days. It is therefore necessary to statistically determine the exact delay at which the news sentiment reflects the reality of the situation. The similarity between two variables can be measured by their correlation coefficient. In this part of the experiment, the authors determine the correlation coefficient r_n between the news sentiment and the number of cases or deaths, using a set of sliding windows on the news sentiment index values, where each window is shifted n days ahead of the actual duration of the study, for a small set of values of n. That is, to re-create the most visibly aligned variations, the same set of values for the number of cases or deaths is correlated against news negativity index values taken over temporally shifted sets of days. In all cases, the correlation is calculated using the Pearson correlation coefficient [ ] between two variables x and y, given by the formula:

r = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )

This coefficient always lies between −1 and +1: a positive value close to +1 indicates that both variables change simultaneously in the same direction, a negative value indicates that they change in opposite directions, and zero denotes no similarity between the variables. In practice, a sufficiently large positive value is treated as a moderately strong positive correlation. Using these concepts, along with the previous observations about the delayed impact of changes in the actual parameters on news sentiment, the maximum positive correlation value over the candidate shifts is used to derive the actual delay. A similar use of correlation is seen in the works by Fu et al. and Zhang et al. [ , ]. From the results table it is evident that, in general, the daily negative sentiment in news correlates more strongly with the number of COVID-19-related deaths world-wide, and that the positive correlation between these variables is maximal when the news negativity indices are taken with a sliding window shifted by a few days; that is, it takes a few days for trends in the number of deaths to have an impact on global news sentiment. A similar shift is confirmed for the global number of cases. This experiment validates the observation about a delayed impact of the numbers of confirmed patients and deaths on news sentiment, and also quantifies that delay on a global scale. In the final part of this experiment, the correlation values and optimal time windows determined above are used to plot time-shifted news sentiment curves along with the daily number of cases and deaths: the news sentiment is plotted at the respective optimal shift for each series to obtain ideally aligned plots. The resulting figures show almost perfect matches in pattern over several intervals of days, though due to differences in scale the variations are not equally spaced; the visible resemblance in variations, including the abrupt spikes, is also noted in the deaths plot.
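The sliding-window procedure described above can be sketched as follows: compute Pearson's r between the case (or death) series and the negativity index shifted n days ahead, and keep the n with the largest r. Function names and the toy series are illustrative, not from the paper.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def best_lag(series, negativity, max_lag=3):
    """Shift the negativity index n days ahead (n = 0..max_lag) and
    return the shift giving the highest Pearson correlation."""
    return max(range(max_lag + 1),
               key=lambda n: pearson(series[:len(series) - n],
                                     negativity[n:]))

cases = [1, 3, 2, 5, 4, 7, 6, 8]   # toy daily-case series
neg = [9, 9, 1, 3, 2, 5, 4, 7]     # tracks cases at a 2-day delay
lag = best_lag(cases, neg)         # best lag is 2 here
```

The returned lag is then read as the number of days it takes for changes in the ground statistics to surface in the news sentiment.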
However, it is a general observation that negativity in the news persists even when the global statistics for both cases and deaths are declining, which can be attributed to other factors identified in the succeeding experiments. It can therefore be said that the negativity index over global news is quite indicative of increases in the number of new cases and deaths during the ongoing pandemic, while declining statistics do not seem to have much effect on the overall negativity. An n-gram is a contiguous sequence of n words from a given sentence or text. In this part of the experiments, the authors determine the most common trigrams occurring in the news during the period of study. This analysis highlights the events, topics, and persons most widely publicized by online global news in relation to the pandemic. The trigrams are listed along with their weighted frequency (the trigram's frequency divided by the total occurrences of the most common trigrams). It is obvious from the table that most of the trigrams concern the pandemic, with massive usage of phrases such as "tested positive coronavirus," "tested positive covid," and "confirmed case covid" in the global news. The news agenda during the studied period revolves around this central theme and involves daily COVID-19-related updates and awareness programs, as deduced from the usage of phrases like "personal protective equipment," "confirmed case covid," "people tested positive," "number covid case" / "number coronavirus case," "social distancing guideline," and "practice social distancing." The crucial and commendable roles played by the World Health Organization, the Centers for Disease Control and Prevention (CDC), Johns Hopkins University, and health-care workers all over the globe in shaping the different challenges and aspects of this pandemic are also prominently visible in the table.
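The trigram tabulation with weighted frequencies (a trigram's count divided by the summed counts of the top-k trigrams) can be sketched as below; the whitespace tokenization and the value of k are assumptions, since the paper does not state its preprocessing in detail.

```python
from collections import Counter

def top_trigrams(texts, k=10):
    """Return the k most common trigrams with weighted frequencies:
    count of the trigram / summed counts of the top-k trigrams."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    top = counts.most_common(k)
    total = sum(c for _, c in top)
    return [(" ".join(tg), c / total) for tg, c in top]

news = ["Tested positive coronavirus today",
        "More people tested positive coronavirus"]
# "tested positive coronavirus" occurs twice and dominates the table.
```

Normalizing by the top-k total, rather than by all trigram occurrences, makes the weights comparable across tables of the same size.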
A remarkable observation is that only three state leaders make it to this list, namely the President of the United States of America (whose name is incidentally in the third most common trigram) and the Prime Ministers of the United Kingdom and India, which emphasizes the prominence they enjoy as world leaders in global news, even in these times of distress. In this last part of the experiments, the observations about the delayed impact of the globally changing counts of affected patients and deaths on news sentiment, as seen in the previous section, are used to identify similar trends for specific countries using the respective correlation values. The study is conducted for four countries, ordered chronologically by when the first virus outbreak occurred in each area; all articles mentioning country X are extracted from online global news to perform the corresponding case study on country X. For this purpose, the authors extract all news articles corresponding to the countries in question from the comprehensive global news corpus for the whole period of the study. In this experiment, the z-score [ ] technique is applied to both variables to normalize the values prior to visualization. The z-score brings values of different variables onto the same scale and is calculated as:

z_i = (x_i − μ) / σ

where x_i denotes the current data element, μ the mean of the variable, and σ its standard deviation. Using this method, the data for each variable are converted to have a mean of 0, so in the following graphical representations all values below the mean denote a decreasing trend and vice versa. A visual analysis of these plots reveals whether the observations are generally applicable across the data from different countries, that is, whether the global news sentiment about a country is actually affected by the daily trends in its number of new cases or deaths.
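The z-score normalization above is one line per value. A minimal sketch, using the population standard deviation since the paper does not specify sample vs. population:

```python
from statistics import mean, pstdev

def z_scores(xs):
    """Standardize a series: z_i = (x_i - mean) / std, producing a
    transformed series with mean 0, so values below 0 are below the
    series mean."""
    m, sd = mean(xs), pstdev(xs)
    return [(x - m) / sd for x in xs]

deaths = [10, 12, 15, 20, 18]   # toy daily-death series
z = z_scores(deaths)            # negative entries lie below the mean
```

Standardizing both series this way lets case counts and a 0-to-1 negativity index share one plot axis, which is exactly why the paper applies it before visualization.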
This is determined by the individual correlation of country-wise statistics with appropriately time-shifted global online news about that country. Scatter plots are generated for the four countries in question. In each pair of plots for a country, perfect or partial overlaps signify only discrete, temporal alignment of the variables and cannot be treated as a measure of continued similarity in trend, which is better judged from a set of data values distributed in parallel. The current virus outbreak is believed to have originated in China as early as December 2019, so the current duration of study witnessed a sharply flattening curve in the number of cases there and, successfully, a complete prevention of further deaths. From the news corpus, only the texts that feature "China" are extracted, along with the corresponding daily sentiment index values. The correlation coefficients determined by the sliding-window approach are quite low and statistically insignificant, as calculated and shown in the corresponding table; in the current context, however, such values still indicate a loosely positive similarity in trends. Remarkably, the number of daily deaths per million in China appears to have an immediate impact on the global news, whereas the number of cases per million takes considerably longer. The highly minimized and flattened death and infection rates are evident from the corresponding figures. It is also seen that, in spite of the flattened curves for cases and deaths, the negativity index values are distinctly high and show a decreasing trend only late in the study period. The corresponding correlation coefficient values indicate a more parallel alignment of points in the first figure (number of cases), while the points are more dispersed around the flattened death curve in the second figure, in spite of multiple overlaps, shown in deep red.
Observations: the observed negativity, though generally aligned, could be due to various other issues evident from the global news related to China (shown in the corresponding table). For instance, the rise in negativity during one early interval of the study relates to particular listed news articles, while other listed articles attest to the decline in negativity that follows; similarly, the high negativity around a later interval can be attributed to a further set of articles, with the succeeding positivity reinforced by yet others. It is therefore evident that the global news agenda related to China is mostly motivated toward driving an overall negative image of the country and its actions during the ongoing pandemic. The outbreak spread to the USA in late January, and a substantial part of the pandemic's effect on the news is observable in this case. As in the previous case, the results table shows that the number of confirmed cases has more impact on negative sentiment in the news about the USA, at a delay of a few days, with a lower impact of the number of deaths at a somewhat longer overall delay. The overall correlation is weakly positive for both pairs of variables. The spread of both the number of cases and the number of deaths in the case of the USA resembles a bell curve for the current duration of study, with gradually increasing values up to a mid-study peak and an opposite trend thereafter. The corresponding figures show that toward the later half of the studied duration the overall numbers of confirmed cases and deaths follow a decreasing trend (more data points below the mean), whereas negative sentiment persists and even increases. Observations: apart from the effect of COVID-19-related statistics, various media reports citing the anti-China sentiment of the country's President, as well as governmental decisions, appear to have influenced the news sentiment; a set of such news articles is provided in a table, while the prominence of the US President in global news is already established by the trigram analysis.
The high amount of negativity during the initial days of the study may be an effect of certain listed articles, while the decreasing negativity thereafter may be due to the events that other listed articles correspond to. Similarly, a spell of positive sentiment aligns with one article, whereas the succeeding rapid rise in negativity (in spite of a drop in COVID-19 cases and deaths) can be attributed to events highlighted by further articles. As with the observations regarding China, the agenda of global online news is driven more by the different socio-political activities concerning the country.

[Table excerpt: sample US news articles.
-"A large number of physicians across the country signed a letter urging President Donald Trump to keep social distancing practices in place after he said he wants to reopen businesses by Easter. 'Significant COVID-19 transmission continues across the United States, and we need your leadership in supporting science-based recommendations on social distancing that can slow the virus,' the letter, released by the Council of Medical Specialty Societies, said. 'Our societies have closely adhered to these measures by moving our staff to full-time telework and canceling in-person meetings (including annual meetings). These actions have helped to keep physicians and other health professionals in health care facilities, including hospitals, and reduce the risk of spreading COVID-19.'"
-"Health care workers say that they are being asked to reuse and ration disposable masks and gloves. A shortage of ventilators, crucial for treating serious COVID-19 cases, has also become critical, as has a lack of test kits to comply with the World Health Organization's exhortations to test as many people as possible. In the United States, a fierce political battle over ventilators has emerged, especially after President Donald Trump told state governors that they should find their own medical equipment if they think they can get it faster than the U.S. government."
-"President Donald Trump signed into law the unprecedented multi-trillion-dollar economic stimulus package Friday, capping a week that saw markets yo-yo as recession concerns grew worldwide. Now that the package has been signed into action, attention turns to how quickly the U.S. Treasury and other departments can distribute checks to individual Americans and businesses grappling with the ongoing effects of COVID-19. It could prove to be a Herculean effort to flood the money into the economy quickly enough to prevent more job losses and businesses going under."
-"President Donald Trump signed an unprecedented economic rescue package into law after swift and near-unanimous action by Congress to support businesses, rush resources to overburdened health care providers, and help struggling families during the deepening coronavirus epidemic. Acting with unity and resolve unseen since the 9/11 attacks, Washington moved urgently to stem an economic free fall caused by widespread restrictions meant to slow the spread of the virus that have shuttered schools, closed businesses and brought American life in many places to a virtual standstill. 'This will deliver urgently needed relief,' Trump said as he signed the bill Friday in the Oval Office, flanked only by Republican lawmakers."]

Italy is one of the countries most badly affected by the COVID-19 virus outbreak. During our period of study, both the death count and the number of confirmed cases are seen to be gradually declining. The global news articles that feature "Italy" are extracted along with the corresponding sentiment category of each article for this experiment. As in the previous experiments, a correlation study is undertaken to assess the impact of death and infection statistics on news sentiment.
This helps to determine the extent to which the news sentiment reflects the ground reality, by considering shifts of one day at a time up to several days. The results of the study for Italy are shown in a table: the COVID-19 situation in Italy has its maximum impact on global news at a lag of several days, though the correlation remains high across neighboring lags. Accordingly, aligned scatter plots are generated using the z-scored, normalized values. Evidently, from the table and figures, there is a higher correlation between the deaths in Italy and the negativity index in global news than between the number of infected cases and the negativity index, although both variables show a comparatively strong correlation with negative news sentiment. This can also be observed in the higher number of complete and partial overlaps, as well as the gradually decreasing dispersion of the negativity in proportion to the parametric values of confirmed cases or deaths.

[Table excerpt: sample news article. "President Donald Trump attacked the United Nations health body as a Chinese 'puppet' on Monday and confirmed he is considering slashing or cancelling US support. 'They're a puppet of China, they're China-centric to put it nicer,' he said at the White House. Trump said the United States pays the largest annual contribution of any country to the World Health Organization. Plans are being crafted to slash this because 'we're not treated right. They gave us a lot of bad advice,' he said of the WHO."]

Observations: given the determined strong correlation, it can be concluded that COVID-19 statistics are the most effective driver of global news sentiment regarding Italy. A small set of relevant news articles is nevertheless provided in a table. Although the first confirmed COVID-19 case in India was noted at almost the same time as in Italy, the rising effect of the outbreak is quite clear in our studied time period.
The study reveals interesting results: both the number of affected cases and the number of deaths increase steadily during the time period considered. The correlation coefficients determined with the shifted negativity index are given in a table. Surprisingly, the correlations are all negative, indicating that the rising deaths and spread of COVID-19 in India have a very weak effect on global news sentiment about India. Given that the study intends to determine the similarity in trends of news sentiment and death or infection statistics, the least negative correlation coefficient values are selected for visualizing the trends, noted at the same delay of a few days in each case. A notable fact is that, statistically, this minimal negative value indicates almost no correlation. The same is depicted in the scatter plots, where the negativity index values are highly dispersed and even show a decreasing trend in the later half of the study, in spite of the steep climb of the actual statistics. As noted in the sentiment analysis experiment, neutral news plays a minimal role in the global scenario, and it should be even further minimized at the country level. A possible inference is that the negative sentiment in global news based on "India" is minimized so as to prevent panic among the huge population, or that the global news is not really representative of the COVID-19 statistics alone in the Indian context. Observations: the lack of proper correlation suggests that the news agenda is influenced by many factors other than COVID-19 during our period of study. A table highlights some of the problems that were initially a cause of the massive negativity in news sentiment in spite of the minimal rate of COVID-19 infection.
This covers several socio-economic aspects of Indian life during this crisis, and the analysis and discussion of such observations could itself be articulated as a full-fledged study of the agenda-setting policies of online news media.

[Table excerpt: sample India news articles.
-"Mumbai police Friday arrested three men for allegedly storing bottles of hand sanitiser, worth an estimated sum in lakhs of rupees, at a flat in Mahim and illegally selling them above their maximum retail prices. The crime branch raided the flat after it received information that bottles of hand sanitiser were being sold well above the MRP."
-"The National Commission for Women (NCW) has received numerous complaints since the country-wide lockdown was imposed to control the spread of coronavirus, a substantial share of which were cases of domestic violence, which it said has been increasing since then. Since the lockdown was imposed, complaints related to various offences against women were received, many of them related to domestic violence, the data released by the NCW showed. NCW chairperson Rekha Sharma said the number of cases of domestic violence must be much higher, but the women are scared to complain due to the constant presence of their abuser at home."
-"Kumar said his union is in touch with many families who need rations urgently, having lost incomes for over a week now. For lakhs of migrant workers in Maharashtra, lack of clear information has continued to cause anxiety, especially after the Centre and state governments issued instructions Sunday to prevent them from attempting to return to their native places, their biggest concern being accessible accommodation and food for the remainder of the lockdown period. 'What's going to happen will be reminiscent of the Bengal famine.'"
-"'While no definite conclusion can be drawn, this is probably due to the circumspection on the part of victims in reporting such incidents due to the presence of the perpetrators in the house and the fear of further violence if such attempts to report were made known to the perpetrator,' the commission had said. It had also said that cases of molestation, sexual assault, rape, kidnapping, and stalking have decreased manifold, presumably since a large number of these incidents take place outside the domestic setting and by third parties. AICHLS in its plea has contended that incidents of domestic violence and child abuse have gripped not only India but countries such as Australia, the UK and the USA, and reports suggest that countries are witnessing a horrific surge in domestic violence cases."
-"The video showed migrant workers sitting on the roadside in full clothes, including women, while water jets were showered on them through fire tenders by men in white protective kit. In the video, one of the officials is heard asking the migrants to keep their eyes shut."
-"After the lockdown announcement, the badli workers in West Bengal's jute mills are the worst affected of the lot. 'We may survive corona but not hunger': Bengal's daily wage workers struggle for survival. In India, thousands of workers are lining up twice a day for bread and fried vegetables to keep hunger at bay."
-"Job loss, pay cuts worry Indians the most during lockdown: survey. A large share of Indians is now worried about losing his or her job as the coronavirus pandemic has shut industries and businesses in India, a new survey warned on Wednesday. According to the survey conducted by YouGov, an internet-based market research and data analytics firm, Indians worry about the economic impact of the virus, such as losing their jobs, getting a pay cut, or not getting a bonus or increment this year."]

The proposed work addresses the challenge of identifying the general sentiment in globally published news articles as an effect of the ongoing pandemic, using both unsupervised and transfer-learning-based approaches on comprehensive data gathered over a fixed period of time. A statistical study is also undertaken to determine the impact of variations in the numbers of affected patients and deaths due to the COVID-19 virus on the news sentiment at a global scale. The same study is repeated for selected countries and the sentiment of the global news pertaining to the effect of COVID-19 in those countries, by considering normalized values of all variables. The observations are substantiated by an n-gram analysis that highlights the most prominent trigrams, or three-word phrases, used in online news globally. The strongest correlation between news sentiment and COVID-19 statistics exists for Italy, which is almost identical to the observation for news and statistics on a global scale. The authors also utilize sets of relevant news articles to substantiate the observations made during the case studies. They determine that negativity is a predominant sentiment in global news, and that COVID-19-related real-world statistics, agenda setting by news agencies, and various social factors (such as job losses and migrant-worker problems) and political factors (such as the continued tussle between the Presidents of the USA and China) drive the negativity in online news quite strongly, which could have long-standing effects on the mental health of the news audience. The results raise relevant questions and, consequently, a plethora of computational and social research challenges. Such studies will be useful in determining the long-standing psychological effects of news sentiment on mental health in a pandemic situation, the representation of regional challenges in online global news, news media agenda setting, and so on.
In future, the authors wish to extend this work by utilizing country-specific news data in the respective national official languages, which will aid further fine-grained analysis.

References:
-Q&A on coronaviruses (COVID-19). Accessed online.
-Time spent on watching TV, with smartphone, rises as people stay indoors: BARC data.
-The psychological effects of TV news.
-What constant exposure to negative news is doing to our mental health.
-Health experts on the psychological cost of COVID-19.
-Sentiment analysis of English tweets: a comparative study of supervised and unsupervised approaches.
-Mining sentiment classification from political web logs.
-Mastering Machine Learning with scikit-learn.
-Sentiment analysis of textual reviews: evaluating machine learning, unsupervised and SentiWordNet approaches.
-Thumbs up? Sentiment classification using machine learning techniques.
-Sentiment analysis of movie review comments.
-Deep convolution neural networks for Twitter sentiment analysis.
-UNITN: training deep convolutional neural network for Twitter sentiment classification.
-Unsupervised learning of semantic orientation from a hundred-billion-word corpus.
-Unsupervised method for sentiment analysis in online texts.
-Creating emoji lexica from unsupervised sentiment analysis of their descriptions.
-A framework for sentiment analysis in Turkish: application to polarity detection of movie reviews in Turkish.
-SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining.
-AFINN. Richard Petersens Plads.
-Twitter, MySpace, Digg: unsupervised sentiment analysis in social media.
-A comparative study on Twitter sentiment analysis: which features are good?
-ValenTo: sentiment analysis of figurative language tweets with irony and sarcasm.
-Character-to-character sentiment analysis in Shakespeare's plays.
-Word2vec and doc2vec in unsupervised sentiment analysis of clinical discharge summaries.
-Bias-aware lexicon-based sentiment analysis.
-Using text mining and sentiment analysis for online forums hotspot detection and forecast.
-Opinion mining on large-scale data using sentiment analysis and k-means clustering.
-Sentiment analysis with global topics and local dependency network.
-Text sentiment analysis method combining LDA text representation and GRU-CNN. Personal and Ubiquitous Computing.
-Sentiment analysis and the complex natural language.
-Stock price prediction using news sentiment analysis.
-Sentiment analysis on English financial news.
-FineNews: fine-grained semantic sentiment analysis on financial microblogs and news.
-Market trend prediction using sentiment analysis: lessons learned and paths forward.
-Enhanced news sentiment analysis using deep learning methods.
-Discovering the correlation between stock time series and financial news.
-Time series analysis on stock market for text mining correlation of economy news.
-Trading strategies to exploit blog and news sentiment.
-Agenda-setting.
-The agenda-setting function of mass media.
-Problems and opportunities in agenda-setting research.
-After disaster: agenda setting, public policy, and focusing events.
-The dynamics of public attention: agenda-setting theory meets big data.
-Agenda-setting effects of business news on the public's images and opinions about major corporations.
-Coronavirus source data.
-Now live, updating & expanded: a new dataset for exploring the coronavirus narrative in global online news.
-Idiot's Bayes: not so stupid after all?
-Learning word vectors for sentiment analysis.
-Gaussian distribution. Proceedings of the Royal Society of London.
-Advanced Engineering Mathematics.

On behalf of all authors, the corresponding author states that there is no conflict of interest.

key: cord- -xvc wx authors: Wink, Michael title: Allelochemical Properties or the Raison d'Être of Alkaloids date: - - journal: nan doi: . /s - ( ) - cord_uid: xvc wx

This chapter provides evidence that alkaloids are not waste products or functionless molecules, as formerly assumed, but rather defense compounds employed by plants for survival against herbivores, microorganisms, and competing plants. These molecules were shaped during evolution through natural selection in that they fit many important molecular targets of cells, often receptors, as seen in molecules that mimic endogenous neurotransmitters. The chapter discusses how microorganisms and herbivores rely on plants as a food source; since both have survived, there must be mechanisms of adaptation toward the defensive chemistry of plants. Many herbivores have evolved strategies to avoid the extremely toxic plants and prefer the less toxic ones, and many have potent mechanisms to detoxify xenobiotics, which allow the exploitation of at least the less toxic plants. Among insects, many specialists evolved that are adapted to the defense chemicals of their host plant, in that they accumulate these compounds and exploit them for their own defense. Alkaloids function as defense molecules against insect predators in the examples studied, and this is further support for the hypothesis that the same compounds also serve for chemical defense in the host plant.
More experimental data are needed to understand fully the intricate interconnections between plants, their alkaloids, and herbivores, microorganisms, and other plants. We must also consider that plants compete with other plants (of the same or different species) for light, water, and nutrients. How do plants defend themselves against microorganisms (including bacteria, fungi, and viruses), herbivores, and plants? Because plants do rather well in nature, this question has often been overlooked. We are well aware of the defensive strategies of higher animals against microbes and predators ( ). The complex immune system, with its cellular and humoral components, is a well-studied area in the context of vertebrate-microbe interactions. Against predating animals, nature evolved weapons, armor, crypsis, thanatosis, deimatic behavior, aposematism, flight, or defense chemicals (usually called "poisons") ( ). It is evident that most of these possibilities are not available to plants, with their sessile and "passive" life-style. What, then, is their evolutionary solution? We can distinguish the following defense mechanisms in plants ( ); the mechanisms are not independent and may act cooperatively and synergistically, and many species have additionally evolved specialized traits in this context.

1. Mechanical protection is provided by thorns, spikes, trichomes, glandular hairs, and stinging hairs (which are often supported by defense chemicals).
2. Formation of a thick bark on roots and stems can be considered a sort of armor, and the presence of hydrophobic cuticular layers forms a penetration barrier directed against microbes.
b. Cell walls are biochemically rather inert, with reduced digestibility to many organisms because of their complex cellulose, pectin, and lignin molecules. Callose and lignin are often accumulated at the site of infection or wounding ( ) and form a penetration barrier.
c.
synthesis of inhibitory proteins (e.g., lectins, protease inhibitors) or enzymes (e.g., chitinase, lysozyme, hydrolases, nucleases) that could degrade microbial cell walls or other microbial constituents would be protective, as would synthesis of peroxidase and phenolase, which could help inactivate phytotoxins produced by many bacteria and fungi. these proteins are either stored in the vacuole or are secreted as exoenzymes into the cell wall or the extracellular space ( , ). these compounds are thus positioned at an "advanced and strategically important defense position." in addition, storage proteins (of cereals and legumes) are often deficient in particular essential amino acids, such as lysine or methionine. d. as a widely distributed and important trait, secondary metabolites with deterrent/repellent or toxic properties against microorganisms, viruses, and/or herbivores may be produced ( - , - ). these allelochemicals can be constitutively expressed, they may be activated by wounding (e.g., cyanogenic glycosides, glucosinolates, coumaryl glycosides, alliin, ranunculin), or their de novo synthesis may be induced by elicitors (so-called phytoalexins), infection, or herbivory ( , , ). these products are often synthesized and stored at strategically important sites [epidermal tissues or cells adjacent to an infection ( , )] or in plant parts that are especially important for reproduction and survival [flowers, fruits, seeds, bark, roots ( , , )]. in animals, we can observe the analogous situation in that many insects and other invertebrates (especially those which are sessile and unprotected by armor), but also some vertebrates, store secondary metabolites for their defense; these are often similar in structure to plant allelochemicals ( , , , , , - ). in many instances, the animals have obtained the toxins from their host plants ( , , , , - ). 
hardly any zoologist or ecologist doubts that the principal function of these secondary metabolites (which are often termed "toxins" in this context) in animals is that of defense against predators or microorganisms ( , , , ). these defense compounds are better known as natural products or secondary metabolites. the latter expression originally meant compounds which are not essential for life, and thus distinct from primary metabolites ( , , ). unfortunately, the term "secondary" also has a pejorative meaning, indicating perhaps that the compounds have no importance for the plant. as discussed in this chapter, just the opposite is true. more than , natural products have been reported from plants so far ( , , ). owing to the sophistication of phytochemical methods, such as chromatography (hplc, glc) and spectroscopy (nmr, ms), new products are reported at rapid intervals. because only - % of all higher plants, which comprise over , species, have been analyzed phytochemically in some detail, the real overall number of secondary products is certainly very large. it is a common theme that an individual plant does not produce a single natural product, but usually a moderate number of major metabolites and a larger number of minor derivatives. within a taxon, secondary metabolites often share a common distribution pattern and are therefore of some importance for phytochemical systematics. classic taxonomy, however, has taken little account of alkaloid distribution: if the same alkaloid is present in two plants of the same taxon, this is interpreted as evidence for a relationship, but its occurrence in two plants of nonrelated taxa is taken as evidence of independent evolution. because secondary metabolites are also derived characters that were selected during evolution, their general value for taxonomy and systematics is certainly smaller than formerly anticipated ( ). 
for many years, secondary metabolites were considered to be waste products or otherwise functionless molecules, merely illustrating the biochemical virtuosity of nature ( , ). errera and stahl ( , , ) published the idea that natural products are used by plants for chemical defense against herbivores. since the leading plant physiologists of that time were mostly anti-darwinian, they were not willing to accept the defense argument, which was too much in line with the darwinian concept. therefore, this early defense concept was negated and remained forgotten for nearly years. fraenkel ( ) reopened the debate in a review article and presented new data supporting the view that secondary metabolites serve as chemical defense compounds against herbivores. during the next three decades this concept was refined experimentally, and we can summarize the present situation as follows. although the biological function of many plant-derived secondary metabolites has not been studied experimentally, it is now generally assumed that these compounds are important for the survival and fitness of a plant and that they are not useless waste products, as was suggested earlier in the twentieth century ( , ). in many instances, there remains a need to analyze whether a given compound is active against microorganisms (viruses, bacteria, fungi), against herbivores (molluscs, arthropods, vertebrates), or against competing plants (so-called allelopathy). in some instances, additional functions are the attraction of pollinating or seed-dispersing animals, for example, by colored compounds such as betalains (within the centrospermae), anthocyanins, carotenoids, and flavonoids, or by fragrances such as terpenes, amines, and aldehydes ( , ). physiological roles, such as uv protection [by flavonoids or coumarins ( , )], nitrogen transport or storage ( , , ), or photosynthesis (carotenoids), may be additional functions. 
allelochemicals are often not directed against a single organism, but generally against a variety of potential enemies, or they may combine the roles of both deterrents and attractants (e.g., anthocyanins and many essential oils can be attractants in flowers but are also insecticidal and antimicrobial). thus, many natural products have multiple functions, a fact which is easily overlooked since most scientists usually specialize on a narrow range of organisms (i.e., a microbiologist will usually not check whether an antibiotic alkaloid also deters the feeding of caterpillars). to understand all the interactions we need to adopt a holistic, that is, interdisciplinary, approach. it might be argued that the defense hypothesis cannot be valid since most plants, even those with extremely poisonous metabolites (from the human point of view), are nevertheless attacked by pathogens and herbivores. however, we have to understand and accept that chemical defense is not an absolute process. rather, it constitutes a general barrier which will be effective in most circumstances, that is, most potential enemies are repelled or deterred. plants with allelochemicals at the same time represent an ecological niche for potential pathogens and herbivores. during evolution a few organisms have generally been successful in specializing toward that niche (i.e., in a particular toxic plant) in that they found a way to sequester the toxins or become immune to them ( , , ) . this is especially apparent in the largest class of animals, the insects (probably with several million species on earth), which are often highly host plant specific. the number of these "specialists" is exceedingly small for a given plant species as compared to the number of potential enemies that are present in the ecosystem. 
we can compare this situation with our immune system: it works against the majority of microorganisms but fails against a few viruses, bacteria, fungi, and protozoa, which have overcome this defense barrier by clever strategies. nobody would call the immune system and antibodies useless because of these few adapted specialists! we should adopt the same argument when we consider the defense of plants by secondary metabolites ( ). since secondary metabolites have evolved in nature as biologically active compounds with particular properties in other organisms, many of them are useful to mankind as pharmaceuticals, fragrances, flavors, colors, stimulants, or pesticides. in addition, many allelochemicals provide interesting lead structures that organic medicinal chemists can develop into new and more active compounds. about - % of higher plants accumulate alkaloids ( , ). the incidence of alkaloid production varies between taxa to some degree; for example, about - % of species of the solanaceae and apocynaceae are alkaloidal, whereas other families contain few alkaloid-producing species. some alkaloids have a wide distribution in nature: caffeine occurs in the largest number of families, lycorine in the largest number of genera, and berberine in the largest number of species. alkaloids are not restricted to higher plants (although they are most numerous there); they are also present in club mosses (lycopodium), horsetails (equisetum), fungi, and animals such as marine worms (e.g., nereidae), bryozoans, insects (e.g., coccinellidae, solenopsidae), amphibians (toads, frogs, salamanders), and fishes. alkaloids thus represent one of the largest groups of natural products, with over , known compounds at present, and they display an enormous variety of structures, which is due to the fact that several different precursors find their way into alkaloid skeletons, such as ornithine, lysine, phenylalanine, tyrosine, and tryptophan ( , , ). 
in addition, part of the alkaloid molecule can be derived from other pathways, such as the terpenoid pathway, or from carbohydrates ( , , ). whereas the structure elucidation of alkaloids and the exploration of alkaloid biosynthetic pathways have always commanded much attention, there are relatively few experimental data on the ecological function of alkaloids. this is the more surprising since alkaloids are known for their toxic and pharmacological properties, and many are potent pharmaceuticals. alkaloids were long considered to be waste products [even by eminent alkaloid researchers such as w. o. james and kurt mothes ( , , )]. because nitrogen is a limiting nutrient for most plants, a nitrogenous waste product would be a priori unlikely. the waste product argument probably came from animal physiology: carnivorous animals take up relatively large amounts of proteins and nucleic acids, containing more nitrogen than needed for metabolism, which is consequently eliminated as uric acid or urea. a similar situation or need, however, does not apply to plants. in fact, many plants remobilize their nitrogenous natural products (including alkaloids) from senescing organs such as old leaves ( , , ). if alkaloids were waste products, we would expect the opposite, namely, accumulation in old organs which are shed. on the other hand, the alkaloids produced by animals were never considered to be waste products by zoologists, but rather were regarded as defense chemicals ( , , ). thus, the more plausible hypothesis is that alkaloids of plants, microorganisms, and animals, like other allelochemicals, serve as defense compounds. this idea is intuitively straightforward, because many alkaloids are known as strong poisons for animals and homo sapiens. as a prerequisite for an alkaloid to serve as a chemical defense compound, we should demand the following criteria. (1) the alkaloid should have significant effects against microbes and/or animals in bioassays. 
(2) the compound should be present in the plant at concentrations of the same order as (or, better, even higher than) those determined in the bioassays. (3) the compound should be present in the plant at the right time and the right place. (4) evidence should be provided that a particular compound is indeed important for the fitness of a plant. although more than , alkaloids are known, only a few ( - %) have been analyzed for biochemical properties, and even fewer for their ecophysiological roles. in most phytochemical studies only the structures of alkaloids have been elucidated, so that often no information is available on their concentrations in the different parts and throughout the ontogenetic development of a plant, or on their biological activities. furthermore, the corresponding studies were usually designed to find useful medicinal or sometimes agricultural applications of alkaloids, not to elucidate their evolutionary or ecological functions. these objections have to be kept in mind, because an alkaloid is sometimes termed "inactive" in the literature, which usually means less active than a standard compound already established as a medicinal compound (such as penicillins in antimicrobial screenings). in many medicinal experiments relatively low doses are applied because of the toxic properties of many alkaloids. if the same compound had been tested at the relevant (which normally means elevated) concentrations that are present in the plant, an ecologically relevant activity might have been detected. another restriction is that the activities of alkaloids have been tested with organisms that are sometimes irrelevant for plants but medicinally important. however, if a compound is active against escherichia coli, it is likely that it is also active against other gram-negative and plant-relevant bacteria. 
nevertheless, most of the data obtained in these studies (tables i-viii) provide important information which at present permits extrapolation to the function of alkaloids in plants. in this chapter the focus is on the biological activity of alkaloids (the information available on the pharmacological properties of alkaloids is mostly excluded), and we try to discuss these data from an ecological perspective. in the following, the possible functions of alkaloids in plant-animal, plant-plant, and plant-microbe interactions are discussed in more detail. it is nearly impossible to cover the literature exhaustively. therefore, an overview of the allelochemical properties of alkaloids is presented. because of the large amount of data (literature up to is included), the selection of examples must remain subjective to some degree. nevertheless, the author would be grateful to receive information or publications about relevant omissions. because homo sapiens and domestic animals are to some degree herbivores, a large body of empirical knowledge has accumulated on the toxic properties of alkaloids (tables i through v) and alkaloid-containing plants. previously, the toxicity of alkaloids to vertebrates was part of the definition (as a common denominator) of this group of natural products ( , ). in the following, the toxic or adverse effects of alkaloids are discussed separately for invertebrates (mainly insects) and vertebrates. among the invertebrates, insects have been extremely successful from the evolutionary point of view, and they form the largest class of organisms on our planet as far as the numbers of both individuals and species are concerned. entomologists estimate that the number of insect species is at least million, but tropical rain forests may harbor up to - million species, many of which are still unknown and, owing to the fast destruction of this ecosystem, will probably disappear without having been discovered and studied by scientists. 
most insects are herbivores, and adaptation to host plants and their chemistry is often very close and complex ( , , , , , - , , ). whereas insects rely on plants for food, many plants need insects for pollination and seed dispersal. in the latter context we often find that plants attract insects by chemical means (colors, fragrances, sugars, amino acids). at the same time, other secondary metabolites are employed to discourage feeding on flowers and seeds. the close association between plants, especially the angiosperms, and insects evolved during the last million years. some scientists have called this phenomenon a "coevolutionary" process, but it has to be recalled that the associations seen today are not necessarily those in which the chemical interactions originally evolved ( , , ). applications of synthetic insecticides have shown that resistance to these new compounds can arise rapidly, sometimes within only a dozen generations. adaptation times can also be much longer, however. if plant species are introduced to a new continent or island, it usually takes a long time before new pathogens or herbivores become adapted and specialized to the new species. for example, lupinus polyphyllus from north america has a number of specialized herbivores there, but is rarely attacked by herbivores in europe. this lupine left its enemies behind when it was transferred to europe three centuries ago. about years ago, however, the north american lupine aphid (macrosiphum albifrons) was introduced to europe accidentally. this aphid is specialized on alkaloid-rich lupines with lupanine as a major alkaloid. at present, the aphid has spread over most of europe and is now colonizing its former host, l. polyphyllus ( , ). insect herbivores can be divided into two large groups whose strategies with respect to the plant's defense chemistry differ substantially ( ). 
the polyphagous species can exploit a wide range of host plants, whereas the mono-/oligophagous insects are often specialized on one or a small number of (often systematically related) hosts. polyphagous insects, namely, species which feed on a wide variety of food plants, are usually endowed with powerful olfactory receptors ( ) that allow them to distinguish between plants with high or low amounts of "toxins." the receptors also allow insects to ascertain the quality of the essential products present, such as lipids, proteins, or carbohydrates ( ). these "generalists," as we can also call this subgroup of herbivores, are usually deterred from feeding on plants which store especially noxious metabolites and select those with less active ones (such as our crop species, in which man has bred away many of the secondary metabolites that were originally present; see table xi). alternatively, they change host plants rapidly and thus avoid intoxication. in addition, most polyphagous species have evolved active detoxification mechanisms, such as microsomal oxidases and glutathione peroxidase, which lead to the rapid detoxification and elimination of dietary secondary products ( , , , ). in contrast, mono- and oligophagous species often select their host plants with respect to the composition of the nutrients and secondary metabolites present. for these "specialists" the originally noxious defense compounds are often attractive feeding and oviposition stimulants. these insects either tolerate the natural products or, more often, actively sequester and exploit them for their own defense against predators or for other purposes ( , , - , , , , , - ). these observations seem to contradict the earlier statement that secondary metabolites are primarily defense compounds, and a number of renowned authors have fallen into this logical pit, such as mothes ( ) and robinson ( ). however, these specialized insects are exceptions to the general rule. 
for these specialists, the defense chemistry of the host plant is usually not toxic, but they are susceptible to the natural toxins of non-host plants ( ). as compared to the enormous number of potential herbivores, the number of adapted monophagous species is usually very small for a particular plant species. quite a number of alkaloids have been tested against herbivorous insects (table i). in general, many alkaloids act as feeding deterrents at higher concentrations (>1%, w/w). given the choice, insects tend to select a diet with no or only a small dose of alkaloids. also, specialists avoid most "toxins" except those of their host plants. these data indicate that under natural conditions plants with a high content of alkaloids should be safe from most herbivorous insects, with the exception of particular monophagous species or a few very potent polyphagous ones. if insects have no choice or if they are very hungry, the deterrence threshold value is much reduced, and they often feed on a diet with alkaloids that they would normally avoid ( , ). in this case we have the chance to test the toxicity of an ingested alkaloid. if insects do not take up alkaloid-containing food, alkaloid toxicity can be assessed to some degree by topical application or by injection (table i). as can be seen from table i, a substantial number of alkaloids display significant insect toxicity, including nicotine, piperine, lupine alkaloids, caffeine, gramine, strychnine, berberine, ephedrine, and steroidal alkaloids. only the specialists can tolerate the respective alkaloids. the tobacco hornworm (manduca sexta), for example, can grow on a diet with more than % nicotine without any adverse effects. most of the nicotine is either degraded or directly eliminated via the malpighian tubules and in the feces ( ). 
because nicotine binds to the acetylcholine (ach) receptor, it is likely that in manduca this receptor has been modified in such a way that ach can still bind, but not nicotine (so-called target site modification). the toxic effects of alkaloids in insects (table i) can be caused by their interference with diverse cellular and intracellular targets. since most mechanisms have not yet been elucidated for insects, this issue is discussed below in the section on vertebrate toxicity (see table iv). with some caution we can extrapolate to insect toxicity. because homo sapiens and domestic animals are largely herbivores, a voluminous body of information on the adverse effects of secondary metabolites has accumulated over the centuries. many allelochemicals and alkaloids are feeding deterrents for vertebrates, owing to their bitter or pungent taste or bad smell, and a foul-smelling, bitter, or pungent diet is instinctively avoided. examples of bitter alkaloids (at least for man) are quinine, strychnine, brucine, and sparteine, and of pungent alkaloids, capsaicin and piperine. it should be recalled that these taste properties are not identical for all animals. for example, geese, which are obligate herbivores, hardly avoid food with alkaloids or smelly compounds (amines, mercaptoethanol) that man would hardly touch ( ). conversely, fragrances that are attractive to us are highly repellent to geese ( ). even within a given population, taste can differ significantly. it has been observed that a substantial proportion of homo sapiens cannot detect the smell of hcn, whereas others are highly sensitive. furthermore, olfactory sensitivity can differ with age, sex, and hormonal cycles. bitterness varies with the chemical structure of an alkaloid. for the quinolizidine alkaloids (qas) the following scale was assessed for man: mean detection levels are . % for sparteine, . % for lupanine, and . % for hydroxylupanine ( ). 
whereas we know a few parameters of olfactory qualities in homo sapiens, often much less or hardly anything is known for most other vertebrates. alkaloids are famous for their toxic properties in vertebrates, and plants that produce alkaloids are often classified by man as poisonous or toxic plants. for a number of alkaloids the respective ld(50) values have been determined with laboratory animals, especially mice, but also rats, guinea pigs, cats, rabbits, dogs, or pigeons. table i presents an overview for alkaloids, including the very poisonous alkaloids aconitine, coniine, atropine, brucine, curarine, ergocornine, physostigmine, strychnine, colchicine, germerine, veratridine, cytisine, delphinine, and nicotine. toxicity is usually highest if the alkaloids are applied parenterally [intravenously (i.v.), intraperitoneally (i.p.), or subcutaneously (s.c.)] as compared to oral application [per os (p.o.)]. also, some of the alkaloids which are made or stored by animals are strong vertebrate poisons, including batrachotoxin, batrachotoxinin a, anabasine, glomerine, maitotoxin, nereistoxin, palytoxin, saxitoxin, and tetrodotoxin ( , , , ). although the general toxicity of alkaloids differs from species to species, the data in table i generally show that many alkaloids are more or less toxic to vertebrates. the toxic effects observed with intact animals have their counterpart in the cytotoxic effects which have been recorded for nearly alkaloids (table ). these data have been obtained by screening many natural products for anticancer activity. however, an alkaloid that can kill a cancer cell is usually also toxic to "normal" cells. therefore, the data shown in table i are another indication of the general toxicity of alkaloids toward animals. because this toxicity applies also to herbivores, the production of alkaloids by plants can certainly be interpreted as a potent antiherbivore mechanism. 
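the paragraph above refers to ld(50) values (the dose lethal to 50% of test animals) determined in laboratory animals. as a minimal, hypothetical sketch, not taken from the chapter, an ld(50) can be read off a dose-mortality series by interpolating between the two bracketing doses on a logarithmic dose scale; real toxicological studies fit probit or logit regression models instead. the function name and the example mortality data are illustrative assumptions:

```python
import math

def estimate_ld50(doses_mg_per_kg, mortality_fractions):
    """Estimate an LD50 by linear interpolation on a log-dose scale.

    Doses must be sorted in ascending order; mortality fractions are in [0, 1].
    This is a didactic approximation, not a substitute for probit analysis.
    """
    pairs = list(zip(doses_mg_per_kg, mortality_fractions))
    for (d_lo, m_lo), (d_hi, m_hi) in zip(pairs, pairs[1:]):
        if m_lo <= 0.5 <= m_hi and m_hi > m_lo:
            # interpolate between the bracketing doses on a log10 scale
            frac = (0.5 - m_lo) / (m_hi - m_lo)
            log_ld50 = math.log10(d_lo) + frac * (math.log10(d_hi) - math.log10(d_lo))
            return 10 ** log_ld50
    raise ValueError("mortality data do not bracket 50%")

# hypothetical i.p. mortality data for an unnamed alkaloid in mice
print(round(estimate_ld50([1, 3, 10, 30], [0.0, 0.2, 0.8, 1.0]), 2))
```

the log-scale interpolation reflects the standard convention that dose-response curves are roughly sigmoidal in log dose.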
for a number of alkaloids the mechanisms underlying the toxic effects have already been elucidated in some detail. we can distinguish molecular targets and processes that are important for all cells, such as synthesis of dna, rna, and proteins, replication, transcription, translation, membrane assembly and stability, electron chains, or metabolically important enzymes and proteins, including receptors, hormones, and signal compounds (table iv). in the following we discuss some of these toxic effects. a. cellular targets: nucleic acids. dna, the macromolecule which holds all the genetic information for the life and development of an organism, is a highly vulnerable target. it is not surprising that a number of secondary metabolites have been selected during evolution which interact with dna or dna-processing enzymes. some alkaloids bind to or intercalate with dna/rna (table iv) and thus affect replication or transcription, or cause mutations, leading to malformations or cancer (table v): -methoxyellipticine, dictamnine, ellipticine, harmane alkaloids, melinone f, quinine and related alkaloids, skimmianine, avicine, berberine, chelerythrine, coptisine, coralyne, fagaronine, nitidine, sanguinarine, pyrrolizidine alkaloids (pas), cycasin, olivacine, etc. many of the intercalating molecules are planar, hydrophobic molecules that fit within the stacks of at and gc base pairs. other alkaloids act at the level of dna and rna polymerases, such as vincristine, vinblastine, avicine, chelilutine, coralyne, fagaronine, nitidine, amanitine, hippeastrine, and lycorine, thus impairing the processes of replication and transcription. whereas these toxins usually cause a rapid reaction, some alkaloids cause long-term effects in vertebrates in that they are mutagenic or carcinogenic (table v). besides basic data obtained in salmonella or drosophila, there are a few reports which illustrate the potent mutagenic effects of alkaloids on vertebrates. 
anagyrine, anabasine, and coniine cause "crooked calf disease" if pregnant cows or sheep feed on these alkaloids during the first period of gestation ( , , , , , ). the offspring show strong malformations of the legs. some of the steroid alkaloids (e.g., cyclopamine, jervine, and veratrosine), which are produced by veratrum species, cause the formation of a central large cyclopean eye ( - ), an observation that was probably made by the ancient greeks and thus led to the mythical figure of the cyclops. it is likely that any herbivore which regularly feeds on plants containing these alkaloids will suffer from reduced productivity and reduced fitness in the long term. in effect, the plants which contain these alkaloids are usually avoided by vertebrate herbivores. another long-term effect, caused by alkaloids with carcinogenic properties, has been discovered only recently (tables iv and v). the alkaloid aristolochic acid, which is produced by plants of the genus aristolochia, is carcinogenic. the mechanism of action of this alkaloid is believed to be similar to that of the well-known carcinogen nitrosamine ( , ), because of its no2 group. pyrrolizidine alkaloids and their n-oxides are abundantly produced by members of the asteraceae and boraginaceae, but also occur in the families apocynaceae, celastraceae, elaeocarpaceae, euphorbiaceae, fabaceae, orchidaceae, poaceae, ranunculaceae, and rhizophoraceae ( , ). after oral intake, the n-oxides are reduced by bacteria in the gut. the lipophilic alkaloid base is resorbed and transported to the liver, where it is "detoxified" by microsomal enzymes. as a result, a reactive alkylating agent is generated, which can be considered as a pyrrolopyrrolidine. the alkaloid can then cross-link dna and rna and thus cause mutagenic or carcinogenic effects (especially in the liver) ( ). 
thus, pyrrolizidine alkaloids represent highly evolved and sophisticated antiherbivore compounds, which exploit the widespread and active detoxification system of the vertebrate liver. the pa story is very intriguing, since it shows how ingenious nature was in the "arms race": the herbivores invented detoxifying enzymes, and nature the compound which is activated by this very process. a herbivore feeding on pa-containing plants will eventually die, usually without reproducing properly. only those individuals which carefully avoid the respective bitter-tasting plants maintain their fitness and thus survive. the protection due to pas can easily be seen in meadows, where senecio and other pa-containing plants are usually not taken by cows and sheep, at least as long as other food is available. protein biosynthesis is essential for all cells and thus another important target. indeed, a number of alkaloids have already been detected (although few have been studied in this context) that inhibit protein biosynthesis in vitro (table iv), such as vincristine, vinblastine, emetine, tubulosine, tyramine, sparteine, lupanine and other quinolizidine alkaloids, cryptopleurine, harringtonine, homoharringtonine, haemanthamine, isoharringtonine, lycorine, narciclasine, pretazettine, pseudolycorine, tylocrebrine, tylophorine, and tylocrepine. for lupine alkaloids, it was determined that the steps which are inhibited are the loading of trna with amino acids, as well as the elongation step. the inhibitory activity was strongly expressed in heterologous systems; that is, protein biosynthesis in the producing plants, such as lupines, was not affected ( ). electron chains: the respiratory chain and atp synthesis in mitochondria demand a controlled flux of electrons. this target seems to be attacked by ellipticine, pseudane, pseudene, alpinigenine, sanguinarine, tetrahydropalmatine, long-chain alkyl methylpiperidines, capsaicin, the hydroxamic acid dimboa, and solenopsine. 
as mentioned before, however, only a few alkaloids have been evaluated in this context (table v). biomembranes and transport processes: a cell can operate only when it is enclosed by an intact biomembrane and organized by a complex compartmentation that provides separate reaction chambers. because biomembranes are impermeable to ions and polar molecules, cells can prevent the uncontrolled efflux of essential metabolites. the controlled flux of these compounds across biomembranes is achieved by specific transport proteins, which can be ion channels, pores, or carrier systems. these complex systems are also targets of many natural products (table iv). disturbance of membrane stability is achieved by -methoxyellipticine, ellipticine, berbamine, cepharanthine, tetrandrine, steroidal alkaloids, irehdiamine, and malouetine. steroidal alkaloids such as solanine and tomatine, which are present in many members of the solanaceae, can complex with cholesterol and other lipids of biomembranes; cells are thus rendered leaky. cells carefully control the homeostasis of their ion concentrations by the action of ion channels (na+, k+, ca2+ channels) and through na+/k+-atpase and ca2+-atpase. these channels and pumps are involved in signal transduction, active transport processes, and neuronal and neuromuscular signaling. inhibition of transport processes (ion channels, carriers) is achieved by (table iv) acronycine, ervatamine, harmaline, quinine, reserpine, colchicine, nitidine, salsolinol, sanguinarine, stepholidine, caffeine, sparteine, monocrotaline, steroidal alkaloids, aconitine, capsaicin, cassaine, maitotoxin, ochratoxin, palytoxin, pumiliotoxin, saxitoxin, solenopsine, and tetrodotoxin. a special class of ion channels, located in the central nervous system and involved in neuromuscular signal transfer, is coupled with receptors of neurotransmitters such as noradrenaline (na), serotonin, dopamine, glycine, and acetylcholine (ach). we can distinguish two types. 
type 1 is a ligand-gated channel (i.e., a receptor which is part of an ion-channel complex), such as the nicotinergic ach receptor. in type 2 the receptor is an integral membrane protein. when a neurotransmitter binds, the receptor changes its conformation and induces a conformational change in an adjacent G-protein molecule, which consists of three subunits. the α subunit then activates the enzyme adenylate cyclase, which in turn produces cAMP from ATP. the cAMP molecule is a second messenger which activates protein kinases or ion channels directly, which in turn open for milliseconds (e.g., the muscarinergic ach receptor). a number of alkaloids are known whose structures are more or less similar to those of endogenous neurotransmitters. targets can be the receptor itself, the enzymes which deactivate neurotransmitters, or transport processes which are important for the storage of the neurotransmitters in synaptic vesicles. alkaloids relevant here include (table iv) brucine, ergot alkaloids, eseridine, serotonin, physostigmine, gelsemine, β-carboline alkaloids, strychnine, yohimbine, berberine, bicuculline, bulbocapnine, columbamine, coptisine, coralyne, corlumine, ephedrine, galanthamine, laudanosine, nuciferine, palmatine, papaverine, thebaine, cytisine and other quinolizidine alkaloids, heliotrine, chaconine and other steroidal alkaloids, cocaine, atropine, scopolamine, anabaseine, arecoline, dendrobine, gephyrotoxin, histrionicotoxin, methyllycaconitine, muscarine, nicotine, pilocarpine, psilocin, psilocybin, morphine, mescaline, and reserpine. a number of these alkaloids are known hallucinogens, which certainly decrease the fitness of a herbivore feeding on them regularly. cytoskeleton. many cellular activities, such as motility, endocytosis, exocytosis, and cell division, rely on microfilaments and microtubules.
a number of alkaloids have been detected which can interfere with the assembly or disassembly of microtubules (table iv), namely, vincristine, vinblastine, colchicine, maytansine, maytansinine, and taxol. colchicine, the major alkaloid of colchicum autumnale (liliaceae), inhibits the assembly of microtubules and of the mitotic spindle apparatus. as a consequence, chromosomes are no longer separated, leading to polyploidy. whereas animal cells die under these conditions, plant cells maintain their polyploidy, a trait often used in plant breeding because polyploidy leads to bigger plants. because of this antimitotic activity, colchicine has been tested as an anticancer drug; however, it was abandoned because of its general toxicity. the derivative colcemid is less toxic and can be employed in the treatment of certain cancers ( ). cellular motility is also impaired by colchicine; this property is exploited in medicine in the treatment of acute gout, in order to prevent the migration of macrophages to the joints. for normal cells, and thus for herbivores, the negative effects can easily be anticipated, and colchicine is indeed a very toxic alkaloid which is easily resorbed because of its lipophilicity. another group of alkaloids with antimitotic properties are the bisindole alkaloids, such as vinblastine and vincristine, which have been isolated from catharanthus roseus (apocynaceae). these alkaloids also bind to tubulin ( ). both alkaloids are very toxic but are nevertheless important drugs for the treatment of some leukemias. from taxus baccata (taxaceae) the alkaloid taxol has been isolated. taxol also affects the architecture of microtubules, by inhibiting their disassembly ( ). nonalkaloidal compounds to be mentioned in this context include the lignan podophyllotoxin ( ).
in conclusion, any alkaloid which impairs the function of microtubules is likely to be toxic, because of their importance for the cell, and is, from the point of view of defense, an effective and well-shaped molecule. enzyme inhibition. the inhibition of metabolically important enzymes is a wide field that cannot be discussed in full here (see table iv). briefly, inhibition of cAMP metabolism (which is important for signal transduction and amplification in cells) provides some examples, namely, inhibition of adenylate cyclase by anonaine, isoboldine, and tetrahydroberberine, and inhibition of phosphodiesterase by ethyl-β-carboline, β-carboline-propionic acid, papaverine, caffeine, theophylline, and theobromine. inhibition of hydrolases, such as glucosidase, mannosidase, trehalase, and amylase, is specifically achieved by some alkaloids (table iv). b. action at organ level. whereas the activities mentioned before are more or less directed to molecular targets present in or on cells, there are also some activities that function at the level of organ systems or complete organisms, although, ultimately, they have molecular targets, too. central nervous system and neuromuscular junction. a remarkable number of alkaloids interfere with the metabolism and activity of neurotransmitters in the brain and nerve cells, a fact known to man for thousands of years (table iv). the cellular interactions have been discussed above. disturbance of neurotransmitter metabolism impairs sensory faculties, such as smell, vision, or hearing, or may produce euphoric or hallucinogenic effects. a herbivore that is no longer able to control its movements and senses properly has only a small chance of survival in nature, because it will have accidents (falling from trees or rocks, or into water) and be killed by predators. thus euphoric and hallucinogenic compounds, which are present in a number of plants, and also in fungi and the skin of certain toads, can be regarded as defense compounds.
some individuals of homo sapiens use these drugs just because of their hallucinogenic properties, but here also it is evident that long-term use reduces survival and fitness dramatically. the activity of muscles is controlled by ach and na. it is plausible that an inhibition or activation of neurotransmitter-regulated ion channels will severely influence muscular reactivity and thus the mobility or organ function (heart, blood vessels, lungs, gut) of an animal. in the case of inhibition, muscles will relax; in the case of overstimulation, muscles will be tense or in tetanus, leading to a general paralysis. alkaloids which activate neuromuscular action (so-called parasympathomimetics) include nicotine, arecoline, physostigmine, coniine, cytisine, and sparteine. inhibitory (or parasympatholytic) alkaloids include hyoscyamine and scopolamine (see above) ( ). skeletal muscles, as well as muscle-containing organs such as lungs, heart, circulatory system, and gut, and the nervous system are certainly very critical targets. the compounds are usually considered to be strong poisons, and it is obvious that they serve as chemical defense compounds against herbivores, since a paralyzed animal is easy prey for predators or, if higher doses are ingested, will die directly (compare the LD50 values in table ). inhibition of digestive processes. food uptake can be reduced by a pungent or bitter taste in the first instance, as mentioned earlier. the next step may be the induction of vomiting, diarrhea, or the opposite, constipation, all of which negatively influence digestion in animals. the ingestion of a number of allelochemicals such as emetine, lobeline, morphine, and many other alkaloids causes these symptoms ( ). another mode of interference would be the inhibition of carriers for amino acids, sugars, or lipids, or of digestive enzymes.
relevant alkaloids are the polyhydroxy alkaloids, such as swainsonine, deoxynojirimycin, and castanospermine, which selectively inhibit hydrolytic enzymes such as glucosidase, galactosidase, trehalase (trehalose is a sugar of insects which is hydrolyzed by trehalase), and mannosidase (table iv). nutrients and xenobiotics (such as secondary metabolites) are transported to the liver after resorption in the intestine. in the liver, the metabolism of carbohydrates, amino acids, and lipids takes place, with the subsequent synthesis of proteins and glycogen. the liver is also the main site for the detoxification of xenobiotics. lipophilic compounds, which are easily resorbed from the diet, are often hydroxylated and then conjugated with a polar, hydrophilic molecule, such as glucuronic acid, sulfate, or amino acids ( ). these conjugates, which are more water soluble, are exported via the blood to the kidney, where they are transported into the urine for elimination. both liver and kidney systems are affected by a variety of secondary metabolites, and the pyrrolizidine alkaloids have been discussed earlier (tables iv and v). these alkaloids are activated during the detoxification process, and this can lead to liver cancer. also, many other enzyme or metabolic inhibitors (e.g., amanitin), discussed previously, are liver toxins. many alkaloids and other allelochemicals are known for their diuretic activity ( ). for a herbivore, an increased diuresis would also mean an augmented elimination of water and essential ions. since Na+ is already limited in plant food (an antiherbivore device?), long-term exposure to diuretic compounds would reduce the fitness of a herbivore substantially. disturbance of reproduction. quite a number of allelochemicals are known to influence the reproductive system of animals, which ultimately reduces their fitness and numbers. antihormonal effects could be achieved by mimicking the structure of sexual hormones.
these effects are not yet known for alkaloids, but have been confirmed for other natural products. estrogenic properties have been reported for coumarins, which dimerize to dicoumarols, and for isoflavones ( , ). insect molting hormones, such as ecdysone, are mimicked by many plant sterols, including ecdysone itself, as in the fern polypodium vulgare, or azadirachtin from the neem tree ( , ). juvenile hormone is mimicked by a number of terpenes present in some coniferae. spermatogenesis is reduced by gossypol from cottonseed oil ( ). the next target is the gestation process itself. as outlined above, a number of alkaloids are mutagenic and lead to malformation of the offspring or directly to the death of the embryo (table v). the last step would be the premature abortion of the embryo. this dramatic activity has been reported for a number of allelochemicals, such as mono- and sesquiterpenes and alkaloids. some alkaloids achieve this by the induction of uterine contractions, such as the ergot and lupine alkaloids ( ). the antireproductive effects are certainly widely distributed, but they often remain unnoticed under natural conditions. nevertheless, they are defense strategies with long-term consequences. blood and circulatory system. all animals need to transport nutrients, hormones, ions, signal compounds, and gases between the different organs of the body, which is achieved in higher animals through blood in the circulatory system. inhibitors of the driving force for this process, the heart muscle, have already been discussed. however, the synthesis of red blood cells is also vulnerable and can be inhibited by antimitotic alkaloids such as vinblastine or colchicine ( ). some allelochemicals, such as saponins, have hemolytic properties. if resorbed, these compounds complex membrane sterols and make the cells leaky. steroidal alkaloids from solanum or veratrum species display this sort of activity, as well as influencing ion channels (table iv).
allergenic effects. a number of secondary metabolites influence the immune system of animals, such as coumarins, furanocoumarins, hypericin, and helenalin. common to these compounds is a strong allergenic effect on those parts of the skin or mucosa that have come into contact with the compounds ( , , ). activation or repression of the immune response is certainly a target that was selected during evolution as an antiherbivore strategy; the function of alkaloids in this context is hardly known. this selection of alkaloid activities, though far from complete, clearly shows that many alkaloids inhibit central processes at the cellular, organ, or organismal level, an important requisite for a chemical defense compound. however, most of the potential targets for the alkaloids known at present remain to be established. if no activity has been reported, it often means that nobody has looked into the question scientifically, not that a particular alkaloid is without a certain biological property. summarizing this section, it is safe to assume that most alkaloids can affect animals, and thus herbivores, significantly. dead plants easily rot through the action of bacteria and fungi, whereas metabolically active, intact plants are usually healthy and do not decay ( ). how is this achieved? the aerial organs of terrestrial plants have epidermal cells that are covered by a more or less thick cuticle, which consists of waxes, alkanes, and other lipophilic natural products ( , ). this cuticle layer is water repellent and chemically rather inert, and it thus constitutes an important penetration barrier for most bacteria and fungi. in perennial plants and in roots we find another variation of this principle, in that plants often form resistant bark tissues. the only way for microbes to enter a healthy plant is via the stomata or at sites of injury inflicted by herbivory, wind, or other accidents.
at the site of wounding, plants often accumulate suberin, lignin, callose, gums, or other resinous substances which close off the respective areas ( , ). in addition, antimicrobial agents are produced, such as lysozyme and chitinase, lytic enzymes stored in the vacuole which can degrade bacterial and fungal cell walls; protease inhibitors, which can inhibit microbial proteases; and secondary metabolites with antimicrobial activity. secondary metabolites have been routinely screened for antimicrobial activities by many researchers, since the corresponding assays are relatively easy to perform. these studies have usually been directed toward a pharmaceutical application, and they often employ the routine methods for screening microbial or fungal antibiotics. it may happen that these tests do not detect an antibacterial activity of a compound because the wrong test species or a nonrelevant concentration was assayed. in the pharmaceutical context we search for very active compounds which can be employed at low concentrations. therefore, the higher concentrations, which would be more meaningful ecologically, are often not tested. these precautions have to be kept in mind when screening the literature for data on the antimicrobial activity of alkaloids. secondary compounds known for their antimicrobial activity include many phenolics (e.g., flavonoids, isoflavones, and simple phenolics), glucosinolates, nonproteinogenic amino acids, cyanogenic glycosides, acids, aldehydes, saponins, triterpenes, mono- and sesquiterpenes, and, last but not least, alkaloids ( , , , , ). in table vi alkaloids are tabulated for which antibacterial activities have been detected. the alkaloids usually affect more gram-positive than gram-negative bacteria.
especially well represented are alkaloids which derive from tryptophan (indole alkaloids) and from phenylalanine/tyrosine, which may be due to the fact that these alkaloids have received considerable scientific attention since the discovery of many medicinally important compounds within these groups ( , , , , , , - ). [table vi fragment: entries such as hydroxytabernamine, hydroxytetrahydrosecamine, tetrandrine, thalicarpine, thalicerbine, thalidasine, thalidezine, thaliglucinone, thalistine, thalistyline, thalmelatine, thalmirabine, thalphenine, thalrugosaminine, thalrugosidine, thalrugosine, and tubocurarine; legend: +, active; -, no activity observed in the concentration range tested (many alkaloids were only assayed in low concentrations as microbial antibiotics); ad, agar diffusion; al, agar dilution; bg, biogram; ld, liquid culture; mic, minimal inhibitory concentration; pd, paper disk; sp, suspension; tlc, tlc disk test according to wolters and eilert ( ); if more than one value is given, the data refer to different bacterial species tested.] some of these alkaloids are highly antibiotic, with activities similar to those of fungal antibiotics, namely, cinchophylline ( ), dictamnine ( ), fagarine ( ), stemmadine ( ), yuehchukene ( ), liriodenine ( ), lysicamine ( ), oxonantenine ( ), sanguinarine ( ), solacasine ( , ), rutacridone epoxide ( ), tryptanthrine ( ), and tuberin ( , ) (table vi). in many instances, when alkaloids are assessed for their antibacterial activity, they are also tested for antifungal properties. usually yeasts and candida are used as test organisms (table vii). among the antifungal alkaloids listed in table vii are thaliglucinone ( ), demissidine ( , ), solacasine ( ), soladulcidine ( , ), solasodine ( , ), tomatidine ( , ), tomatine ( , ), verazine ( ), cryptopleurine ( ), hydroxyrutacridone epoxide ( ), tryptanthrine ( ), and tuberin ( ).
whereas the mode of action and targets of antibiotics of fungal and bacterial origin have been elucidated in many instances (see table iv), relevant information for plant-derived compounds is scant. however, the molecular targets of some alkaloids have been determined at the general level, but not specifically for bacterial or fungal systems (table iv), and these may be responsible for the antibiotic effects observed. the following interactions of alkaloids having antimicrobial properties with molecular targets of bacterial or fungal cells are likely (compare tables vi and vii with tables iv and v). protein biosynthesis in ribosomes is affected by sparteine ( , ), lupanine, angustifoline, tigloyloxylupanine, and hydroxylupanine ( , , , , , ). intercalation into or binding to DNA occurs with fagaronine, dictamnine ( ), harman alkaloids ( , ) [binding to DNA is light dependent ( )], berberine ( - ), chelerythrine ( ), and sanguinarine ( , ); these compounds may thus inhibit important processes, such as DNA replication and RNA transcription, that are also vital for microorganisms. the stability of biomembranes may be disturbed by cepharanthine, tetrandrine, and steroidal alkaloids such as solamargine ( ), solanine ( , , ), and solasonine ( ), thus leading to an uncontrolled flux of metabolites and ions into microbial cells. inhibition of metabolically important enzymes is achieved by berberine ( ), chelerythrine ( , ), chelidonine ( ), palmatine ( ), sanguinarine ( , ), solacongestidine ( ), and papaverine. in contrast to antibiotics of microbial origin, which could in many instances be classified as alkaloids from a chemical point of view and which often interfere with the biosynthesis or maintenance of the cell wall (murein) (table iv), such an interaction has not been described for plant-derived compounds. since this topic has not been studied in detail, it remains open whether this complex is another target for alkaloids.
we can distinguish between secondary metabolites that are already present prior to an attack or wounding, so-called constitutive compounds, and others that are induced by these processes and made de novo. inducing agents, which have been termed "elicitors" by phytopathologists, can be cell wall fragments of microbes or of the plant itself, or many other chemical constituents ( , , - ). the induced compounds are called "phytoalexins," which is merely a functional term, since these compounds often do not differ in structure from constitutive natural products. in another way this term is misleading, since it implies that the induced compound is active only in plant-microbe interactions, whereas in reality it often has multiple functions that include antimicrobial and antiherbivoral properties (see below). many of the antimicrobial alkaloids found are constitutively expressed and accumulated; that is, they are already present before an infection. using plant cell cultures, it was observed that some cultures start to produce new secondary metabolites when challenged with bacterial or fungal cell walls, culture fluids, or other chemical factors ( , , - ). among the compounds found to be inducible are alkaloids such as sanguinarine and hydroxyrutacridone epoxide (see table xi). quinolizidine alkaloids display some antimicrobial properties, besides their main role in antiherbivore defense ( ) (see table i). on wounding, qa production is enhanced, thus increasing the already high alkaloid concentration in the plant; in other words, the antimicrobial and antiherbivore effects are further amplified (table xi) ( , , ). the reactions leading to the induction and accumulation of phytoalexins with phenolic structures have been studied in molecular detail ( , , - ).
these studies revealed that plants can detect and react rapidly to environmental problems, such as wounding or infection: within minutes of elicitation, mRNAs coding for enzymes that catalyze the reactions leading to the respective defense compounds are increasingly generated, leading to the accumulation of the respective enzymes and consequently to the production of the secondary metabolites ( , , - ). similar processes are likely for alkaloids, but so far the mechanisms have not been elucidated. we assume that a substantial number of the alkaloids known have antimicrobial properties (which remain to be tested in most cases) that are directed against the ubiquitous and generalist microbes which have not specialized on a particular host plant. [table vi legend fragment: if a range is given, the first value gives a % inhibition, the second value a % inhibition.] however, alkaloid production does not necessarily have to be involved in antimicrobial defense. for example, phytophthora or fusarium will attack alkaloid-rich plants of nicotiana, solanum esculentum, and s. tuberosum. cladosporium and fusarium can develop in nutrient-containing media enriched with alkaloids, and aspergillus niger can even utilize alkaloids as a nitrogen source ( ). in addition, most plant species are known to be parasitized or infected by at least a few specialized bacteria or fungi which form close, often symbiotic, associations. in these circumstances an antimicrobial effect expected from the secondary metabolites present in the plant can often no longer be observed. we suggest that these specialists have adapted to the chemistry of their host plants. mechanisms may include inhibition of biosynthesis of the respective compounds, degradation of the products, or alteration of the target sites, which are then no longer sensitive toward a given compound (so-called target site modification). these mechanisms remain to be established for most of the microbial specialists living on alkaloid-producing plants.
some associations between plants and fungi are symbiotic in nature, such as rhizobia in root nodules of legumes or mycorrhizal fungi in many species. in lupines, nitrogen-fixing rhizobia are present in both alkaloid-rich and alkaloid-free plants. they must therefore be able to tolerate the alkaloids, which are also present in the root. alkaloid production in lupines is more or less unaffected whether or not the plants harbor rhizobia ( , ). an ecologically important symbiosis between plants and fungi can be observed in fungal species that produce ergot alkaloids: graminaceous species that are infected by ergot suffer much less from herbivory because of the strong antiherbivoral alkaloids produced by the fungi ( ). a similar relationship may occur for other fungal associates of plants, many of which produce secondary metabolites possessing animal toxicity. from the pharmaceutical point of view, few alkaloids are interesting as antibiotics, because many are highly toxic to vertebrates (tables i and ). since many alkaloids are antibacterial and antifungal (tables vi and vii) and are present in plants at relatively high concentrations (section iii,a), it seems likely from an ecological perspective that alkaloids, besides their prominent role in antiherbivore strategies, may also play an important role in the defense against microbial infections. it should be recalled that even alkaloid-producing plants synthesize antimicrobial proteins, such as chitinase and lysozyme, and other antimicrobial secondary products, such as simple phenolics, flavonoids, anthocyanins, saponins, and terpenes ( - , ). a cooperative, or even synergistic, process could thus be operating. c. antiviral properties. plants, like animals, are hosts for a substantial number of viruses, which are often transmitted by sucking insects such as aphids and bugs (heteroptera).
resistance to viral infection can be achieved either by biochemical mechanisms that inhibit viral development and multiplication or by warding off vectors such as aphids in the first place. the assessment of antiviral activity is relatively difficult. as a result, only a few investigators have studied the influence of alkaloids on virus multiplication. nevertheless, a number of alkaloids have been reported with antiviral properties (table viii). only sparteine ( ) and cinchonidine ( ) have been tested for antiviral activities against a plant virus, the potato virus x; all other evidence for antiviral activities compiled in table viii is therefore difficult to interpret at present. polyhydroxy alkaloids, such as swainsonine, can block the action of endoplasmic reticulum- and golgi-localized glucosidases and mannosidases, which are important for the posttranslational trimming of viral envelope proteins. because alkaloids often deter the feeding of insects, such as aphids and bugs (table i), viral infection rates may be reduced in alkaloid-rich plants. such a correlation exists between alkaloid-rich lupines (so-called bitter lupines) and low-alkaloid varieties (the so-called sweet lupines) (see table xii). plants often compete with other plants, of either the same or different species, for space, light, water, and nutrients. this phenomenon can be intuitively understood when the flora of deserts or semideserts is analyzed, where resources are limited and competition thus intense ( , , - ). a number of biological mechanisms have been described, such as temporal spacing of the vegetation period, in which some species flower at an earlier season, when others are still dormant or ungerminated. it was observed by molisch ( ) that plants can also influence each other by their constituent natural products, and he coined the term "allelopathy" for this process.
secondary products are often excreted by the root or rhizosphere into the surrounding soil, or they are leached from the surface of intact leaves or from decaying dead leaves by rain ( , ). both processes will increase the concentration of allelochemicals in the soil surrounding a plant, where the germination of a potential competitor may occur. allelopathy, namely, the inhibition of germination or of the growth of a seedling or plant by natural products, is well documented at the level of controlled in vitro experiments ( , , , - ), but how it operates in ecosystems is still often a matter of controversy. it is argued, for example, that soil contains a wide variety of microorganisms which can degrade most organic compounds; thus allelochemicals might never reach concentrations high enough to be allelopathic. allelopathic natural products have been recorded in all classes of secondary metabolites. few research groups have studied the effect of alkaloids in this context, but a number of alkaloids have been reported with allelopathic properties (table ix). as can be seen from table ix, allelopathic activities can be found within nearly all structural types of alkaloids. at higher alkaloid concentrations, a marked reduction in the germination rate can be recorded regularly. more sensitive, however, is the growth of the radicle and hypocotyl. they respond to alkaloids at a much lower level, and usually a reduction in growth can be observed, but sometimes also the opposite, either of which reduces the fitness of a seedling. in species which produce the compounds, the inhibitory effects can be absent, as was reported for quinolizidine alkaloids in lupines and for colchicine in colchicum autumnale ( , ). it is likely that autotoxicity is prevented either by a special modification of cellular target sites or by other mechanisms.
the underlying mechanisms probably include interactions discussed earlier, such as intercalation into or binding to DNA [e.g., by various alkaloids ( , , ), berberine ( - ), sanguinarine ( , ), and veratrum alkaloids] and inhibition of protein biosynthesis [e.g., by emetine ( ) and quinolizidine alkaloids ( , , - , )] (compare tables iv and ix). the inhibitory action of quinolizidine alkaloids should be explained in this context ( , ). they are very abundant in lupine seeds (up to several percent of dry weight). during germination, hydroxylupanine is converted to ester alkaloids, such as tigloyloxylupanine. the latter compound is predominantly excreted via the roots of young seedlings and in germination assays proved to be the most allelopathic qa. these alkaloids influence only heterologous systems, not the germination of lupine seeds themselves. when lupine and lepidium seeds were grown together in the same pot, growth of the lepidium seedlings was much reduced and inhibited, indicating that qas may also be relevant in the ecological context ( ). although the number of alkaloids with known allelopathic properties is not large, owing to the limited number of studies conducted, it is clear from table ix that alkaloids can be toxic to plants, probably by interfering with basic metabolic or molecular processes. although comparably few alkaloids have been studied for their biological activities in detail, and considering that our data collection (tables i-ix) is far from complete, we can safely state that alkaloids have potent deterrent or poisonous properties in herbivorous animals, and also affect bacteria, fungi, viruses, and plants. the next question is whether all the adverse activities of alkaloids, which are often assayed in in vitro systems only, are meaningful in nature. because most of the allelochemical activities are dose dependent (others may be synergistic, additive, etc.), the question is whether the amounts of alkaloids produced and stored in plants are high enough to be ecologically meaningful. it is difficult, and also dangerous, to make a general statement concerning alkaloid levels in plants.
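the dose dependence stressed above can be sketched numerically with a standard log-logistic (hill-type) dose-response curve; the ED50 of 2 mM and the hill slope used below are hypothetical values chosen purely for illustration, not data taken from the chapter's tables.

```python
# dose-dependence of an allelochemical effect, modeled with a standard
# log-logistic (hill) curve. the ED50 and hill slope are hypothetical
# illustration values, not measurements from this chapter.
def fractional_effect(dose: float, ed50: float, hill: float = 1.0) -> float:
    """fraction of maximal effect (0..1) at a given dose (same units as ed50)."""
    if dose <= 0:
        return 0.0
    return 1.0 / (1.0 + (ed50 / dose) ** hill)

# hypothetical alkaloid with ED50 = 2 mM against seedling root growth:
# at 2 mM exactly half the maximal inhibition is reached, below it much less.
for dose_mm in (0.5, 2.0, 8.0):
    print(dose_mm, round(fractional_effect(dose_mm, ed50=2.0), 2))
```

under these assumed numbers the effect rises from 20% at 0.5 mM to 80% at 8 mM, which illustrates why assay concentration matters when judging ecological relevance.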
we must remember that alkaloid composition and levels are often tissue or organ specific ( , , ). they may vary during the day [a diurnal cycle has been observed for qas and tropane alkaloids ( , , )] or during the vegetation period ( , ). furthermore, as in all biological systems, there are differences at the level of individual plants and between populations and subspecies. unfortunately, many phytochemical reports do not contain any quantitative information, or these data are given for the whole plant without taking the above-mentioned variables into account. in addition, concentrations are usually given on a dry weight basis, which is appropriate in the chemical or pharmaceutical context. however, herbivores or pathogens do not in general feed on the dry plant, but on the "wet" fresh material. in the context of chemical ecology we urgently need data on a fresh weight basis. as an approximation, in this chapter we use a fixed conversion factor to convert dry weight data to fresh weight data if only the dry weight data are available. summarizing the relevant phytochemical literature, we find that alkaloid levels are between . and % (dry weight), which is equivalent to 0.01-1.5% fresh weight, or 0.1-15 mg/g fresh weight. for plants containing quinolizidine alkaloids, actual alkaloid contents are given for a number of organs or parts (table x), which fall in the range deduced before. we have evaluated the situation for quinolizidine alkaloids and found that the actual concentrations of alkaloids in the plant are usually much higher than the concentrations needed to inhibit, deter, or poison a microorganism or herbivore ( , , , ). this means that plants obviously play it safe and store more defense chemicals than actually needed. if we look at the ED50 and LD50 values given in tables i through ix, it is likely that the situation is similar for other alkaloid-producing plants, but these correlations need to be experimentally established in most instances.
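the dry-to-fresh weight conversion described above can be sketched as a small calculation; the factor of 10 used here is only an illustrative assumption (the chapter's actual factor is not legible in this copy), as is the example concentration.

```python
# illustrative sketch: converting an alkaloid concentration from a dry weight
# basis to a fresh weight basis. the conversion factor (10) and the example
# value (1.5% dry weight) are assumptions for illustration only.

DRY_TO_FRESH_FACTOR = 10  # assumed: fresh tissue weighs ~10x its dry residue

def dry_percent_to_fresh_percent(dry_pct: float,
                                 factor: float = DRY_TO_FRESH_FACTOR) -> float:
    """convert a % (dry weight) concentration to % (fresh weight)."""
    return dry_pct / factor

def fresh_percent_to_mg_per_g(fresh_pct: float) -> float:
    """convert % (fresh weight) to mg alkaloid per g fresh tissue (1% = 10 mg/g)."""
    return fresh_pct * 10.0

# hypothetical tissue with 1.5% alkaloid on a dry weight basis
fresh = dry_percent_to_fresh_percent(1.5)
print(round(fresh, 6))                              # % fresh weight
print(round(fresh_percent_to_mg_per_g(fresh), 6))   # mg per g fresh weight
```

under these assumptions, 1.5% dry weight corresponds to 0.15% fresh weight, i.e., 1.5 mg of alkaloid per gram of fresh tissue.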
It seems trivial that plants not only synthesize but also store their secondary products, yet this makes sense only in view of their ecological functions as defense compounds, since they can fulfill these functions only if the amounts stored are appropriate. Achieving and maintaining high levels of a defense compound is very demanding from the point of view of physiology and biochemistry. Most allelochemicals would probably interfere with the metabolism of the producing plant if they accumulated in the compartments where they are made ( ). Whereas biosynthesis takes place in the cytoplasm, in vesicles (berberine), or in organelles such as chloroplasts (QAs, coniine), the site of accumulation of water-soluble alkaloids is the central vacuole, and that of lipophilic compounds includes latex, resin ducts, or glandular hairs (e.g., nicotine) ( , ). In this context it should be recalled that many alkaloids are charged molecules at cellular pH and do not diffuse across biomembranes easily. During recent years, evidence has been obtained that at least some alkaloids pass the tonoplast with the aid of a carrier system. The next problem is determining how the uphill transport, that is, the accumulation against a concentration gradient, is achieved. Proton-alkaloid antiport mechanisms as well as ion trap and chemical trap mechanisms have been postulated and partially proved experimentally ( , , ). Thus, the sequestration of high amounts of alkaloids in the vacuole is a complex and energy-requiring task, which would certainly have been lost during evolution were it not important for fitness. As a rule of thumb, we can assume that all parts of an alkaloidal plant contain alkaloids, although the site of synthesis is often restricted to a particular organ, such as the roots or leaves. Translocation via the phloem or xylem, or apoplastically, must therefore occur.
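The ion-trap mechanism mentioned above can be made quantitative with the Henderson-Hasselbalch relation: the neutral form of a weak-base alkaloid diffuses across the tonoplast, while the protonated form is membrane-impermeant and accumulates in the acidic vacuole. The sketch below is ours, and the pKa and pH values are illustrative assumptions, not measurements from the chapter.

```python
# Sketch of vacuolar "ion trapping" of a monobasic weak-base alkaloid.
# Only the neutral species is assumed to permeate the tonoplast and
# equilibrate; the protonated species is trapped. At equilibrium the
# ratio of TOTAL (neutral + protonated) concentrations follows from
# the Henderson-Hasselbalch equation. pKa and pH values are illustrative.

def ion_trap_ratio(pka: float, ph_vacuole: float, ph_cytosol: float) -> float:
    """Total vacuole/cytosol concentration ratio for a weak base."""
    total_vacuole = 1 + 10 ** (pka - ph_vacuole)  # (neutral + protonated)/neutral
    total_cytosol = 1 + 10 ** (pka - ph_cytosol)
    return total_vacuole / total_cytosol

# e.g., an alkaloid with pKa 8.0, vacuole pH 5.5, cytosol pH 7.3:
ratio = ion_trap_ratio(8.0, 5.5, 7.3)  # roughly a 50-fold accumulation
```

Even this passive mechanism yields an accumulation of one to two orders of magnitude; carrier-mediated antiport on top of it explains the much higher local concentrations reported in Table X.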
Phloem transport has been demonstrated for quinolizidine, pyrrolizidine, and indolizidine alkaloids, and xylem transport for nicotine and tropane alkaloids ( , , ). If the plant relies on alkaloids as defense compounds, these molecules have to be present at the right place and at the right time. Alkaloids are often stored in specific cell layers, which can differ from the site of biosynthesis ( , , ). In lupines, but also in other species ( , ), alkaloids are preferentially accumulated in epidermal and subepidermal cell layers, reaching local concentrations between and mM (Table X). This seems advantageous from the point of view of chemical ecology, since a pathogen or small herbivore encounters a high alkaloid barrier when trying to invade a lupine. The accumulation of many alkaloids in the root or stem bark, such as berberine, cinchonine, and quinine, can be interpreted in a similar way. A number of plants produce laticifers filled with latex. For example, isoquinoline alkaloids in the family Papaveraceae are abundant in the latex ( ), where they are sequestered in many small latex vesicles. In the latex vesicles of Chelidonium majus the concentration of protoberberine and benzophenanthridine alkaloids can be in the range of . - . M, which is achieved by their complexation with equal amounts of chelidonic acid ( ). If a herbivore wounds such a plant, the latex spills out immediately. Besides gluing the mandibles of an insect, the high concentration of deterrent and toxic alkaloids will usually do the rest; indeed, Chelidonium plants are hardly attacked by herbivores. In addition, as these alkaloids are also highly antimicrobial (Table IV), the site of wounding is quickly sealed and impregnated with natural antibiotics. Other well-known plants that carry biologically active alkaloids in their latex belong to the families Papaveraceae (genera Papaver, Macleaya, and Sanguinaria) and Campanulaceae (genus Lobelia) ( ).
It is intuitively plausible that a valuable plant organ must be more protected than others. Alkaloid levels are usually highest during the time of flowering and fruit/seed formation. In annual species, actively growing young tissue, leaves, flowers, and seeds are often alkaloid-rich, whereas in perennial ones, like shrubs and trees, we find alkaloid-rich stem and root barks in addition. All these plant parts and organs have in common that they are important for the actual fitness or for the reproduction, and thus the long-term survival, of the species. Spiny species, which invest in mechanical defense, accumulate fewer alkaloids than soft-bodied ones ( ); examples are isoquinoline alkaloids in cacti and QAs in legumes ( ). If a plant produces few and large seeds, their alkaloid levels tend to be higher than in species with many small seeds ( , ); thus, a plant with few and big seeds is generally a rich source of alkaloids, which makes sense in view of the defense hypothesis. These few examples show that accumulation and storage of alkaloids have been optimized in such a way that the compounds are present at strategically important sites, where they can ward off an intruder at the first instance of attack. Thus, specialized storage locations must be regarded as adaptive. Alkaloid concentrations can fluctuate during the vegetation period, or even during a single day ( , ), but in biochemical terms their biosynthesis and accumulation are constitutive processes. This ensures that a certain level of defensive compounds is present at any time. Furthermore, continuous turnover is a common theme for molecules of the cell whose integrity is important, such as proteins, nucleic acids, and signal molecules. The same seems to be true for defense compounds. An alkaloid that mimics a neurotransmitter, such as hyoscyamine, nicotine, or sparteine, could be oxidized or hydrolyzed in the cell by chance and thus would be automatically inactivated.
Only by replacing these molecules continuously can the presence of the active compounds be guaranteed. For example, it was suggested that nicotine has a half-life of hr in Nicotiana plants, and that more than % of the CO2 fixed passes through this alkaloid ( ). For other groups of natural products it was possible to show that plants can react to infection by microbes or to wounding by herbivores by inducing the production of new defense compounds. Such compounds are termed "phytoalexins" in phytopathology ( , , ). Classic examples of phytoalexins include isoflavones, phenolics, terpenes, protease inhibitors, coumarins, and furanocoumarins. Using plant cell cultures, it could be shown that a similar process occurs in some alkaloidal plants, which start to produce alkaloids with antimicrobial properties (e.g., sanguinarine, canthin-6-one, rutacridone alkaloids) when challenged with elicitors from bacterial or fungal cell walls (Table XI). But what is the situation after herbivory? When plants are eaten by large herbivores, de novo synthesis would be almost useless for the plant (except maybe for trees), since it would not be quick enough. The situation is different, however, for small herbivores such as insects or worms, which may feed on a particular plant for days or weeks. Here the de novo production of an allelochemical would be worthwhile, and there are indeed some preliminary experimental data that support this view. In Liriodendron tulipifera several aporphine alkaloids, otherwise not present, accumulate after wounding ( ). In tobacco the production of nicotine, in lupines that of QAs, and in Atropa belladonna that of hyoscyamine are induced by wounding, thus increasing the already high levels of alkaloids by up to a factor of . Whereas the response was seen after - hr in lupines, it took days in Nicotiana and in Atropa (Table XI).
We suggest that the wound-induced stimulation of alkaloid formation is not an isolated phenomenon, but rather an integral part of the chemical defense system. The induced antimicrobial and antiherbivoral responses show that plants can detect environmental stress and that secondary metabolism is flexible and incorporated into the overall defense reactions. Many details of how a plant perceives and transmits this information remain to be disclosed, but this will surely be a stimulating area of research in the future. Although the physiology and metabolism of most alkaloids are extremely intricate ( ) and often not known, the available data suggest that they are organized and regulated in such a way that alkaloids can fulfill their ecological defense function. In other words, the alkaloids are present at the right time, the right place, and the right concentration. The aforementioned arguments strongly support the hypothesis that alkaloids serve as defense compounds for plants. Besides circumstantial evidence, we would welcome critical experiments which clearly prove that alkaloids are indeed important for the fitness and survival of the plants producing them. We suggest that if a plant species which normally produces alkaloids is rendered alkaloid-free, it should have a reduced fitness, because it will be much more molested by microorganisms and herbivores than its alkaloid-producing counterpart. For one group of alkaloids, the quinolizidine alkaloids, these experiments have already been performed ( , , , , ). As mentioned before, QAs constitute the main secondary products of many members of the Leguminosae, especially in the genera Lupinus, Genista, Cytisus, Baptisia, Thermopsis, Sophora, Ormosia, and others ( ). Lupines have relatively large seeds which contain up to - % protein, up to % lipids, and - % alkaloids. To use lupine seeds for animal or human nutrition, Homo sapiens has, for several thousand years, cooked the seeds and leached out the alkaloids in running water.
This habit has been reported for the Egyptians and Greeks in the Old World, and for the Indians and Incas of the New World. The resulting seeds taste sweet, in contrast to the alkaloid-rich ones, which are very bitter. In Mediterranean countries people still process lupines in the old way, and sometimes the seeds are salted afterward and served as an appetizer, comparable to peanuts. At the turn of the twentieth century, German plant breeders set out to grow alkaloid-free lupines, the so-called sweet lupines. Although sweet lupines are extremely rare in nature ( in > ), the efforts were largely successful, and at present sweet varieties with an alkaloid content lower than . % exist for Lupinus albus, L. mutabilis, L. luteus, L. angustifolius, and L. polyphyllus. As far as we know, the sweet varieties differ from the original bitter wild forms only in the degree of alkaloid accumulation. This offers the chance to test experimentally whether bitter lupines have a higher fitness than sweet ones with regard to microorganisms and herbivores. The results of these experiments were clear-cut ( , , , ) (Table XII). In the greenhouse, where plants are protected from herbivores and pathogens, no clear advantage was seen. When lupines were planted in the field, however, without fencing and without man-made chemical protection, a dramatic effect was regularly encountered, especially with regard to herbivores ( , , , ). Rabbits (Oryctolagus cuniculus) and hares (Lepus europaeus) clearly prefer the sweet plants and leave the bitter plants almost untouched, at least as long as there is an alternative food source; only when close to starving will rabbits eat bitter lupines. A similar picture was seen for a number of insect species, such as aphids, beetles, thrips, and leaf-mining flies (Table XII): the sweet forms were attacked, whereas the alkaloid-rich ones were largely protected. The alkaloid-poor variety of L. luteus also became a host of Acyrthosiphon pisum ( ). In Poland, where the sweet yellow lupine is one of the more important fodder plants, the invasion of this aphid became a serious problem, not only because the aphid enfeebles the plants by sucking their phloem sap, but also because it transmits a viral disease. The disease, known as lupine narrow leafness, decreases seed production in infected plants, and the infection takes place early, that is, prior to the plants' blossoming. Thus, a mixed population of sweet and bitter lupines can, after a few generations, lose all sweet forms. Infestation by the aphid and the subsequent viral infection accelerate the elimination of alkaloid-poor plants, which, even without infection, are already inferior in seed production ( ). This observation again stresses the importance of alkaloids for the fitness of lupines. Plant breeders have also observed that bacterial, fungal, and viral diseases are more abundant in the sweet forms, but this effect has not been documented in the necessary detail. These experiments and observations clearly demonstrate the importance of QAs for lupines, but it should not be forgotten that other secondary metabolites, such as phenolics, isoflavones, terpenes, saponins, stachyose, erucic acid, and phytic acid, are also present in lupines and may exert additional or even synergistic effects. The lupine example also tells us about the standard philosophy and problems of plant breeding. With our present knowledge of the ecological importance of QAs for the fitness of lupines, it seems doubtful whether the selection of sweet lupines was a wise decision. In order to grow them we have had to build fences and, worse, to employ man-made chemical pesticides, which have a number of well-documented disadvantages. It can be assumed that similar strategies, namely, breeding away unwanted chemical traits, have been followed with our other agricultural crops, with the consequence that their overall fitness was much reduced ( ).
We can easily observe this reduced fitness by leaving crop species to themselves in the wild: they quickly disappear and do not colonize new habitats. There are, however, alternatives. Taking lupines as an example, we could devise large-scale technological procedures to remove the alkaloids from the seeds after harvest (similar to the refining of sugar from sugar beets). At present a few companies are actively exploring these possibilities. One idea is to produce pure protein, lipids, and dietary fiber from bitter seeds. A spin-off product would be the alkaloids themselves, which could be used either in medicine (sparteine is exploited as a drug to treat cardiac arrhythmia) or in agriculture as a natural plant protectant, that is, as an insecticide ( , ). It is evident, however, that each plant has developed its own strategy for survival. If all plants followed the same strategy, it would be an easy life for herbivores and pathogens, since being adapted to one species would mean being adapted to all species. This specialization becomes evident if we analyze the qualitative patterns of the secondary metabolite profiles present in a plant. We regularly see one to five main alkaloids in a plant, but also several (up to ) minor alkaloids. This qualitative pattern is not constant, but differs among organs, developmental stages, individuals, populations, and species. Normally, we classify the compounds as belonging to one or two chemical groups. This does not mean, however, that their biological activities are identical. On the contrary, the addition of a lipophilic side chain to a molecule may seem a small and insignificant variation from the chemical point of view, but it may render the compound more lipophilic, and thus more readily resorbed. In consequence, its toxicity may be higher (see QAs in Table I). Thus, a herbivore or pathogen has to adapt not only to one group of chemicals but to the individual compounds present.
As the composition of these chemicals changes, it is even more difficult for them to cope. Therefore, we suggest that structural diversity and continuous variation are means by which nature counteracts the adaptation of specialists. In medicine, we do a similar thing when we want to control microbial diseases. To overcome or to prevent resistance of bacteria toward a particular antibiotic, mixtures of structurally different antibiotics, whose molecular targets often differ, are very often applied. If only one antibiotic were given to all patients, the development of resistance would be much favored. It has been argued that alkaloids cannot have a significant role in plants because not all plant species produce alkaloids (only % of all plants do). Authors such as Robinson ( ) have overlooked the fact that if all plants produced one single alkaloid, even a very toxic one such as colchicine, it is certain that nearly all herbivores would have developed resistance toward this alkaloid. Only the variation of secondary metabolites, and thus of the targets which they affect, provides a means to develop efficient defense compounds. The arguments of Robinson would be correct if there were higher plants without any secondary metabolites which nevertheless thrived in nature; however, no such plants are known. From an evolutionary perspective it is not important whether the defense chemical is an alkaloid or a terpene; it is only essential that it affect certain important targets in herbivores or pathogens. Although the biological activities of many alkaloids have not yet been studied and their ecological functions remain to be elucidated or proved, we can nevertheless safely say that alkaloids are neither waste products nor functionless molecules; rather, they are important fitness factors, probably mostly as antiherbivore compounds.
Since nature obviously favors multitasking, additional activities, such as allelopathic or antimicrobial activities, are plausible. For quinolizidine and pyrrolizidine alkaloids, these multiple functions are already well documented (Tables I-X). Plants that defend themselves effectively constitute an ecological niche almost devoid of herbivores and pathogens. It is not surprising, then, that during evolution a number of organisms evolved which specialized on a particular host plant species and found ways to tolerate, or even to exploit, the defense chemistry of their hosts ( , - ). Compared to the huge number of potential enemies, the number of adapted specialists is usually small, and in general a status quo, or equilibrium, can be observed between the specialists (or parasites) and their hosts. A specialist is not well advised to kill its host, since this would destroy its own resources; a mutualism is more productive for survival. Host plant-specific specialists occur among bacteria, fungi, and herbivores. The interaction of the former two groups with plants is a central topic for plant pathologists, who often find that susceptible and nonsusceptible microbial strains exist. In most cases, it is not known how these microbial specialists achieved a relationship with the host plant chemistry, for example, whether they degrade the secondary metabolites or simply tolerate them. Many phytopathogenic bacteria and fungi produce their own secondary metabolites, which are often toxic to plants. It is assumed that these phytotoxins serve to weaken the host plants' defense, but maybe this is not the whole story. Many grasses are infected with fungi that produce ergot alkaloids. It has been assumed that these fungi (e.g., Claviceps) are true parasites. In recent years, however, experimental evidence has suggested that the relationship between grasses and ergot fungi may be of a symbiotic nature ( ).
Ergot alkaloids are strong vertebrate toxins (Tables I-IV); they mimic the activity of several neurotransmitters, such as dopamine, serotonin, and noradrenaline (Table IV). In fact, the impact of herbivores on grass populations that were highly infected by the fungi was smaller than on uninfected ones. This means that the fungi exploit the nutrients of their host plants and in return supply them with strong poisons which are not produced by the plants themselves. Since the fungi do not kill their hosts, this close interrelationship seems to be of mutual interest. We expect that similar relationships will be detected in the future. As mentioned earlier, a large number of mono- and oligophagous insects exist which have adapted to their host plants and the respective defense chemistry in complex fashions. In general, we can see the following main schemes ( , , , , , ). In type 1 adaptations, a species "learns" (or, as we should say, during evolution variants have been selected by natural selection which can tolerate a noxious defense compound) (a) to avoid resorption of the compound in the gut; (b) if resorption cannot be prevented, to eliminate the toxin quickly via the Malpighian tubules or to degrade it by detoxifying microsomal and other enzymes; or (c) to develop a target site that is resistant to the toxin, such as a receptor which no longer binds the exogenous ligand. Alternatively, in type 2 strategies a species not only tolerates a plant's defense compound, but exploits it for its own defense or for other purposes, such as pheromones ( , ). Examples of type 1 include Manduca sexta, whose larvae live on Nicotiana and other solanaceous plants. The alkaloids present in these plants, such as nicotine or hyoscyamine, are not stored but are degraded or directly eliminated with the feces ( ). In addition, it has been postulated that nicotine either does not diffuse into the nerve cells or that the acetylcholine receptor no longer binds nicotine as it does in "normal" animals ( ).
The potato beetle (Leptinotarsa decemlineata) lives on Solanum species containing steroid alkaloids, which are tolerated, but not stored, by this species. The bruchid beetle Callosobruchus fasciatus predates seeds of QA-rich plants, such as Laburnum anagyroides; this beetle eliminates most of the dietary cytisine with the feces ( ). Examples of type 2 are to some degree more interesting. In a number of plants, alkaloids are translocated via the phloem ( ). When aphids live on these plants, they are in direct contact with the alkaloids present. A number of examples are known which show that adapted aphids can store the dietary alkaloids: the quinolizidines in Aphis cytisorum, A. genistae, and Macrosiphum albifrons, the pyrrolizidines in Aphis jacobaeae and A. cacaliaster, and aconitine in Aphis aconiti ( , ). For the alkaloid-storing M. albifrons it was shown experimentally that the stored QAs provide protection against carnivorous beetles, such as Carabus problematicus or Coccinella septempunctata ( , ). Acyrthosiphon spartii prefers sparteine-rich Cytisus scoparius plants ( ); although it is likely that this species also stores QAs, this has not been demonstrated. Larvae of the pyralid moth Uresiphita reversalis live on QA-producing plants, such as Teline monspessulana. The larvae store some of the dietary alkaloids, especially in the integument and also in the silk glands. The uptake is both specific and selective and is achieved by a carrier mechanism. Whereas alkaloids of the oxosparteine type dominate in the plant, it is the more toxic cytisine that is accumulated by the larvae, the oxosparteines being eliminated with the feces ( , ). The larvae gain some protection from storing QAs, as was shown in experiments with predatory ants and wasps. When the larvae pupate, most of the stored alkaloids are used to impregnate the silk of the cocoon, thereby providing defense for this critical developmental stage ( , ).
The emerging moth lives cryptically, has no aposematic coloring, and does not contain alkaloids. In contrast, the alkaloid-rich larvae are aposematically colored and live openly on the plants ( , ). The larvae of the blue butterfly Plebejus icarioides feed only on lupines, which are rich in alkaloids; as far as we know, the larvae do not sequester or store the dietary alkaloids ( ). Helopeltis feeds on Cinchona bark, which is rich in cinchonine-like alkaloids; it stores them and uses them for its own defense ( ). Larvae of the butterflies Pachliopta aristolochiae, Zerynthia polyxena, Ornithoptera priamus, and Battus philenor live on Aristolochia plants and were shown to take up and sequester aristolochic acid, a carcinogenic alkaloid discussed earlier, as an effective defense compound ( , , ). The best-studied group of acquired alkaloids are the pyrrolizidines (PAs), which are produced by plants especially of the families Asteraceae and Boraginaceae ( ). Arctiid larvae of Tyria jacobaeae, Cycnia mendica, Amphicallia bellatrix, Argina cribraria, and Arctia caja were shown to store the dietary PAs and exploit them for their own defense ( , , , , - , ). In Tyria jacobaeae, Arctia caja, Diacrisia sannio, Phragmatobia fuliginosa, and Callimorpha dominula, PAs are taken up and stored in the integument ( ). Monarch butterflies (e.g., Danaus plexippus) combine two sets of natural compounds: the larvae feed on plants rich in cardiac glycosides and use these as chemical defense compounds, while the adult butterflies visit plants with PAs, where they collect PAs that are converted to pheromones or transferred to their eggs ( , , , ). A similar PA utilization scheme was observed with larvae of the moth Utetheisa ornatrix ( , ), where the compounds were shown to be deterrent to spiders and birds ( , ). The chrysomelid beetle Oreina feeds on PA-containing plants, such as Adenostyles, and stores the dietary PAs in its defense fluid ( , ).
An advanced exploitation of PAs was observed in the arctiid Creatonotos transiens ( , , , - ). The alkaloids are phagostimulants for the larvae, which are endowed with specific alkaloid receptors. Dietary pyrrolizidine N-oxides are resorbed by carrier-mediated transport. After resorption, free PAs are converted to the respective N-oxides, and (7S)-heliotrine to (7R)-heliotrine. The latter form is later converted to a male pheromone, (R)-hydroxydanaidal. PAs are stored in the integument, where they serve as defense compounds, and they are not lost during metamorphosis. In the adult moth, however, the PAs are mobilized. In the female, PAs are translocated into the ovary and subsequently into the eggs. In the male, PAs are necessary for the induction of the abdominal scent organs and concomitantly for the biosynthesis of the PA-derived pheromones, which are dissipated from these coremata. In addition, PAs are transferred into the spermatophore and thus donated to the female. A significant amount of PAs is further transferred to the eggs, which thus obtain chemical protection from the PAs previously acquired by both the male and female larvae. Marine dinoflagellates produce a number of toxins, such as saxitoxin, surugatoxin, tetrodotoxin, and gonyautoxin, that affect ion channels (Table IV). These algae are eaten by some copepods, fish, and molluscs that also store these neurotoxins ( , , , , , ). As a consequence, these animals have acquired chemical defense compounds which they can use against predators. This discussion is not meant to be complete, but it should illustrate that a number of insect herbivores exploit the chemistry of their food plants. These insects are adapted and have evolved a number of molecular and biochemical traits that can be considered prerequisites.
However, many of the respective plant-insect interactions have not yet been studied, and it is therefore likely that the acquisition of dietary defense compounds is even more widely distributed in nature than anticipated. Whereas insect herbivores are often highly host plant specific, vertebrate herbivores tend to be more of the polyphagous type, although some specialization may occur. For example, grouse (Lagopus lagopus) and capercaillies (Tetrao urogallus) prefer plants of the families Ericaceae and Coniferae, and crossbills prefer the terpene-rich seeds of Picea and Abies species. The Australian koala is oligophagous and prefers terpene-rich species of the genus Eucalyptus. For approximately million years, the only true herbivorous vertebrates have been the mammals; the Mesozoic reptiles disappeared along with the Mesophytic flora. Birds, though a few species feed on seeds and berries, seldom eat leaves (except geese and grouse), and they frequently use insects, in addition to plant parts, as a food source ( ). Although a single plant can be a host for hundreds of insect larvae, hundreds of plants comprise the daily menu of a larger mammal. The strategies of the polyphagous species include the following. (1) Avoidance of plants with very toxic vertebrate poisons (these species are usually labeled toxic or poisonous by man) by olfaction or taste discrimination; often such compounds may be described as bitter, pungent, bad smelling, or in some other way repellent. (2) Sampling of food from a wide variety of sources, thus minimizing the ingestion of high amounts of any single toxin. (3) Detoxification of dietary allelochemicals, which can be achieved by symbiotic bacteria or protozoa living in the rumen or intestines, or by liver enzymes which are specialized for the chemical modification of xenobiotics. This evolutionary trait is very helpful for Homo sapiens, since it endowed us with a means to cope with the man-made chemicals which pollute the environment.
Carnivorous animals, such as cats, are known to be much more sensitive toward plant poisons ( ). It has been suggested that these animals, which do not normally face the problem of toxic food, are simply not adapted to the handling of allelochemicals. Some animals, such as monkeys, parrots, or geese, ingest soil. For geese ( ) it was shown that the ingested soil binds dietary allelochemicals, especially alkaloids ( ); this procedure reduces the allelochemical content available for resorption. (4) Finally, animals are intelligent and can learn. The role of learning in food and toxin avoidance should not be underestimated, but it has not been studied in most species. For most vertebrate herbivores, the ways in which they manage to avoid, tolerate, or detoxify their dietary allelochemicals have not been explored. Sometimes only domesticated animals were used in experiments, but they tend to make more mistakes in food choice than wild animals. More evidence on this subject is available for Homo sapiens, who has evolved a number of "tricks," some of them obviously not anticipated by evolution. First, man tends to avoid food with bitter, pungent, or strongly scented ingredients. As a prerequisite he needs corresponding receptors in the nose and on the tongue, which evolved during the long run of evolution as a means to avoid intoxication. Second, our liver still contains a set of detoxifying enzymes which can handle most xenobiotics. Furthermore, some of these enzymes, such as the cytochrome P-450 oxidases, are inducible by dietary xenobiotics. Third, besides these biological adaptations, man has also used his brain to avoid plant allelochemicals. (a) Many fruits and vegetables are peeled. As many alkaloids and other compounds are stored in the epidermis, for example, steroid alkaloids in potato tubers or cucurbitacins in cucurbits, peeling eliminates some of these compounds from consumption. (b) Most food is boiled in water.
This leads to the thermal destruction of a number of toxic allelochemicals, such as phytohaemagglutinins, protease inhibitors, and some esters and glycosides. Many water-soluble compounds are leached out into the cooking water, which is discarded after cooking (e.g., with lupines or potatoes). (c) South American Indians ingest clay when alkaloid-rich potato tubers are on the menu. Since clay binds steroidal alkaloids, geophagy is thus an ingenious way to detoxify potential toxins in the diet ( ). (d) Man has modified the composition of allelochemicals in his crop plants, in that unpleasant taste components have been reduced by plant breeding. From the point of view of avoidance, this strategy is plausible, but, as was discussed earlier, it is deleterious from the point of view of chemical ecology: these plants often lose their resistance against herbivores and pathogens, which then has to be replaced by man-made pesticides. In general, only a few plants are exploited by man as food, as compared to the , species present on our planet. This means that even Homo sapiens, with all his ingenuity, has achieved only a rather small success, indicating the importance and power of chemical plant defenses. In this context, it is worth recalling that a number of animals are able to synthesize their own defense compounds, among them several alkaloids ( , , , - ). These animals have the common feature that they are usually slow-moving, soft-bodied organisms. Marine animals, such as molluscs, sponges, zoanthids, and fishes, have been shown to contain a variety of alkaloids, such as acrylcholine, neosaxitoxin, murexine, pahutoxin, palytoxin, petrosin, and tetramine, that are toxic to other animals ( , , , , , ). A number of nemertine worms, such as Amphiporus or Nereis, produce alkaloids such as 2,3'-bipyridyl, anabaseine, nemertelline, and nereistoxin, which are toxic to predators such as crayfish ( , , , ).
arthropod-made alkaloids include glomerine and homoglomerine in glomerus ( ), adaline in adalia ( ), coccinelline, euphococcinine, and derivatives in coccinella, epilachna, and other coccinellid beetles ( , , , ), and stenusine in stenus ( ), which are considered to be antipredatory compounds ( , , , - ). solenopsis ants produce piperidine alkaloids which resemble the plant alkaloid coniine. these alkaloids are strong deterrents and inhibit several cellular processes, such as electron transport chains (table iv) ( , ). many insects indicate the content of toxic natural products by warning colors (aposematism) or by the production of malodorous pyrazines ( , , , ). not only are lower animals able to synthesize alkaloids, but also vertebrates, especially in the class amphibia. tree frogs of the genus dendrobates accumulate steroidal alkaloids, such as batrachotoxin, pumiliotoxins a-c, gephyrotoxin, and histrionicotoxin, in their skin, which are strong neurotoxins (table iv) ( , , ). natives have used the alkaloids as arrow poisons. similar alkaloids (i.e., homobatrachotoxin) have recently been detected in passerine birds of the genus pitohui ( ). salamanders, salamandra maculosa, which are aposematically colored, produce the toxic salamandrine and derivatives, alkaloids of the steroidal group ( , , ). salamandrine is both an animal toxin (paralytic) and an antibiotic. toads (bufonidae) produce in their skin cardiac glycosides of the bufadienolide type, but also a set of alkaloids, such as adrenaline, noradrenaline, adenine, bufotenine, or bufotoxin ( , , ). except for bufotoxin, the other chemicals are, or mimic, neurotransmitters. these examples show that alkaloids found in animals can either be derived from dietary sources (see section ,d, ) or be made endogenously. common to both origins is their use as chemical defense compounds, analogous to the situation found in plants. 
in animals we can observe the trend that sessile species, such as sponges and bryozoans, or slow-moving species without armor, such as worms, nudibranchs, frogs, toads, and salamanders, produce active allelochemicals ( , , , ), but not so those with weapons, armor, or the possibility of immediate flight. plants developed a strategy similar to that of these "unprotected" animal species. in this context it seems amazing that hardly anybody has doubted the defensive role of alkaloids in animals, whereas people did, and still do, doubt it where alkaloids in plants are concerned. evidence is presented in this overview that alkaloids are not waste products or functionless molecules, as formerly assumed ( , ), but rather defense compounds employed by plants for survival against herbivores and against microorganisms and competing plants. these molecules were obviously developed during evolution through natural selection in that they fit many important molecular targets, often receptors, of cells (i.e., they are specific inhibitors or modulators), which can clearly be seen in molecules that mimic endogenous neurotransmitters (table iv; section ii,a, ,a). on the other hand, microorganisms and herbivores rely on plants as a food source. since both have survived, there must be mechanisms of adaptation to the defensive chemistry of plants. many herbivores have evolved strategies to avoid the extremely toxic plants and prefer the less toxic ones. in addition, many herbivores have potent mechanisms to detoxify xenobiotics, which allows the exploitation of at least the less toxic plants. in insects, many specialists evolved that are adapted to the defense chemicals of their host plant, in that they accumulate these compounds and exploit them for their own defense. alkaloids obviously function as defense molecules against insect predators in the examples studied, and this is further support for the hypothesis that the same compound also serves for chemical defense in the host plant. 
the overall picture of alkaloids and their function in plants and animals seems to be clear, but we need substantially more experimental data to understand fully the intricate interconnections between plants, their alkaloids, and herbivores, microorganisms, and other plants. the work of the author was supported by the deutsche forschungsgemeinschaft. i thank dr. th. 
twardowski for reading an earlier draft of the manuscript. key: cord- - irru authors: pazos, f. a.; felicioni, f. title: a control approach to the covid- disease using a seihrd dynamical model cord_uid: irru the recent worldwide epidemic of the covid- disease, for which there are no vaccines or medications to prevent or cure it, led to the adoption of public health measures by governments and populations in most of the affected countries to avoid the contagion and its spread. these measures are known as nonpharmaceutical interventions (npis), and their implementation produces social unrest and greatly affects the economy. frequently, npis are implemented with an intensity quantified in an ad hoc manner. control theory offers a worthwhile tool for determining the optimal intensity of the npis in order to avoid the collapse of the healthcare system while keeping their intensity as low as possible, yielding concrete guidance for policymakers. we propose here the use of a simple proportional controller that is robust to large parametric uncertainties in the model used. the novel sars-cov- coronavirus, which produces the disease known as covid- , was first reported on december in wuhan, province of hubei, china. with amazing speed it spread to the majority of the countries in the world. the outbreak was declared a public health emergency of international concern by the world health organization (who) on jan. , and a pandemic on march . at the moment, there is no vaccine against this virus or effective medicines to cure the disease. health systems only try to mitigate its consequences to avoid complications and fatal outcomes. this disease showed a great capacity for contagion and high fatality rates (see updated reports in [wm ]). 
patients affected by this disease experience a number of symptoms, not all clearly identified at the moment, but mainly cough, breathing difficulties, fever, loss of taste and smell, and extreme tiredness. frequently, patients develop a form of viral pneumonia that requires hospitalization and artificial mechanical ventilation in intensive care units. the large number of patients affected by this disease threatens to collapse public health systems, increasing the fatality rates through lack of available health assistance. in this context, it is very important to predict the trend of the epidemic in order to plan effective strategies to avoid its spread and to determine its impact. as the contagion is produced very easily by simple contact between people, several measures were adopted by governments, public health systems, and populations in order to reduce transmission by reducing contact rates. examples of these measures, the so-called nonpharmaceutical interventions (npis) adopted during this period, include the closing of schools, churches, bars, and factories, quarantine or physical-distancing policies, confinement of people in their homes, and lockdown, among other social impositions that produce discomfort and clearly harm the economy. this goal sparked many articles and studies recently published on the epidemic behavior. a number of them are addressed towards determining a mathematical model that represents the dynamics of the different agents involved in a population affected by the disease. the dynamics described by the model aim to make it possible to answer crucial questions, such as the maximum number of individuals that will be affected by the disease and when that maximum will occur, and to make key predictions concerning the outbreak and eventual recovery from the epidemic. this information makes it possible to devise public policies and strategies to mitigate the social impact and reduce the fatality rate. 
the seminal work [flng + ] exemplifies and analyses different strategies to control the transmission of the virus. most of the models adopted to represent the dynamical behavior of covid- are based on the sir model (see [abd ] and references therein). the sir model is a basic, widely used representation which describes key epidemiological phenomena. it assumes that the epidemic affects a constant population of n individuals. the model neglects demography, i.e., births and deaths by other causes unrelated to the disease. the population is broken into three non-overlapping groups corresponding to stages of the disease: • susceptible (s). the population susceptible to acquiring the disease. • infected (i). the population that has acquired the virus and can infect others. • recovered (r). the population that has recovered from infection and is presumed to be no longer susceptible to the disease. a brief description of these compartments is given below. susceptible people are those who have no immunity and are not infected. an individual in group s can move to group i by infection produced through contact with an infected individual. group i comprises people who can spread the disease to susceptible people. finally, an infected individual who has recovered from the disease is moved from group i to group r. some references (see, for example, [lzg + , ssb ]) consider group r as the removed population, or closed cases, which includes those who are no longer infectious due to recovery and those who died from the disease. the sum of these three compartments in the sir model remains constant and equals the initial population n. in order to better describe the spread of epidemics, many works (see, for example, [kan , gv , shd , nes ]) adopted the seir model. in the seir model a fourth group denoted as exposed (e) is added between group s and group i: • exposed (e). 
the population that has been infected with the virus, but is not yet in an infective stage capable of transmitting the virus to others. this compartment is dedicated to those people who are infected but do not infect others for a period of time, namely the incubation or latent period. other works (see [abd ] for example) consider an additional compartment at the end of the sir or of the seir model to distinguish between recovered and dead cases: • dead (d). the population dead due to the disease. in argentina, the daily death rate is . − . at the moment, in the covid- disease it is an open question whether a recovered person can get re-infected. even though some cases were recently reported, the reinfection rate appears to be statistically negligible based on early evidence. thus, these models become the sird or the seird models, respectively. other works such as [tbt + , lzg + , ssb ] consider the existence of other groups, seeking to match the proposed models with the numbers obtained from the actual covid- disease. the work presented in [gbb + ] deserves to be mentioned. this work studies the evolution of covid- in italy, and proposes a model denoted as sidarthe, where the letters correspond to eight groups denoted as susceptible, infected, diagnosed, ailing, recognized, threatened, healed, and extinct, respectively. all of them are subgroups of those presented in the seir model. this model discriminates between detected and undetected cases of infection, either asymptomatic or symptomatic, and also between different severities of illness, having a group for moderate or mild cases and another one for critical cases that require hospitalization in intensive care units. 
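as an illustration of how such compartmental models behave, the normalized sir equations can be integrated in a few lines of code; the following python sketch uses a simple forward-euler scheme, and the parameter values (β = 0.3, γ = 0.1) are illustrative choices, not estimates taken from this paper:

```python
def sir_step(s, i, r, beta, gamma, dt):
    """One forward-Euler step of the normalized SIR model (s + i + r = 1)."""
    new_infections = beta * s * i   # flow S -> I
    recoveries = gamma * i          # flow I -> R
    return (s - new_infections * dt,
            i + (new_infections - recoveries) * dt,
            r + recoveries * dt)

def simulate_sir(beta=0.3, gamma=0.1, i0=1e-4, days=300, dt=0.1):
    """Integrate the SIR model; return the final state and the peak prevalence."""
    s, i, r = 1.0 - i0, i0, 0.0
    peak_i = i
    for _ in range(int(days / dt)):
        s, i, r = sir_step(s, i, r, beta, gamma, dt)
        peak_i = max(peak_i, i)
    return s, i, r, peak_i
```

the same loop structure extends directly to the seir, sird, or seird variants by adding the corresponding compartments and flows between them.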
the authors affirm that the distinction between diagnosed and nondiagnosed is important because nondiagnosed individuals are more likely to spread the infection than diagnosed ones, since the latter are typically isolated, and it can explain misperceptions of the case fatality rate and of the seriousness of the epidemic phenomenon. considering more groups in the sidarthe model than in the seir model allows better discrimination between the different agents involved in the epidemic evolution, as well as a better differentiation of the role played by each one. however, considering more groups implies knowing more rates, probabilities, and constants that determine the dynamics between the groups. many of these values are difficult to know in practice, as is estimating the population of some groups, such as ailing (symptomatic infected undetected). the authors choose these constants and quantities to match the model to the actual data. for the goal of better determining public policies, we believe that some of these groups are not necessary in the model used. in order to better guide the determination of public policies to mitigate the spread of the virus we propose the use of control theory. control theory has been successfully implemented in several areas other than physical systems control, for which it was initially designed, for example in economics and in ecological and biological systems, and many works demonstrate the success of its implementation. of course, regardless of the area, a good control strategy depends on the adequate modeling of the dynamical system to be controlled. the proposal to use control in this epidemic is not new. it was first presented in [shd ]. in that work, the authors use the seir model to show that a simple feedback law can manage the response to the pandemic for maximum survival while containing the damage to the economy. 
all rights reserved. no reuse allowed without permission. the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. this version posted may , . 
the authors illustrate with several examples the benefits of using feedback control, but they neither present the mathematical control laws nor prove the convergence of the trajectories of the closed-loop system. the examples are implemented by means of several computational experiments which illustrate the different strategies proposed. we propose here the use of a simple proportional controller, a standard tool in control theory, to calculate the control action. this variable guides how to determine npis in order to avoid the collapse of the health system while reducing the damage to society and the economy that npis inevitably produce. this section is devoted to modeling the disease adequately. a suitable model should avoid making unnecessary classifications while still providing key data on the behavior of the epidemic. these data include the number of deaths, the maximum number of infected people, and the time at which the maximum infection rate will occur, among other information useful to prevent and reduce the damage produced by the outbreak. the seir model assumes that exposed people have been infected but are not able to transmit the virus before a latency period. we will consider that those people continue to be in the susceptible group s, whereas we consider the group e as people who have been infected but have no symptoms yet and are capable of transmitting the virus. of course, part of this group will present symptoms after an incubation time (moving to the group i) and another part will remain asymptomatic. asymptomatic people who have been diagnosed as positive will also be considered in the group i, so this group includes all known positive cases, symptomatic or not. 
in addition, a critical issue is the number of infected people who need hospitalization, because public policies must try to keep this number lower than the capacity of the health care system in order to avoid its collapse. thus we define an extra group: • hospitalized (h). the infected population who need hospitalization. in the group h we do not differentiate between people hospitalized in mild condition and those in intensive care units (icus), despite the fact that the number of people in the latter subgroup is a critical problem due to the even more limited capacity of icus. figure : rate processes that describe the progression between the groups in the seihrd model. we also consider the population number n as a constant, as the seir model does. the progression of this epidemic can be modeled by the rate processes described in fig. . the proposed seihrd model for the spread of the covid- disease in a uniform population is given by the following deterministic equations, which are presented normalized with respect to the total population n. the groups s, e, i, r, h, and d are the state variables of the dynamical system ( ). they are always nonnegative. the time derivatives Ṙ and Ḋ are also nonnegative, because the numbers of recovered and dead people cannot decrease, whereas Ṡ is always nonpositive, because we consider that recovered people cannot be reinfected. this fact is represented in fig. : the states r and d only have input arrows and the state s only has an output arrow. the model ( ) is a nonlinear system normalized with respect to the population n, considered as a constant. hence s + e + i + h + r + d = 1. the rate processes are modeled as follows. 
• αse and βsi are the transmission rates of the virus between the susceptible and the exposed population (respectively, the infected population). α and β are the probability of disease transmission in a single contact with exposed (infected) people times the average daily number of contacts per person, and have units of /day. typically, α is greater than β, assuming that people tend to avoid contact with subjects showing symptoms or diagnosed as positive. contacts between susceptible people and hospitalized people are neglected, except for healthcare workers. the probability of contagion from dead people is also neglected, despite the fact that some cases were recently reported. of course, recovered people are no longer able to transmit the virus. • u ∈ [0, 1] is the effectiveness of nonpharmaceutical public health interventions (npis). u = 0 means no intervention and the epidemic grows completely free, whereas u = 1 implies total elimination of the disease spread. • ν is the vaccination rate, at which susceptible people become unable to be infected. unfortunately, in the covid- case ν = 0 as yet. • p is the probability that exposed people develop symptoms, γ⁻¹ is the average period to develop symptoms, and ζ⁻¹ is the average time to overcome the disease while staying asymptomatic. • p is the probability that infected people with symptoms require hospitalization, δ⁻¹ is the average time between infection and the need for hospitalization, and η⁻¹ is the average time in which infected people recover without hospitalization. • p is the probability that hospitalized people die; the average time between hospitalization and death is given by the inverse of the corresponding rate, and µ⁻¹ is the average time to recover after hospitalization. the parameters used in ( ) are not very precisely determined and even differ greatly in the literature consulted (see [flng + , lgwsr , kan , lzg + , gbb + , jon , org] among many other references). 
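collecting the rate processes listed above into a single right-hand-side function gives a compact sketch of the model. this is one plausible reading of the deterministic equations ( ) based on the description in the text, with hypothetical parameter names (p1, p2, p3 for the three probabilities) and purely illustrative numerical values in the usage below:

```python
def seihrd_rhs(state, params, u):
    """One plausible reading of the SEIHRD rate processes described in the text
    (normalized states; u in [0, 1] scales the two transmission terms)."""
    s, e, i, h, r, d = state
    alpha, beta, nu, p1, gamma, zeta, p2, delta, eta, p3, eps, mu = params
    new_infections = (1.0 - u) * (alpha * s * e + beta * s * i)
    ds = -new_infections - nu * s                                  # S leaves by infection or vaccination
    de = new_infections - p1 * gamma * e - (1 - p1) * zeta * e     # E -> I (symptoms) or E -> R (asymptomatic)
    di = p1 * gamma * e - p2 * delta * i - (1 - p2) * eta * i      # I -> H or I -> R
    dh = p2 * delta * i - p3 * eps * h - (1 - p3) * mu * h         # H -> D or H -> R
    dr = nu * s + (1 - p1) * zeta * e + (1 - p2) * eta * i + (1 - p3) * mu * h
    dd = p3 * eps * h
    return (ds, de, di, dh, dr, dd)
```

because every outflow of one compartment is an inflow of another, the six derivatives always sum to zero, which is the code-level counterpart of the constant normalized population.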
most of the models adopted in the references adjust these parameters to match real data from different countries. it must be taken into account that some of these parameters, mainly α and β, are not independent either of the populations and their general health status or of their actions. the parameters α and β are related to the basic reproduction number r , defined as the expected number of secondary cases produced by a single (typical) infection in a completely susceptible population [jon ]. r is not a fixed number, depending as it does on such factors as the density of a community, the general health of its populace, or its medical infrastructure [shd ]. this is the most important parameter for understanding the spread of an epidemic. if r > 1, the epidemic grows and the number of infected people increases. if r < 1, the epidemic decreases and after a certain time disappears, when a large enough number of people acquire antibodies and the so-called herd immunity occurs. in the actual covid- disease, r was determined to be . in wuhan, china [shd ] (between . and . according to [slx + ]), ranging from . to . in italy [shd ] and even close to . [lgwsr ]. an important remark is that many works consider r as depending on the npis, admitting that these actions tend to reduce this number because the contact rates between people decrease. note that npis always occur, even in countries where no government action has been taken, because people spontaneously tend to stay at home and to avoid contact with others. this fact explains the disparity of this number between different countries and as reported in the references (see [lgwsr ]). 
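the threshold role of the reproduction number can be checked with elementary generation-by-generation arithmetic: if every infectious case produces r secondary cases on average, the case count is multiplied by r each generation (the values below are illustrative):

```python
def cases_after(re_number, n0, generations):
    """Expected number of cases after some infection generations, assuming
    each infectious case produces re_number secondary cases on average."""
    n = n0
    for _ in range(generations):
        n *= re_number  # each generation multiplies the case count by re_number
    return n
```

with a reproduction number above one the count grows without bound; below one it shrinks toward zero, which is the extinction of the epidemic described in the text.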
here, we consider r as a constant reproduction number in the absence of any external action, i.e., as if the disease could spread completely free, which, of course, is an unrealistic scenario. specifically, the relation between the rates α and β and r can be calculated in model ( ) as in [jon , gbb + ], resulting in an explicit expression for r in terms of α, β, and the transition rates of the model. in section we propose a pair of values for the parameters α and β to evaluate different epidemic scenarios. the effectiveness of the npis is considered in the variable u, which determines the rate at which susceptible people become exposed. several works [lzg + , tbt + , gv , ssb ] consider these parameters as time dependent, because they incorporate in these parameters the impact of governmental actions among other npis. the incubation period is estimated as γ⁻¹ = . days [kan , flng + , twl + ]. the probability of developing symptoms p will be roughly estimated. the period to overcome the disease without presenting symptoms is ζ⁻¹ = . days (deduced from [gbb + ]). the infectious period with no need of hospitalization is widely accepted as days, so η = / . the probability of needing hospitalization after infection is p = % [low , lzg + , ssb ], and the time from symptom onset to hospitalization is δ⁻¹ = . days [slx + ]. the probability of dying after hospitalization is p = % according to [wm , flng + ], and the average time to death is . days [slx + ]. the average time to recovery after hospitalization is µ⁻¹ = days [flng + ]. finally, as noted above, there is no vaccine against this disease, so ν = 0. remark . of course, most of these parameters are subject to large inaccuracies, and they differ greatly in the literature consulted. however, as we will show below, the proposed control method is robust to such uncertainties as well as to measurement errors, characterized as unreported or undiagnosed cases and inaccuracies in the group quantities. we propose the use of control theory to determine public nonpharmaceutical interventions (npis) in order to control the evolution of the epidemic, avoiding the collapse of health care systems while minimizing harmful effects on the population and on the economy. as noted in [shd ], "a properly designed feedback-based policy that takes into account both dynamics and uncertainty can deliver a stable result while keeping the hospitalization rate within a desired approximate range. furthermore, keeping the rate within such a range for a prolonged period allows a society to slowly and safely increase the percentage of people who have some sort of antibodies to the disease because they have either suffered it or they have been vaccinated, preferably the latter". the action law is given by the control variable u in ( ). no intervention from the public health agencies means u = 0, and the disease evolves naturally without control. on the other hand, u = 1 means the total impossibility of transmitting the virus, which, of course, is an unrealistic scenario. there are several possible choices of the reference signal, or set point, of the control system. one of them may be a number of hospitalized people small enough not to exceed the capacity of the intensive care units (icus) available in the health care system. this reference signal may be nonconstant: it can go up because of an increase in available beds due to capacity additions in the health care system, the creation of provisional field hospitals, and other similar measures. 
on the other hand, we must bear in mind that the quantities of each group described in ( ) are subject to large inaccuracies, due to unreported or undiagnosed cases, except for the number of people diagnosed as positive (i), which is quite well known, the number of hospitalized people (h), and, of course, the number of deaths (d). for that reason the output variable to be fed back can only be the infected population i or the hospitalized population h. hence, the goal of the control action is to keep the number of hospitalized people lower than the set point while minimizing the external intervention, which produces social discomfort and clearly harms the economy. therefore, the control action should aim to solve the following constrained optimization problem: minimize the accumulated intervention ∫ u(t) dt over [0, t], subject to h(t) remaining below sp, where t is the considered period and sp is the reference signal (or set point, in the case it is considered as a constant). as a reference, the world health organization recommends a number of hospital beds per population, which means an index of . , or . %. this number will be used as the sp of the closed-loop control system. however, we must bear in mind that npis impact the physical contacts between susceptible and infected or exposed people. if an individual is already infected, hospitalization will be required after at most δ⁻¹ = . days, or after δ⁻¹ + γ⁻¹ = . days on average if the infection was recent. hence, there exists a delay between the adoption of npis and their consequences on the hospitalization of people. if the control action is calculated based only on the number of hospitalized people, in the following . days too many people may require hospitalization, exceeding the capacity for medical care. in control jargon, it means that there are almost two weeks with the system operating in open loop. therefore, the control action needs to be calculated as a function of the number of infected people i (the number of exposed people e is quite unknown) in order to avoid future hospitalization requirements in the next . 
days at most. this strategy is known as predictive control. fig. shows the closed-loop control system. the process variable is the infected population i and the control signal is the effectiveness of the npis u. of course, in practical situations it is necessary to determine which actions, and at what level, correspond to a certain effectiveness of npis, but this issue is outside the scope of this paper. next, we show the results of different npi strategies applied to the seihrd model. in this first series of experiments, we apply a constant control action u, that is, the system shown in fig. is an open-loop one. we consider as initial conditions i = e = . , h = r = d = , so s = . , that is, . % of the population is diagnosed as positive on the first day and . % of the population is asymptomatic infected. during the first days of the epidemic, it was logical to consider that both exposed and infected people could spread the virus at the same rate, because the contagion between humans was not known. then, this disease could spread in a completely free scenario, in which no action is taken. this scenario has been called "naif" by several authors [ssb , lzg + ]. using the expression ( ) with r = . as in [ssb , lzg + ], and assuming that no actions are taken during the epidemic, then α = β = . . the evolution of exposed, infected, hospitalized, and dead people in this case is shown in fig. . in this "naif" scenario, and using as initial condition one infected and one exposed person for different population values (n > , ), the maxima are always . % for the exposed and . % for the infected, and the times when these maxima are reached depend on the population value n, as shown in fig. . the delay between both maxima is a constant value of days. 
figure : population of the exposed, infected, hospitalized, and dead groups with no npi ("naif" scenario). 
additionally, the number of dead people forecast by this model is about . % of the total population. clearly, this "naif" scenario seemed to be unrealistic, since people tend to avoid contact with subjects showing symptoms or diagnosed as positive due to the severity of the covid- disease. in consequence, as we stated before, in a more realistic scenario α is greater than β. in the rest of this paper we consider β = α/ to take this assumption into account. fig. shows the areas of every group over time in the case with no npi actions, for illustrative purposes (with β = α/ ). fig. shows the evolution of the hospitalized group under different constant npi effectiveness values u and the proposed sp. table reports some results extracted from these simulations. 
figure : population of the hospitalized group with no npi (blue), with an npi of % effectiveness (yellow), with an intervention of % effectiveness applied weeks after the appearance of the first case (light blue), and the sp (red). β = α/ . 
the results presented in table show that, if no mitigation policy is adopted (u = 0), approximately % of the population will be infected and . % will die. 
on the other hand, a relatively mild npi, of only % effectiveness, is efficient in reducing the final number of deaths as well as the maximum number of hospitalized people, which is a crucial issue in order not to collapse the health system (the maximum value of h reaches the sp ). moreover, a late application of this strategy, weeks after the first case arose, also significantly reduces these numbers. in this section, we simulate the behavior of the trajectories described by the normalized system ( ) subject to a proportional control action. the objective of the control action is that the number of hospitalized people does not exceed the number of available beds. of course, this number is highly variable across countries, and can be increased during the epidemic with the construction of field hospitals, among other resources. on the other hand, as noted in sec. , adopting the number of hospitalized people as the feedback variable may lead to an overload of the health system in the following . days, so a predictive control that considers the number of infected people i must be used. not all infected people need hospitalization. most of the symptomatic cases are mild and remain mild in severity [low , slx + ] ( − p = %). so we consider that p = % of infected people will need hospitalization in the following δ⁻¹ = . days. this number, plus the number of people already hospitalized h, must remain below the set point. of course, we neglect the number of beds occupied by patients hospitalized for other diseases. hence, the proportional control variable is chosen as u = k p · (p · i)/(sp − h), where k p is the proportional scalar gain with values in [ , ]. note that if i = , then u = , and there is no need for a public intervention because no one is going to require hospitalization in the following .
days, and with k p = , if a percentage of % of the infected people equals the number of available beds sp − h, then u = , which means that the public intervention must completely avoid the transmission of the virus because all of these people will require hospitalization after δ⁻¹ = . days on average. another point of view is to consider this as a trajectory tracking problem, with a time-dependent reference signal equal to r(t) = sp − h(t). we consider the same initial conditions as in the former series of experiments, i = e = . , i.e., . % of the population is infected and presenting symptoms on the first day, and we suppose that . % of the population is infected and asymptomatic. fig. shows the trajectories of the variables vs. time with a gain k p = . note that the number of people hospitalized is always smaller than the set point. fig. shows the control signal vs. time. the control signal presents a maximum value of . , and the area under the curve of the control signal vs. time is . . of course, the smaller this action, the less the damage to the population and to the economy. the constant control signal equal to . presents an area under the curve equal to , and equal to . when it is applied after weeks (see table ). we must bear in mind that npis are determined by governmental or popular decisions, and can hardly change every day as the control signal calculated by the proportional controller does. thus, we consider the application of npis with the effectiveness shown in fig. . the amplitudes and times of this control signal were obtained from the one shown in fig. . the detail of the state trajectories presented in fig. shows that there are no significant changes in the results. the maximum number of hospitalized people is . , the final number of deaths is . , and the area under the curve of u vs. time is . .
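the exact form of the proportional law is not legible in this text, but the two boundary properties stated here (u = 0 when i = 0, and u = 1 with k p = 1 when p·i equals the remaining capacity sp − h) pin down the following sketch; the default values of p and sp in the usage are assumptions.

```python
# proportional npi law reconstructed from the two boundary properties stated
# in the text; not necessarily the paper's exact expression.

def proportional_npi(i, h, sp, k_p=1.0, p=0.05):
    if h >= sp:                      # no spare beds: maximal intervention
        return 1.0
    u = k_p * (p * i) / (sp - h)     # share of remaining capacity needed
    return min(max(u, 0.0), 1.0)     # effectiveness is bounded in [0, 1]
```

for example, with p = 0.05 and sp = 0.01, zero infected gives u = 0, and i = 0.2 (so p·i exactly fills the free beds) gives u = 1.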
table shows the main results of the application of npis calculated using a proportional controller with different values of the scalar gain k p . in this section we consider the more realistic situation in which the parameters are partially unknown. as mentioned in sec. , there are large uncertainties in the parameters: they diverge considerably across the references consulted and are very different depending on the country studied. in this series of experiments, the parameter α is randomly chosen between . and . , and the parameter β between . and . . the incubation time γ⁻¹ is chosen between and days, the probability of presenting symptoms p between % and %, and the recovery time between and days, for both symptomatic and asymptomatic people. the probability of being hospitalized p is modeled as a gaussian distribution with mean . and standard deviation . . the time to be hospitalized δ⁻¹ is chosen between and days, the probability of dying p between % and %, and the time to die between and days. finally, the recovery time from hospitalization µ⁻¹ is randomly chosen between and days. in addition, we also consider that there exists some noncompliance with the nonpharmaceutical interventions.
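a minimal sketch of these randomized robustness experiments follows: each run draws the model parameters from ranges, and the commanded npi is degraded by a gaussian compliance factor. all numeric bounds below are placeholders, not the paper's values.

```python
import random

def draw_parameters(rng=random):
    """draw one random parameter set; every range is an assumed placeholder."""
    return {
        "alpha": rng.uniform(0.3, 0.5),            # contact rate, infected
        "beta": rng.uniform(0.15, 0.25),           # contact rate, exposed
        "incubation_days": rng.uniform(3, 7),      # gamma^-1
        "p_symptoms": rng.uniform(0.4, 0.7),
        "recovery_days": rng.uniform(7, 14),       # symptomatic or not
        "p_hosp": min(max(rng.gauss(0.05, 0.01), 0.0), 1.0),  # gaussian, clipped
        "hosp_delay_days": rng.uniform(3, 7),      # delta^-1
        "p_death": rng.uniform(0.1, 0.2),
        "death_days": rng.uniform(7, 14),
        "hosp_recovery_days": rng.uniform(7, 14),  # mu^-1
    }

def apply_noncompliance(u, mean_compliance=0.9, sd=0.05, rng=random):
    """scale the commanded npi effectiveness by a gaussian compliance factor
    (mean < 1 models average noncompliance) and clip back to [0, 1]."""
    return min(max(u * rng.gauss(mean_compliance, sd), 0.0), 1.0)
```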
hence, we apply to the system ( ) a control signal with a gaussian distribution whose mean is % of the signal calculated in ( ) and whose standard deviation is %; that is, we assume there is, on average, % noncompliance with the public measures adopted. the initial conditions are again i = e = . and the gain is k p = . fig. shows the trajectories of the states of the model ( ) during the days since the first symptomatic case arose. figure : evolution of every group over time with a proportional control action with gain k p = (left), considering % noncompliance with the npi policies on average; the picture on the right is a zoom of the one on the left; set point equal to . . table reports some results extracted from this series of simulations. the similarity of the results reported in tables and , as well as of the trajectories shown in figs. and , shows that the proportional controller is robust to parameter uncertainties and to some noncompliance with the npis, which, of course, always occurs in practice. the proportional controller proposed to guide the adoption of npis has shown its efficiency in keeping the number of hospitalized people below a set point given by the health system capacity. moreover, this very simple strategy is robust to parameter uncertainties and to some level of noncompliance with the public measures. figure : control signal over time using a proportional controller, considering % noncompliance with the npi policies on average.
the blue curve is the one calculated in ( ), and the red curve is the control signal considering the random noncompliance. the control signal calculated by this method aims to guide the adoption of npis so as to minimize the social impact and the economic damage. as an example, the argentine government recently relaxed some restrictions adopted in the quarantine period, allowing more economic and recreational activities in some cities. the only criterion used to adopt this measure was the number of days in which the number of infected people doubled (the so-called doubling time). even though this decision can also be considered a closed-loop control action, the criterion adopted is somewhat improvised. an open question is how to translate the rate of effectiveness of the npi calculated by the controller into concrete actions adopted by governments or public health authorities. moreover, we must bear in mind that these measures cannot be varied continuously over time, as the control signal is; they are decisions that will remain valid for at least a few days. however, although this issue is outside the scope of this paper, some decisions can be changed every day, for example the number of individuals with permission to leave their homes or the number of people allowed into a store, among other small decisions that can change daily according to what the control variable suggests.
simcovid: an open-source simulation program for the covid- outbreak
impact of nonpharmaceutical interventions (npis) to reduce covid mortality and healthcare demand
modelling the covid- epidemic and implementation of population-wide interventions in italy
análisis del covid- por medio de un modelo seir [analysis of covid- by means of an seir model]
notes on r
modeling and control of a campus covid- outbreak
annelies wilder-smith and joacim rocklöv. the reproductive number of covid- is higher compared to sars coronavirus
coronavirus: what are asymptomatic and mild covid- ?
a conceptual model for the coronavirus disease (covid- ) outbreak in wuhan, china with individual reaction and governmental action
estimating the asymptomatic proportion of coronavirus disease (covid- ) cases on board the diamond princess cruise ship
global health - seir model
world health organization. coronavirus disease (covid- )
how control theory can help us control covid-
the novel coronavirus, -ncov, is highly contagious and more infectious than initially estimated
a mathematical description of the dynamic of coronavirus disease (covid- ): a case study of brazil
an updated estimation of the risk of transmission of the novel coronavirus ( -ncov)
effect of changing case definitions for covid- on the epidemic curve and transmission parameters in mainland china: a modelling study
covid coronavirus pandemic

key: cord- - hkoeca authors: furstenau, tara n.; cocking, jill h.; hepp, crystal m.; fofanov, viacheslav y. title: sample pooling methods for efficient pathogen screening: practical implications date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: hkoeca
due to the large number of negative tests, individually screening large populations for rare pathogens can be wasteful and expensive. sample pooling methods improve the efficiency of large-scale pathogen screening campaigns by reducing the number of tests and reagents required to accurately categorize positive and negative individuals.
such methods rely on group testing theory, which mainly focuses on minimizing the total number of tests; however, many other practical concerns and tradeoffs must be considered when choosing an appropriate method for a given set of circumstances. here we use computational simulations to determine how several theoretical approaches compare in terms of (a) the number of tests, to minimize costs and save reagents, (b) the number of sequential steps, to reduce the time it takes to complete the assay, (c) the number of samples per pool, to avoid the limits of detection, (d) simplicity, to reduce the risk of human error, and (e) robustness, to poor estimates of the number of positive samples. we found that established methods often perform very well in one area but very poorly in others. therefore, we introduce and validate a new method which performs fairly well across each of the above criteria, making it a good general-use approach. for targeted surveillance of rare pathogens, screenings must be performed on a large number of individuals from the host population to obtain a representative sample. for pathogens present at low carriage rates of % or less, a typical detection scenario involves testing hundreds to thousands of samples before a single positive is identified. although advances in molecular biology and genomic testing techniques have greatly lowered the cost of testing, the large number of negative results still makes large-scale screening wasteful, with enormous numbers of specimens tested in order to detect just a few thousand cases. the large number of negative tests struck dorfman as being extremely wasteful and expensive, and he proposed that more information could be gained per test if many samples were pooled together and tested as a group [ ]. if the test performed on the pooled samples was negative (which was very likely), then all individuals in the group could be cleared using a single test.
if the pooled sample was positive, it would mean that at least one individual in the sample was positive, and further testing could be performed to isolate the positive samples. this procedure had the potential to dramatically reduce the number of tests required to accurately screen a large population, and it sparked an entirely new field of applied mathematics called group testing. due to practical concerns, dorfman's group testing approach was never applied to syphilis screening, because the large number of negative samples had a tendency to dilute the antigen in positive samples below the level of detection [ ]. despite this, sample pooling has proven to be highly effective when using a sufficiently sensitive, often pcr-based, diagnostic assay. in fact, ad hoc pooling strategies have long been used to mitigate the costs of pathogen detection in disease surveillance programs. for example, surveillance of mosquito vector populations in the u.s. involves combining multiple mosquitoes of the same species (typically - ) into a single pool prior to testing for the presence of viral pathogens [ ] [ ] [ ] [ ]. elsewhere, such pooling techniques have been successful in reducing the total number of tests in systems ranging from birds [ ], to cows [ ], to humans [ ] [ ] [ ]. in many wildlife/livestock surveillance programs, sample pooling is used to simply determine a collective positive or negative status of a population (e.g. a herd or flock) without identifying individual positive samples. while this is often appropriate and sufficient for small-to-medium scale research experiments or surveillance programs, a well-designed pooling scheme can easily provide this valuable information with little additional cost. for the purposes of this paper, we will focus on pooling methods that provide accurate classification of each sample so that infected individuals can be identified.
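dorfman's pool-then-retest procedure described above can be sketched directly in code. the group size and sample count in the usage are arbitrary illustrations; the text notes below that the optimal group size depends on the expected number of positives.

```python
# dorfman's two-stage scheme: test pools; individually retest every member of
# a positive pool. `samples` is a list of booleans (True = positive).

def dorfman_two_stage(samples, group_size):
    positives, tests = [], 0
    for start in range(0, len(samples), group_size):
        group = samples[start:start + group_size]
        tests += 1                     # stage 1: one test per pooled group
        if any(group):                 # stage 2: retest a positive pool
            tests += len(group)
            positives += [start + j for j, s in enumerate(group) if s]
    return positives, tests
```

with 96 samples in pools of 12 and a single positive, this takes 8 + 12 = 20 tests instead of 96; an all-negative plate is cleared in 8 tests.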
group testing theory primarily focuses on minimizing the number of tests required to identify positive samples, and many nearly-optimal strategies for sample pooling have been described. from a combinatorial perspective, a testing scheme begins by examining a sample space which includes all possible arrangements of exactly k positive samples among n total samples. because the positive samples are indistinguishable from negative samples, a test must be performed on a sample or a group of samples in order to determine their status. the test is typically assumed to always be accurate, even when many samples are tested together (in practice, this is often not the case, and approaches that consider test error and constraints on the number of samples per pool have been examined [ , ]). in the worst case, all of the samples would need to be tested individually, requiring n tests. the goal of group testing is to devise a strategy which tests groups of samples together in order to identify the positive samples in fewer than n tests. group testing methods are generally more efficient when positive samples are sparse. as the number of positive samples increases, the number of tests will eventually exceed individual testing for all of the methods; this point has been previously estimated to occur roughly when the number of positives is greater than n for sufficiently large n [ , ]. in order to establish the most suitable testing procedure, we consider non-adaptive and adaptive pooling approaches and, in each case, we assume that the test applied to the pools is noiseless (the test will always be positive if a positive sample is present in the pool and negative otherwise) and that it produces only a binary or two-state outcome (e.g. positive/negative or biallelic snp typing). dna sudoku is a non-adaptive approach designed to minimize the number of times any two samples are included in the same pool [ ]. this is achieved by staggering the samples that are added to each pool in different-sized windows or intervals (fig ); importantly, the window sizes must be greater than √n and co-prime to minimize the intersections between samples. the number of different pooling windows (the weight) should be one greater than the expected number (upper bound) of positive samples, w = k + 1, to ensure accurate results.
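the staggered-window construction just described can be sketched as follows. the sample count (96) and window sizes (11 and 13) in the usage are assumptions, chosen to be co-prime and greater than √96.

```python
# dna-sudoku-style pool construction: under a window of size w, sample s
# joins pool s % w; pool ids from different windows are offset so they do
# not collide. windows should be pairwise co-prime and > sqrt(n).

def sudoku_pools(n, windows):
    pools, offset = {}, 0
    for w in windows:
        for s in range(n):
            pools.setdefault(offset + s % w, set()).add(s)
        offset += w
    return pools  # pool id -> set of sample indices

def decode_single(pools, positive_pool_ids):
    """with a single positive sample, it is the one sample that appears in
    every positive pool (the intersection of those pools)."""
    return set.intersection(*(pools[i] for i in positive_pool_ids))
```

with 96 samples and windows (11, 13), the design uses 24 pools; a lone positive sample is recovered exactly because the co-prime windows give a unique residue pair below 11 × 13 = 143.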
this is achieved by staggering the samples that are added to each pool in different sized windows or intervals ( fig ) ; importantly, the size of the windows must be greater than √ n and co-prime to minimize the intersections between samples. the number of different pooling windows (the weight) should be one greater than the expected number (upper bound) of positive samples, w =k + , to ensure accurate results. fig . dna sudoku pooling example. in this example, there are a total of n = samples. the -well plates show which samples are combined into each pool for the two different window sizes (w = and w = which are greater than √ n and co-prime). by using two different window sizes, the weight of this pooling design is w = meaning that k = w − = positive sample can be unambiguously identified in a single step using t = w + w = tests. the positive samples are decoded by finding the samples that appear most often in the positive pools. for example, if g is the only positive sample, we can detect this from the pooling results by noticing that g was added to both of the positive (red) pools while other samples in those pools were added to only one or the other. alternatively, if both g and d are positive, four samples occur with equal frequency (d , g , e , and f ) in the positive pools (red and purple) and it is impossible to determine which are the true positive samples. this ambiguity is introduced because the test was designed to handle only one positive sample. multidimensional pooling is another non-adaptive approach that is generally easier to perform than dna sudoku but can be more prone to producing ambiguous results. as the name implies, this procedure can be extended to many dimensions [ , ] , however it becomes more difficult to perform without robotics when more than two dimensions are used. in the two dimensional ( d) case, n samples are arranged in a perfectly square d grid or in several smaller but still square sub-grids [ ] . 
for example, when testing samples (as in fig ), this could be achieved through a single x grid or several smaller sub-grids. dorfman's original pooling design for syphilis screening was an adaptive two-stage test. following this method, samples are partitioned and tested in g groups of size n. all of the samples in groups with negative results are considered to be negative, and all of the samples in groups with positive results are tested individually. ignoring the constraints of the actual assay, the optimal group size that minimizes the number of tests depends on the number of positive samples, k: there should be roughly √(nk) groups of size √(n/k) [ , ]. dorfman's two-stage approach was later generalized to any number of stages using li's s-stage algorithm [ ], which can further reduce the number of tests. sobel and groll [ , ] introduced several adaptive group testing algorithms based on recursively splitting samples into groups and maximizing the information from each test result. they demonstrated that this class of algorithm is robust to inaccurate estimates of k, particularly in the case of the binary splitting by halving algorithm, which can be performed without any knowledge of the number of positive samples and is the only approach discussed here that does not rely on such an estimate. binary splitting by halving (fig ) begins by testing all of the samples in a single pool. if the test is negative, all of the samples are negative and testing is complete; if the test is positive, the samples are split into two roughly equal groups and only one of the groups is tested in each step. if the tested half is negative, then all of the samples in the tested half are considered to be negative, and at least one positive sample is known to be present in the other, non-tested half of the samples.
if the tested half is positive, then it contains at least one positive sample and no information is gained about the other, untested half. in either case, the method continues by halving and testing whichever group is known to contain a positive sample until a single positive sample is identified (either by individual testing, as seen in step , or by elimination, as seen in step ). once a single positive sample is identified, the remaining unresolved samples (non-grey wells) are pooled and tested to determine whether any positive samples remain, and the process continues until all positive samples are identified. only one test is required per round, and in this example it takes sequential rounds to recover both positive samples. the generalized binary splitting algorithm requires fewer tests [ ]: as the ratio of samples to positive samples (n/k) increases, the number of tests required to identify k positive samples approaches k log(n/k), which is nearly optimal; however, like binary splitting by halving, the generalized binary splitting approach requires many sequential steps to complete testing. here we introduce a new approach that we developed with the goal of finding a good balance between the number of tests, the number of steps, simplicity, and robustness. we found that many of the methods described previously focus on optimizing only one of these features, usually to the detriment of the others. instead of attempting to perform the best in a single area, we take a more balanced approach and find tradeoffs that allow good performance across each of these areas. our modified -stage approach (fig ) is based on the s-stage approach but is modified so that the number of steps is constrained to a maximum of three. at three steps, this approach requires only one step more than ambiguous non-adaptive approaches, which require two steps for complete validation.
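the binary splitting by halving procedure described above can be sketched as a sequential simulation, where each call to the inner test stands for one pooled assay.

```python
# binary splitting by halving: pool everything, then repeatedly test one half
# of whichever group is known to contain a positive. returns the positive
# indices and the number of pooled tests performed.

def binary_splitting(samples):
    def test(idxs):                      # noiseless pooled test
        return any(samples[i] for i in idxs)

    positives, tests = [], 0
    unresolved = list(range(len(samples)))
    while unresolved:
        tests += 1
        if not test(unresolved):         # everything remaining is negative
            break
        group = unresolved               # known to contain a positive
        while len(group) > 1:
            half = group[:len(group) // 2]
            tests += 1
            if test(half):               # positive is inside the tested half;
                group = half             # the other half stays unresolved
            else:                        # tested half cleared; the positive
                unresolved = [i for i in unresolved if i not in half]
                group = group[len(group) // 2:]   # ...is in the other half
        positives.append(group[0])       # found by testing or by elimination
        unresolved = [i for i in unresolved if i != group[0]]
    return positives, tests
```

on an all-negative plate a single pooled test clears everything; with a few positives among many samples, the total stays well below individual testing.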
because the s-stage algorithm is already fairly robust, constraining the number of steps does not have a large impact on the number of tests required. we also modified our method to be simpler and easier to perform by borrowing the recursive subdividing used in the binary splitting approaches. in the s-stage approach, the remaining samples in each step are arbitrarily redivided into pools. not only does this make it difficult to keep track of the remaining samples spread across the plate, it can also make it more difficult to collect the samples for a pool using a multichannel pipette (e.g. step in fig ). instead, we opted to recursively subdivide the samples from positive pools. this makes it easier to keep track of the samples that should be pooled at each stage and, because the samples are always in close proximity, they are easier to collect using a multichannel pipette (compare figs and ). in our comparisons, the number of tests at each step is equal to the number of groups tested at that step. the number of pipettings for a single-channel pipette was equal to the number of samples in each of the pools that were tested. for multichannel pipettes, the number of samples in each pool was divided by the number of channels and rounded up; in cases where the samples in the pool were not in adjacent wells, additional pipettings were required. experimental validation of the modified -stage approach: we set up rare pathogen detection experiments in complex microbiome backgrounds to test our modified -stage approach. we used a total of samples (eight -well plates) that contained a background of µl of dna extraction from cow's milk and µl of molecular-grade water. these samples originated from distinct cow milk samples and were replicated ( replicates each) to fill eight -well plates, for a total of unique microbiome backgrounds. c. burnetii dna ( µl) was added to randomly chosen background samples (∼ .
% carriage rate). we verified that the spike-in was successful using a highly sensitive taqman assay designed to target the is repetitive element in coxiella burnetii [ ]. using the same taqman assay, we also verified that the target pathogen was not present in any of the unique microbiome backgrounds prior to the spike-in. to ensure a consistent amount of background dna, the milk extractions were tested to determine the amount of bacteria with a real-time pcr assay that detects the s gene and compares it to a known standard [ ]. the pooling procedure was carried out by a typical researcher looking to identify the positive samples. the number of sequential steps is one of the major factors that differentiate pooling methods. the major benefit of non-adaptive pooling methods is that, in some cases, all of the tests can be run at the same time, which means that testing can be completed faster. clearly, the non-adaptive tests required the fewest steps, even when the results were ambiguous, necessitating a second round of validation. for samples, the highest weight that we tested was , which meant that any simulation with or more positive samples could produce ambiguous results. the number of samples that are combined in a single pool is a very important practical concern because it can determine whether the assay can produce accurate results. in most methods, using a multichannel pipette reduced the number of pipettings, by an order of magnitude in some cases. compared to the -channel pipette, little changed for the least efficient method, likely because it requires many samples to be pooled at each step for many steps. its performance is slightly improved when multichannel pipettes are used, but it is still the least efficient in many cases. using a single pipette, dna sudoku was not the most inefficient compared to the other methods. however, because the samples combined in each pool are spaced out at different intervals instead of in consecutive groups, the number of pipettings did not improve when using multichannel pipettes.
this means that, in the best case, a laboratory technician would still need to perform a large number of pipettings correctly. table shows that the method is more sensitive to overestimation than to underestimation. of the methods that depended on an estimate of the number of positive samples, the s-stage (fig , left) and our modified -stage approach (fig , second from left) were the most robust to misestimations of k. the number of steps was more robust in the modified -stage approach than in the s-stage due to the -step constraint; however, the modified -stage was more sensitive in the number of tests in some cases (table ). dna sudoku (fig , left) was the most sensitive method overall. overestimating the number of positive samples caused the weight of the pooling design (w = k + 1) to be set higher than it needed to be. when this happened, all of the positive samples were still unambiguously identified, but each unnecessary increase in the weight required more than √n additional tests. when the number of positive samples was underestimated, fewer tests were performed, but the pooling scheme was no longer able to unambiguously identify the positive samples in a single step, and a second round of verification was required. a similar pattern occurred in the d pooling simulations (fig , right). while the grid dimensions did not directly depend on k, larger grids were generally more efficient when the number of positive samples was low, and smaller grids reduced ambiguous results when the number of positive samples was high, but at the cost of many more tests. however, because d pooling was constrained to two dimensions, the number of tests did not vary as drastically as in dna sudoku. although the expected number of positive samples per plate was ∼ given the . % carriage rate, the actual number of positives ranged from to , and none of the plates had exactly positive samples (table ).
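the modified 3-stage procedure evaluated above can be sketched as follows: pool consecutive groups, subdivide only the positive pools, then test individually. the plate size (96) and group sizes (16, then 4) in the usage are illustrative assumptions, not the paper's exact design.

```python
# modified 3-stage sketch: at most three sequential steps, with recursive
# subdividing of consecutive (pipette-friendly) groups.

def modified_three_stage(samples, g1=16, g2=4):
    def test(idxs):                         # noiseless pooled test
        return any(samples[i] for i in idxs)

    tests, positives = 0, []
    stage2 = []
    for start in range(0, len(samples), g1):   # stage 1: groups of g1
        idxs = list(range(start, min(start + g1, len(samples))))
        tests += 1
        if test(idxs):
            stage2.append(idxs)
    stage3 = []
    for idxs in stage2:                        # stage 2: subgroups of g2
        for j in range(0, len(idxs), g2):
            sub = idxs[j:j + g2]
            tests += 1
            if test(sub):
                stage3.append(sub)
    for sub in stage3:                         # stage 3: individual tests
        for i in sub:
            tests += 1
            if test([i]):
                positives.append(i)
    steps = 1 + (1 if stage2 else 0) + (1 if stage3 else 0)
    return positives, tests, steps
```

on a 96-sample plate with one positive, this takes 6 + 4 + 4 = 14 tests in exactly three steps; an all-negative plate is cleared with 6 tests in a single step.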
the taqman assay was able to accurately identify the positive pools without any false positives or false negatives, even during the first step, when the number of samples per pool was the largest at . using an -channel pipette where appropriate, a total of pipettings was required to pool the samples. a total of taqman assays were performed, which is ∼ % fewer than would be required to individually test samples. picking the right pooling approach for a given pathogen surveillance campaign can be a complicated decision, often driven by a set of conflicting constraints and complexities. dna sudoku, however, is far from optimal for monitoring rapidly changing pandemics due to its extreme sensitivity to misestimation of the carriage rate of the pathogen in the population. a good middle ground between the adaptive and non-adaptive pooling approaches is the modified -stage approach, our preference in our own surveillance applications. while it is never the absolute best in any one category, it is always nearly optimal in terms of the number of serial steps ( nd best), complexity ( nd best), and number of tests ( th best), and it is extremely resilient to misestimation of the carriage rate ( nd best). the latter is particularly important, as it allows this approach to be useful for surveillance in situations with rapidly changing pathogen carriage rates (e.g. in pandemic or seasonal outbreaks), while keeping the number of serial steps as low as possible for an adaptive method.
testing pooled sputum with xpert mtb/rif for diagnosis of pulmonary tuberculosis to increase affordability in low-income countries
estimating community prevalence of ocular chlamydia trachomatis infection using pooled polymerase chain reaction testing
impediments to wildlife disease surveillance, research, and diagnostics
evaluating and testing persons for coronavirus disease (covid- )
the detection of defective members of large populations
f. kwang-ming hwang.
combinatorial group testing and its applications
searching for the proverbial needle in a haystack: advances in mosquito-borne arbovirus surveillance
phylogenetic analysis of west nile virus in maricopa county, arizona: evidence for dynamic behavior of strains in two major lineages in the american southwest
detection of west nile virus in large pools of mosquitoes
west nile virus in the united states: guidelines for surveillance, prevention, and control
active surveillance for avian influenza virus infection in wild birds by analysis of avian fecal samples from the environment
pooled-sample testing as a herd-screening tool for detection of bovine viral diarrhea virus persistently infected cattle
sample pooling as a strategy to detect community transmission of sars-cov-
high-throughput pooling and real-time pcr-based strategy for malaria detection
real-time, universal screening for acute hiv infection in a routine hiv counseling and testing population
group testing: an information theory perspective. foundations and trends® in communications and information theory
to pool or not to pool?
guidelines for pooling samples for use in surveillance testing of infectious diseases in aquatic animals
a boundary problem for group testing
sharper bounds in adaptive group testing
dna sudoku: harnessing high-throughput sequencing for multiplexed specimen analysis
discovery of rare mutations in extensively pooled dna samples using multiple target enrichment
screening of a brassica napus bacterial artificial chromosome library using highly parallel single nucleotide polymorphism assays
a two-dimensional pooling strategy for rare variant detection on next-generation sequencing platforms
a sequential method for screening experimental variables
group testing to eliminate efficiently all defectives in a binomial sample
binomial group-testing with an unknown proportion of defectives
a method for detecting all defective members in a population by group testing
rickettsial agents in egyptian ticks collected from domestic animals
bactquant: an enhanced broad-coverage bacterial quantitative real-time pcr assay

key: cord- - vt authors: yadlowsky, s.; shah, n.; steinhardt, j. title: estimation of sars-cov- infection prevalence in santa clara county date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: vt
to reliably estimate the demand on regional health systems and perform public health planning, it is necessary to have a good estimate of the prevalence of infection with sars-cov- (the virus that causes covid- ) in the population. in the absence of wide-spread testing, we provide one approach to infer prevalence based on the assumption that the fraction of true infections needing hospitalization is fixed and that all hospitalized cases of covid- in santa clara are identified. our goal is to estimate the prevalence of sars-cov- infections, i.e. the true number of people currently infected with the virus, divided by the total population size. our analysis suggests that as of march , , there are , infections ( . % of the population) of sars-cov- in santa clara county.
based on adjusting the parameters of our model to be optimistic (respectively pessimistic), the number of infections would be , (resp. , ), corresponding to a prevalence of . % (resp. . %). if the shelter-in-place led to r < , we would expect the number of infections to remain about constant for the next few weeks. however, even if this were true, we expect to continue to see an increase in hospitalized cases of covid- in the short term due to the fact that infection of sars-cov- on march th can lead to hospitalizations up to days later. inference of the prevalence of sars-cov- in the us is complicated by the lack of widespread testing. testing is unreliable for providing prevalence of the disease in the entire population, because tested individuals are not representative of the population at large. tested individuals are selected based on symptoms, and are necessarily a subset of the total number of individuals infected. there may be many individuals with few to no symptoms that do not get tested, but nevertheless are vectors for the sars-cov- virus. testing in the bay area has been reliable enough that most individuals hospitalized for pneumonia or other complications caused by covid- are likely to be tested, and positively identified. therefore, we can use these data, along with the hospitalization rate of covid- estimated from other countries to infer the number of cases in the area that would lead to this level of hospitalization. also, given the hospitalization data, we can estimate the rate of growth of cases, and project this forward to estimate future hospitalizations. given the shelter-in-place order for our area, our hope is that r < , starting on march th. 
if we make the optimistic assumption that there are essentially no new infections after the shelter-in-place order, we still expect hospitalizations to increase for to days, leading to a peak hospital bed demand x to x greater than at the time of the order; our best guess is x, but the precise ratio would require modeling time lag from infection to symptoms to hospitalization more precisely, where we defer to ongoing modeling efforts. our model is to assume that the number of hospitalizations at any point in time is proportional to the number of total infections some number of days before, based on the lag time between infection and hospitalization. therefore, to estimate the number of infections on day t , we use the number of hospitalizations h(t) , and use the formula infections(t) = exp(lag time * exponential growth rate) * h(t) / hospitalization rate . this can be converted to a prevalence fraction by dividing by the population size. note that the hospitalization rate is needed to estimate the total number of infections, but not for forecasting overall hospital bed demand. one might be concerned with the above approach because not all past infections would lead to a hospitalization in lag-time days; only the new infections lead to new admissions. therefore, we should look at the increments in the number of hospitalizations to calculate the number of new infections from the lag-time days prior. then, we can sum up the number of new infections up to a given date to get the cumulative number of infections. it turns out that because of the linearity of the conversion from hospitalizations to infections, these two approaches will give approximately the same answer. therefore, we will stick to the simpler model for the results presented here. as input parameters to our model, we need an estimate of the lag time , and the rate of growth of infections , and hospitalization rate for covid- among those infected. 
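The conversion formula above can be written out directly. This is a minimal sketch, not the authors' code; the function name and the example numbers are illustrative assumptions (the report's actual parameter values are elided in this copy).

```python
import math

def inferred_infections(hospitalized, lag_days, daily_growth, hosp_rate):
    """infections(t) = exp(lag * growth_rate) * H(t) / hospitalization_rate.

    daily_growth is the per-day fractional growth (e.g. 0.20 for 20%/day),
    converted to a continuous exponential rate via log(1 + daily_growth).
    """
    growth_rate = math.log(1.0 + daily_growth)
    return math.exp(lag_days * growth_rate) * hospitalized / hosp_rate

# illustrative only: 50 hospitalized, 12-day lag, 20%/day growth, 5% hosp. rate
print(round(inferred_infections(50, 12, 0.20, 0.05)))  # prints 8916
```

Note the hospitalization rate only rescales the inferred infection count; as the text observes, it cancels out when forecasting hospital bed demand itself.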
as input data, we need the number of hospitalizations , and the size of the population from which those hospitalizations are drawn. for the lag time , we need to combine the incubation time and the time for disease progression to severe symptoms. the median incubation time is estimated to be about days . the time from having symptoms to needing hospitalization is about week, adding up to days. in the chinese data, the lag between the maximum onset proportion at january to the maximum hospitalization at february . cc-by-nd . international license it is made available under a author/funder, who has granted medrxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) the copyright holder for this preprint . is days, matching this estimate . we believe these may be slightly overestimated, and use days in our model. for the rate of growth of infections , we compared two values: the first estimated from the change in hospitalizations from march to march in the santa clara data, and the second calculated from the reported - day doubling time , . the estimate of the rate of growth of infections from hospitalizations gives a . % growth per day and the estimate from chinese data gives - % growth per day. because increases in the number of tests performed (which is growing quickly at about % per day ) affects the number of confirmed cases, relying on the growth of confirmed cases in the santa clara county will likely overestimate the growth rate of infections, so we use this as our upper bound. we can approach estimating the hospitalization rate in three ways. the first is to use the imperial college report, which puts hospitalization at about . % . the second is to use our institution's hospitalization rate among those who test positive for covid- , and adjust for the fact that for many, the disease is mild enough that they do not seek healthcare. if % of cases are mild, we can take the stanford test-hospitalization rate, which is . 
% ( % ci . %, . %) and divide by to get the covid- hospitalization rate of . %. the third approach is to use the hospitalization rate from china , and adjust for the fact that many infections could have been missed. this value is likely an overestimate due to substantial under-reporting in china . the number of hospitalizations is drawn from santa clara county's reports on the number of hospitalizations, using the internet archive wayback machine . therefore, these are drawn from a population of . million people. this population could be larger, or smaller, depending on whether everyone in the county goes to hospitals in the county, and whether these are the only people going to santa clara hospitals. because we do not know many of the parameters exactly, we bracket them between a lower bound and upper bound, and a best guess based on what we know so far. we use each of these (lower bound, upper bound and best guess) to obtain the number of inferred infections and prevalence. the parameters are summarized below in table . our lower bounds are a . % increase in infections per day, a day lag time and a . % hospitalization rate of the infected population. our best guesses are a % increase in infections per day, an day lag time, and a % hospitalization rate. our upper bounds are a % increase in infections per day, a day lag time, and a . % hospitalization rate. we can perform a sensitivity analysis under a variety of sampled estimates of these parameters, drawing uniformly over the ranges specified above, and re-running the analysis times. below, we report the range and quartiles of these analyses. table . parameters of our model, and our optimistic and pessimistic bounds. note that because a lower hospitalization proportion leads to a higher estimate of the number of sars-cov- infections, it is listed in the "upper bound parameters" column. the inferred number of infections for march is , , and the lower and upper bounds are , and , , respectively. these estimates provide a prevalence of . %, with bounds of . % to . % (table ). if the shelter-in-place order worked, this would be the expected maximum prevalence in the area, until people recover. unfortunately, we will not know until about march - whether this is the case, at which point we expect the number of hospitalizations to plateau. the detailed results are in table . table contains the sensitivity analysis, where we consider other combinations of the parameters in the ranges provided and rerun our analysis with randomly selected combinations. the results are similar to those reported above, although they cluster closer to the best guess of parameters. it is unclear whether hospitalizations are a reliable source of data. one thing we noticed is that the counts are small, so directly fitting a model for log(hospitalizations) as a function of days since the outbreak does not give a very good estimate of the growth rate over time, because doing so is sensitive to noise. however, while this approach may be more sensitive to noise, it is less sensitive to selection biases, and may therefore serve as a more reliable estimate of prevalence than positive testing rates. there has been some discussion of deploying a randomized testing program. if the prevalence of covid- is between . % and . %, then such a prevalence estimate has implications for the size of the testing program necessary to get a reliable estimate of prevalence. to have at least positive samples, somewhere between , and , randomly selected individuals would need to be tested; achieving good statistical resolution may require many more.
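The bracketing-and-resampling procedure described above amounts to a small Monte Carlo sensitivity analysis: draw each parameter uniformly from its bracket, recompute the inferred infection count, and summarize the spread. The sketch below is hypothetical — the parameter ranges are placeholders (the actual bounds are elided in this copy), and the discrete daily-growth compounding is an assumption.

```python
import random
import statistics

def inferred_infections(hospitalized, lag_days, daily_growth, hosp_rate):
    # infections = (1 + daily growth)^lag * hospitalizations / hospitalization rate
    return (1.0 + daily_growth) ** lag_days * hospitalized / hosp_rate

def sensitivity(hospitalized, n_draws=2000, seed=0):
    """Redraw parameters uniformly from assumed brackets; report range and quartiles."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        growth = rng.uniform(0.065, 0.30)   # assumed per-day growth bracket
        lag = rng.uniform(10.0, 14.0)       # assumed lag bracket, in days
        rate = rng.uniform(0.005, 0.05)     # assumed hospitalization-rate bracket
        draws.append(inferred_infections(hospitalized, lag, growth, rate))
    q1, med, q3 = statistics.quantiles(draws, n=4)
    return min(draws), q1, med, q3, max(draws)
```

Because a lower hospitalization rate divides the same hospitalization count, it pushes the inferred infection count up — the reason it sits in the "upper bound" column of the paper's table.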
however, doing at least , tests would help us to identify whether we are in the right ballpark in terms of prevalence. one question that comes up from these analyses is when we would expect the number of infections to plateau after the shelter-in-place order, if the order were to stop or reduce the spread to below exponential growth. such an order should immediately affect the number of infections that we projected here, so that the number of infections does not grow beyond that of march th, and starts to dwindle after the - day course of the virus infection. however, because of the lag between infection and hospitalization, we expect the number of new hospitalizations to continue to increase for another days. the lag time varies between and days ; therefore, we would expect to see a slight change in the rate of increase of hospitalizations in about week from march th; the number of hospitalizations will still continue to increase and only the rate of increase will slow. due to such large variance in the number of days at which people present at the hospital, reading too much into the day-to-day numbers and reactively changing policy before the days after the shelter-in-place may not be appropriate. the incubation period of coronavirus disease (covid- ) from publicly reported confirmed cases: estimation and application clinical characteristics of coronavirus disease in china nowcasting and forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study early release -risk for transportation of novel coronavirus disease from wuhan to other cities in china
the covid tracking project impact of non-pharmaceutical interventions (npis) to reduce covid- mortality and healthcare demand characteristics of and important lessons from the coronavirus disease (covid- ) outbreak in china substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov ) correcting under-reported covid- case numbers we thank robert tibshirani for his help with running the sensitivity analysis, and providing useful comments on this report. we also thank various people in the stanford medicine community for their help curating the reports and papers that we cite for selecting the parameters in our model. key: cord- -xfi p authors: trapman, pieter; meester, ronald; heesterbeek, hans title: a branching model for the spread of infectious animal diseases in varying environments date: - - journal: j math biol doi: . /s - - - sha: doc_id: cord_uid: xfi p this paper is concerned with a stochastic model, describing outbreaks of infectious diseases that have potentially great animal or human health consequences, and which can result in such severe economic losses that immediate sets of measures need to be taken to curb the spread. during an outbreak of such a disease, the environment that the infectious agent experiences is therefore changing due to the subsequent control measures taken. in our model, we introduce a general branching process in a changing (but not random) environment. with this branching process, we estimate the probability of extinction and the expected number of infected individuals for different control measures. we also use this branching process to calculate the generating function of the number of infected individuals at any given moment. 
the model and methods are designed using important infections of farmed animals, such as classical swine fever, foot-and-mouth disease and avian influenza as motivating examples, but have a wider application, for example to emerging human infections that lead to strict quarantine of cases and suspected cases (e.g. sars) and contact and movement restrictions. recent outbreaks of infectious diseases of animals (e.g. classical swine fever (csf), foot and mouth disease (fmd) and avian influenza (ai)) in western europe have had great impact on the economy, public life and animal health and welfare in the countries involved. during such an outbreak one would like to be able to compare the effectiveness of proposed control measures in, for example, their ability to reduce the expected final size and the expected duration of the outbreak. typical for the strategies aimed at stopping outbreaks of important diseases of farm animals, is that infected herds are removed from the population by culling upon detection. a second characteristic is that due to increasing quantity and quality of the imposed control measures, the environment that the infectious agent experiences, is changing. by this we mean that consecutive measures can make, for example, contact opportunities between herds different in different phases of the outbreak, or can make the infectious period, or rate with which infectivity is produced, differ for farms infected at different times. in most cases, mathematical methods for computing outbreak characteristics such as expected final size and expected duration, assume a constant environment in that the control measures are not compounded in time and do not lead to changes in the rates that govern epidemic spread (see e.g. [ ] ). 
in this paper we aim to develop stochastic methods, based on branching processes, which allow us to compare the effectiveness of control strategies during such outbreaks in situations where the environment is varying because of changes in subsequent measures of control. much work has already been done to describe the spread of classical swine fever (see e.g. [ , , ] ) and foot and mouth disease (see e.g. [ , , , ] ). in this paper, we will model the spread of infections in a much more analytic way than is done in earlier models [ , , , ] . we use an iterative method that computes properties of the spread, like the probability of a major outbreak of the infection and the final size of an epidemic (i.e. the total number of infected herds) very efficiently. furthermore we can derive some properties of the duration of the epidemic. we allow for different types of herd. in our model, it is essential that once the infection in a herd is detected, the whole herd will be culled. therefore, our main interest is the number of infected herds, but the number of infected animals in an infective herd is important for determining the infectivity of a herd and the distribution of the detection times. we therefore model the spread of the infection on two levels, namely the spread of the infections within a herd and the spread of infection between herds; both are described by a stochastic process. we use a special branching process to describe the spread of the infection among herds. the parameters of this branching process depend on the time since the infection of the herd and on the environment, which is determined by the real time. using branching processes to describe epidemics is of course not new (see e.g. [ ] ), but no theory exists that gives short-term predictions (as opposed to asymptotics) for general branching processes in varying environments, with an age-dependent birth rate. for computations, it is necessary that after a certain moment the environment is constant. 
to achieve this we assume that after some time no new measures will be taken and the effects of all measures taken in the past will either be constant or absent. in other words, although the values of the parameters may differ from those before certain measures were taken, the values are assumed to be constant after a given moment in time. our model can be used to predict the effects of various control measures and strategies during an ongoing outbreak. meester et al. gave a method to estimate the parameters from the data available during an outbreak, [ ] . we consider measures like single vaccination of all herds of a certain type, a total transport ban or killing of animals just after birth. some of these measures cause a varying environment. the fraction susceptible animals in a herd or the fraction susceptible herds of the total number of herds may be varying and so the infection rate may vary in time as well. in fighting outbreaks, additional measures will be implemented as soon as present measures turn out to be insufficient. a change in measure will change the values of the parameters. we assume that the changes in the environment are deterministic. we develop the theory using a classical swine fever outbreak in herds of pigs as motivating example throughout. in our paper we use the same input data as klinkenberg et al. [ ] . as mentioned in the introduction, we first need to model the spread of the infection within one herd, since the infectivity of an infective herd, and also the time at which the infection is detected in a certain herd, depend on the number of infected animals in that herd. we use t for the time elapsed since the first measures were implemented. the variable τ is used for describing the spread of the infection within the herds, it is the time since infection of a particular herd and therefore relative. the τ -clock starts ticking at the moment the first animal in the herd becomes infected. 
from now on, we will call τ the "infection-age" or "age" of the herd. we use only four types of disease-related parameters: µ: the recovery rate of individual animals in a herd. λ: the infection rate of individual animals within a herd. α: the per capita detection rate of infected animals within a herd. β: the rate at which one infected animal infects susceptible herds. we assume that as soon as an infection at a particular herd is detected, the whole herd will be culled instantaneously. further we assume that the rate of detection is proportional to the number of infected animals at infection-age τ, I(τ), i.e. the detection rate is αI(τ), where α does not depend on the infection-age. we assume that the infection in a herd develops as an autonomous process until detection. the number of infected animals in a herd therefore depends only on the infection-age τ, and not on the absolute time t. we describe the number of infective animals by an ordinary birth and death process, writing p_i(τ) for the probability of i infective animals in a herd at infection-age τ and δ_ij for the kronecker delta function. we have (see e.g. [ ]): d p_i(τ)/dτ = λ(i−1) p_{i−1}(τ) − (λ+µ) i p_i(τ) + µ(i+1) p_{i+1}(τ), with p_i(0) = δ_{i1}. solving these differential equations leads to, in particular, p_0(τ) = µ(e^{rτ} − 1)/(λe^{rτ} − µ), where r = λ − µ and R = λ/µ is the reproduction ratio. we assume λ > µ, otherwise the infection will typically only cause a minor outbreak within a herd, and will therefore typically not infect other herds. hence, conditioned on the event that the epidemic in a herd does not go extinct before age τ, i.e. I(τ) > 0, I(τ) is geometrically distributed with parameter (1 − R p_0(τ)) = (R − 1)/(R e^{rτ} − 1), which is small for large τ. a geometric random variable with small parameter can be approximated by an exponential random variable with the same parameter. therefore, for large τ, I(τ) can be approximated by H̃(τ), where H̃(τ) = H(e^{rτ} − 1/R) and H is an exponential random variable with parameter (R − 1)/R. for large τ, the term 1/R is negligible compared to e^{rτ}.
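As a sanity check on the conditional-geometric result, one can simulate the linear birth-and-death process directly. This is a sketch under the stated rates, not from the paper: conditioned on survival to age τ, the sample mean of I(τ) should approach the geometric mean (Re^{rτ} − 1)/(R − 1).

```python
import math
import random

def birth_death_size(lam, mu, tau, rng):
    """Gillespie simulation of a linear birth-death process started from one
    infective: birth rate lam*i, death rate mu*i, run until time tau."""
    t, i = 0.0, 1
    while i > 0:
        t += rng.expovariate((lam + mu) * i)  # waiting time to next event
        if t > tau:
            break                              # event falls after tau; stop
        i += 1 if rng.random() < lam / (lam + mu) else -1
    return i

def conditional_mean(lam, mu, tau, n=20000, seed=1):
    """Monte Carlo estimate of E[I(tau) | I(tau) > 0]."""
    rng = random.Random(seed)
    sizes = (birth_death_size(lam, mu, tau, rng) for _ in range(n))
    survivors = [s for s in sizes if s > 0]
    return sum(survivors) / len(survivors)

# theory: E[I(tau) | I(tau) > 0] = (R e^{r tau} - 1) / (R - 1),
# with r = lam - mu and R = lam / mu
```

For example, with λ = 2, µ = 1, τ = 2, the theoretical conditional mean is (2e² − 1)/1 ≈ 13.8, and the simulation agrees to within sampling error.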
so we use the approximation I(τ) = H e^{rτ}, where the equality is in the distributional sense. we can interpret this approximation as follows. the random variable H represents the random character of the start of the outbreak in a herd, when only a few animals are infective. if the disease does not go extinct, then after the initial phase, there are many infected animals in the herd, each of which causes an independent number of new infections per time unit. according to the law of large numbers, this means that the growth rate is eventually almost deterministic. we also use this approximation for small τ. we use the word infective for animals or herds that are able to spread the infection. we use the notation I(τ; h) for the number of infective animals in a particular herd, with H = h given. . we assume that the infection and recovery rates of individual infected animals are independent of time and age. . it is possible to use alternatives for the birth and death process to describe the spread within a herd, for example the contact process. the contact process is appropriate when we have a situation where all animals are positioned in a row and do not change position. each animal can only infect its two nearest neighbours. we assume that recovered animals with two infective neighbours are re-infected immediately. therefore, we only consider the animals at the edge of a row of infective animals. we still assume that the recovery rate is µ. an infective animal infects each of its susceptible neighbours with rate λ. let the number of infectives be I(τ). then E(I(τ) | I(τ) > 0) = rτ + o(τ). furthermore, we can show that var(I(τ) | I(τ) > 0) = o(τ²). this implies that the ratio I(τ)/τ concentrates around r; so, in contrast to the birth and death model, we may approximate the random variable I(τ) for large τ by a deterministic variable rτ. this makes computations much easier in this case.
in the original model, the number of animals in one herd is assumed to be very large compared to the number of infected animals. therefore, we assume that the contact rate between infected and susceptible animals is constant, and hence so is the birth rate. from data of outbreaks of csf in the past, we can see that the number of infected animals until detection of the infection within the herd is small compared to the total number of animals in the herd, so these assumptions seem justified in this case [ ]. . the within-herd infection and recovery parameters λ and µ can be measured experimentally or from data of past or on-going outbreaks. the parameter α depends on the development of symptoms of infected animals and on how attentive the farmers are. therefore, in reality this α will change at the first detection of the infection in the country (or in neighbouring countries), due to higher awareness of farmers and veterinarians. it is very difficult to estimate α for the period before the time of the first detection. for the time after the first detection, we can estimate α from data of past outbreaks in the same area or try to estimate this parameter during the ongoing epidemic. using data from past outbreaks is dangerous, because the characteristics of the virus and of the farming practice may have changed. estimation of the parameters during an on-going outbreak is done by meester et al. [ ]. this method has some problems, e.g. the time necessary to get enough data for a reliable estimate. another problem is that in [ ] the infectivity of a herd does not depend on the "age" of the herd. this independence of age is essential for the estimations made. for our model it is not necessary that α is constant in time. it is possible to extend the model and use α(t) instead of α. . we assume that culling is the only measure which influences the within-herd spread of the infection. vaccination is assumed to show no effect on the spread in the herd.
for csf this assumption is justified by the fact that vaccination will lead to immunity only after two weeks. therefore, during these first two weeks the spread of the infection is not affected by this measure. for the time after these two weeks, we assume that we can use the same speed of propagation of the infection, for computational reasons. the approximation leads to results that are conservative, that is, too pessimistic. . we do not take into account characteristics of individual animals that might cause the individuals to differ in infectivity, susceptibility or contact pattern. often age and type of species can have a substantial influence. for csf this implies we do not distinguish between the ages of animals in one herd. the detection rate and infectivity of herds with many young animals do not significantly differ from the detection rate and infectivity of herds with mostly older animals [ , ]. . we also use the approximation I(τ) = H e^{rτ} for small τ. this is not correct, but due to the small number of infective animals at small τ, the probability that the disease is detected at small τ is small too. we will see later that the number of infections in that period is small as well, so the overall influence of the events while the herd is 'young' is probably not so large. in this section, we consider classical swine fever as a concrete example. we distinguish between two types of farms: multipliers (M) are, roughly speaking, farms where young piglets are born, and finishers (F) are farms that buy piglets and fatten them. p_M denotes the fraction of multipliers among the total number of herds and p_F is the fraction of finishers. we assume that within either of these types, a birth and death model (with the same parameters for both types) describes the within-herd spread. the infectivity per non-transport contact of both types of herds develops in the same way too. however, transport contacts are only allowed from multipliers to finishers.
therefore, we take a larger infectivity rate for contacts from multipliers to finishers. we define a_ξ(t, τ; h) as the infectivity of a herd at time t, while the herd was infected τ time units ago and with H = h, where ξ is a two-dimensional vector denoting the two types of herds involved in the contact, so that ξ can be FF, FM, MM or MF. when no measures are implemented, a_ξ(t, τ; h), with h a given realisation of H, is proportional to the number of infected animals in the herd: a_ξ(t, τ; h) = β_ξ h e^{rτ}, where β_ξ is a constant depending only on ξ. because the non-transport contacts all happen at the same rate we can define β_M := β_FM = β_MM. further we write β_F := β_FF, and β_MF is β_F plus some additional term for infections caused by transport of piglets from multipliers to finishers. because all non-transport infections happen at the same rate, the ratio β_M/β_F is exactly the ratio of the multipliers to the finishers. note that the infectivity does not depend on the absolute time t. we define β by β := β_M + β_F, hence β_M = p_M β and β_F = p_F β. in order to consider transport contacts we also define β_MF = p_F β + β_tr, where β_tr is the proportionality factor of the part of the infection rate that is due to transport contacts. there is no factor p_F in front of β_tr, because all transport contacts are from multipliers to finishers. we assume that the total number of herds is very large; in our computations, we assume it to be infinite. we already know that the detection rate is given by αI(τ; h) = αh e^{rτ}. from this we can deduce, for given h, that the probability p_nd(τ; h) that an infected herd of age τ is not yet detected is p_nd(τ; h) = e^{−(α/r) h (e^{rτ} − 1)}. in the case where no measures are implemented, the expected number of multipliers infected by one infective multiplier up to age τ, for given h, µ_MM(τ; h), is given by µ_MM(τ; h) = (β_M/α)(1 − e^{−(α/r) h (e^{rτ} − 1)}). to see this, we first consider only one type of herd. we prove the following proposition. proposition: the number of herds infected by one infective herd during its entire infectious period is D − 1, where D is a geometric random variable with parameter α/(α + β); in particular, this number depends neither on h nor on the age of the herd. proof.
the infection and the detection rate are both proportional to the number of infective animals in a herd. the ratio of infection rate and detection rate is given by β/α. we call the detection of the herd and the infections by this herd events. the probability that an event is an infection is β/(α + β) and a detection α/(α + β). therefore, the number of events, including the detection, is described by a geometric random variable with parameter α/(α + β). hence the direct offspring distribution does not depend on the size at the start of the process. moreover, the same holds for the offspring of this direct offspring. in the same way we can prove that the size of the future offspring of an infective herd is independent of its age. to prove ( ), consider two types of herds. we note that 1 added to the number of multipliers infected by one infective multiplier is given by a geometric random variable with parameter α/(α + β_M). (1 is added because the final event will be the only detection, and the number of events is described by a geometric random variable.) so the expected number of future infections of an infective multiplier is β_M/α, at all times. the probability that an infective herd is not yet detected at age τ is given by e^{−(α/r) h (e^{rτ} − 1)}. from this and the proposition we deduce that the expected number of infections after age τ is given by (β_M/α) e^{−(α/r) h (e^{rτ} − 1)}. by subtracting the expected number of infections after age τ from the expected total number of infections by one herd, we get the expected number of infections until age τ. in the same way we can deduce µ_MF(τ; h) = (β_MF/α)(1 − e^{−(α/r) h (e^{rτ} − 1)}). now consider the probability p^F_{kl} that a finisher infects k multipliers and l finishers. all events (infections of multipliers, infections of finishers and detection of an infected herd) happen at a rate proportional to the number of infective animals in the infective herd. therefore, the rates are all proportional to each other.
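The two expressions just derived — the non-detection probability and the expected number of infections up to age τ — are straightforward to evaluate numerically. A minimal sketch (symbols as in the text; the numeric values in the comments are arbitrary):

```python
import math

def p_not_detected(alpha, r, h, tau):
    """P(herd of age tau not yet detected) = exp(-(alpha/r) * h * (e^{r*tau} - 1))."""
    return math.exp(-(alpha / r) * h * (math.exp(r * tau) - 1.0))

def mu_mm(beta_m, alpha, r, h, tau):
    """Expected number of multipliers infected by one infective multiplier
    up to age tau: (beta_M / alpha) * (1 - p_not_detected)."""
    return (beta_m / alpha) * (1.0 - p_not_detected(alpha, r, h, tau))
```

As τ → ∞ the non-detection probability vanishes and µ_MM saturates at β_M/α, the expected total offspring implied by the geometric-events argument.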
first, we only consider infections and detections, and we do not yet consider the different types of herds infected. detection occurs with rate αh e^{rτ} and infection occurs with rate β_M h e^{rτ} + β_F h e^{rτ} = βh e^{rτ}. as in the proof of the proposition, we can describe the total number of events, D say, by an ordinary geometric random variable with parameter α/(α + β). so the probability that n + 1 events occur, i.e. n herds are infected by one finisher, is (β/(α + β))^n (α/(α + β)). if in total n herds are infected by one finisher, we know by the lack of memory property that the number of infected multipliers, N_FM, is binomially distributed with parameters n and β_M/β. that is, p^F_{kl} = ((k + l) choose k) (β_M/(α + β))^k (β_F/(α + β))^l (α/(α + β)); the corresponding probabilities for an infective multiplier can be deduced in the same way. note that we did not need the distribution of the random variable H, we only needed the proportionality factors of the parameters. in the special case where infection and detection rates are proportional and these rates are known for the time before the first detection, we can also give the distribution function of the number of infective herds at the time of the first detection. this distribution function is the same as the distribution function of the number of direct infections by one herd, because it still holds that an event is a detection with probability α/(α + β) and the probabilities of infections of finishers and multipliers are respectively β_F/(α + β) and β_M/(α + β). we are also interested in the probability that a herd infects k multipliers and l finishers before the herd reaches age τ. we consider a finisher. we write P(N_FM(τ; h) = k, N_FF(τ; h) = l) for this probability when the detection age τ_D and H = h are given. the infections before age τ_D occur independently of each other and have the lack of memory property. therefore, for τ ≤ τ_D the times of infections of the different types of herds are described by an inhomogeneous poisson process with rate β_ξ h e^{rτ}. so the number of infections until age τ is poisson distributed with parameter h β_ξ ∫_0^τ e^{rs} ds = (h β_ξ/r)(e^{rτ} − 1).
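Since the embedded Galton-Watson process has geometric offspring, P(n infections) = (β/(α + β))^n · α/(α + β), its generating function is f(s) = α/(α + β − βs), and the extinction probability is the smallest fixed point of f, which works out in closed form to min(1, α/β). The sketch below is not the paper's code; parameter values in the usage note are illustrative.

```python
def offspring_pmf(n, alpha, beta):
    """P(an infective herd infects exactly n herds before its detection)."""
    return (beta / (alpha + beta)) ** n * alpha / (alpha + beta)

def extinction_probability(alpha, beta, tol=1e-12, max_iter=100000):
    """Iterate q <- f(q) with f(s) = alpha / (alpha + beta - beta * s),
    starting from 0; the iteration converges to the smallest fixed point."""
    q = 0.0
    for _ in range(max_iter):
        q_new = alpha / (alpha + beta - beta * q)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q
```

For β ≤ α the iteration returns (numerically) 1, while for β > α it returns α/β; a major outbreak among herds is therefore possible exactly when the herd-level infection rate exceeds the detection rate.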
now that we know the life-length distribution, the expected number of infections by a herd up to age τ, and the underlying galton-watson process of the branching process, we can use the general theory of branching processes to determine some other properties, like the expected duration of the outbreak ([ ]). we use a branching idea to describe the spread of the infection among different herds. for this approach, we assume that one herd has contacts with many other herds. in our model we do not take the spatial distribution of the herds into account. as long as there are many herds in an area, the local exhaustion of susceptible herds can be ignored. if the outbreak is in an advanced state, however, local depletion of susceptible herds (either by the epidemic's progression or by so-called pre-emptive culling) will certainly play a role. as long as there are relatively few infected herds in a neighbourhood (i.e., a group of herds that have contacts with each other), we assume that the infection rate does not directly depend on this number of infected herds. measures like ring culling (culling of all herds within a certain distance of an infected herd) and ring vaccination cannot be considered in our model. we distinguish between two types of contacts: transport contacts and indirect contacts. we have assumed that the indirect contacts are the same for all herds, while transport contacts occur only from multipliers to finishers. indirect contacts include transport of the virus by wind, visits from an infective herd to a susceptible herd, etc.; transport contacts are transports of infected animals from a multiplier to a finisher. we simplify the model by assuming that the measures are implemented at the time of the introduction of the virus into the population (or, in other words, that the infection is detected immediately). it would be desirable to implement the measures from the moment of first detection.
it is difficult, however, to estimate parameters like α and β for the time before the first detection, when awareness has not yet been heightened by announcement of the outbreak and increased hygienic measures on farms have not yet been taken. if we knew the values of the model parameters for the period between introduction and first detection, we would be able to estimate the probability of extinction and the expected final size in that situation; however, these calculations still require the information from our simplified model, in which measures start directly upon introduction. the function μ_mm(τ;h) is not required for our calculations of the final size and the probability of extinction. the reason we include it is that it is interesting in its own right: it gives the expected offspring τ time units after an infective herd was itself infected. in practical applications, contact tracing may suggest that some herd was infected τ time units ago; μ_mm(τ;h) is then the expected number of multipliers infected by the suspected herd, if that herd is a multiplier and really is infected. the proportionality of the infection rate to the detection rate is essential in this section. due to this property, we can find an underlying ordinary galton-watson process. without this proportionality, it is questionable whether a nice expression for the generating function is possible; we would also lose the independence of H in the generating function, which causes computational problems. knowing the generating function of the number of infected herds at the first detection is important because, if in some way it is possible to estimate parameters for the time before the first detection, we can use the distribution function of the number of infective herds at the time measures are implemented.
we do not have the 'ages' of the infective herds at the first detection, but we can find a worst-case scenario by finding which age of a herd at t = 0 leads to the largest offspring. if the detection and infection rates do not depend in the same way on the number of infected animals, we cannot deduce the distribution function at the time of the first detection in this way. in the previous section, we considered the model for the spread of an infectious disease in non-varying environments, and we made heavy use of the proportionality of the infection and detection rates of herds. this proportionality does not hold in a varying environment, so we need a different approach. we consider only one type of herd; the multi-type model is a straightforward generalisation. we assume that the spread within a herd is not influenced by the state of the environment. the detection rate depends only on the number of infected animals in a herd and is written αI(τ;h), where I(τ;h) is the number of infected animals as described earlier. in this section, we assume that the within-herd spread is described by a birth-and-death process, so I(τ;h) = he^{rτ}; other descriptions of the within-herd spread can easily be dealt with. the infection rate may change due to control measures. some measures lead to an infection rate that changes only finitely many times; other measures cause a continuously varying infection rate. we assume that in either case the environment is constant after some given time t = t₁, the moment at which the measures have no further added value and therefore lead to no new changes in the values of the model parameters. the infection rate is proportional to φ(t), where φ(t) describes the effects of the measures implemented. we assume that φ(t) is deterministic. because the environment does not influence the within-herd spread, φ(t) does not depend on the age τ. for the rest of this section it is assumed that φ(t) = 1 for t ≥ t₁.
the assumption of a constant environment after some given time is often realistic. with vaccination, for instance, we know the moment after which no vaccinated animals remain alive; for other control measures, like a transport ban, we can vary the time t₁ and compute the effects. we are looking for the probability of extinction of the infection, the expected final size, and a generating function for the number of infected herds at a certain moment. to do this we use a discrete approximation of φ(t), so that there are only finitely many changes. because the environment is non-varying after a certain moment, we can use the ordinary theory of branching processes to get all properties of interest for herds infected after t = t₁. by backward iteration we then find the relevant properties of the epidemic for a herd infected in an earlier interval, because these properties depend only on what happens in the intervals after the interval of infection; in particular we can compute these properties for the herd infected at t = 0. because an infected herd will almost surely be detected in finite time, the probability of extinction of the 'progeny' of an infected herd x is equal to the probability of extinction of the progeny of all the herds infected by x. the probability of extinction of the progeny of a herd infected at time t is denoted by q(t), so q(t) = E[∏_i q(t_i)], where the t_i are the times at which herds are infected by the herd infected at time t, the expectation is over these random infection times, and the empty product is defined as 1. in order to make computations possible we use a discrete-time approximation. we divide the positive real line into N+1 intervals, labelled 1, 2, ..., N+1. the final interval N+1 is (t₁, ∞); in this final interval all the model parameters are constant. the time t = 0, the moment of the first infection, is not included in any of these intervals; in our notation we treat this point as interval 0.
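the constant-environment building block used here is the classical galton-watson extinction probability: the smallest fixed point of the offspring generating function. the sketch below (illustrative only; parameter values arbitrary) iterates q ← G(q) from 0 for the geometric offspring law of the previous section, for which the fixed point has the closed form α/β when β > α:

```python
def extinction_prob(pgf, tol=1e-12, max_iter=100_000):
    """smallest fixed point of an offspring pgf: iterate q <- G(q) from 0."""
    q = 0.0
    for _ in range(max_iter):
        q_new = pgf(q)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q

# geometric offspring: an event is a detection with probability
# p = alpha/(alpha + beta), so G(s) = p / (1 - (1 - p)*s).
alpha, beta = 1.0, 2.0          # arbitrary values; mean offspring beta/alpha = 2
p = alpha / (alpha + beta)
q = extinction_prob(lambda s: p / (1.0 - (1.0 - p) * s))
# for beta > alpha the extinction probability equals alpha/beta = 0.5
```

iterating from 0 is what guarantees convergence to the smallest fixed point rather than to the trivial fixed point 1.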
if the function φ(t) is discontinuous, we may choose the discretisation so that the discontinuities lie on the boundaries of intervals; it is not essential that all intervals have the same length. for t in interval i, 1 ≤ i ≤ N, q(t) and φ(t) are approximately constant. in our "discrete-time model" we write q(i) and φ(i) for the values of q and φ in interval i; for instance, we can take φ(i) to be the value of φ(t) at the midpoint of interval i. because we accept discontinuities in φ(t), this function may differ significantly between neighbouring intervals. now, with N(i,j) the number of infections in interval j due to one herd infected in interval i, we can write the extinction recursion as q(i) = E[∏_{j≥i} q(j)^{N(i,j)}], where the expectation is over the numbers of infections in each interval; q(0) is the probability of extinction of the progeny of the herd infected at time t = 0. we assume that all infections and detections take place at the midpoint of their interval, except of course for events in the final interval, where everything can be computed explicitly. given that a herd is not yet detected and H = h, the probability of infection in a certain interval by that herd does not depend on the number of infections in previous intervals. consider a herd infected in interval i, and denote by D(i;h) = k the event that this particular herd is detected in interval k, given H = h. after detection no further infections occur. define N(i,l,k;h) as the number of infections in interval l due to a particular herd infected in interval i, given H = h and detection interval k. using independence (and writing P_H for the distribution function of H), we obtain an expression in which the expectation inside the integral depends on h, contrary to the non-varying environment case. note that the number of infections in a certain interval is not independent of the interval of detection.
it makes no difference for the number of infections whether a herd is detected shortly after the considered interval or long after it, but it does matter whether the detection falls in the considered interval itself. so p(n; i, l, k; h) := P(N(i,l,k;h) = n) depends on the detection interval k. for computational reasons we suppose in this discrete model that, for i ≤ N, p(0; i, i, k; h) = 1, i.e. a herd causes no infections in the interval in which it is itself infected. we write p_det(i,k;h) for P(D(i;h) = k). we are interested in q := q(0), and we can easily compute all of these probabilities in our model. because we assume a birth-and-death process for the within-herd spread, we have to take the random character of H into account. conditioning on the time of detection and on H then leads to a formula of the form q(i) = ∫ Σ_{k≥i} [∏_{l=i}^{k} E(q(l)^{N(i,l,k;h)})] p_det(i,k;h) dP_H(h). here q(i) depends on q(l) for l > i. as mentioned before, we can compute q(N+1) by the ordinary theory of branching processes, and we use backward iteration to compute q(0). note that we then also know the probability of extinction for herds infected after time t = 0; if we estimate (for example by contact tracing) the moment of infection of a certain herd, we can use this information to improve predictions. in almost the same way we can calculate the expected final size of the epidemic, i.e. the expected total number G of infected herds; again we do this for only one type of herd. if q < 1, there is a positive probability that the final size is infinite, and therefore the expected final size is infinite; so, to allow a finite expected final size, we assume q = 1. we denote by G(t) the size of the progeny of a particular herd infected at time t (including this ancestor) and write G(0) = G. the expected number of herds in the progeny of herd x, including x itself, is 1 plus the sum of the expected sizes of the progenies of all herds infected by x, i.e.
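the backward iteration can be illustrated on a deliberately simplified process (not the paper's full model, which also integrates over h and the detection interval): individuals born in interval i produce a poisson(m_i) number of offspring, all born in interval i+1, and from the final interval on the offspring mean is constant. then q(i) = exp(m_i·(q(i+1) − 1)), computed backwards from the constant-environment tail:

```python
import math

def extinction_profile(ms, m_final):
    """toy branching process in a varying environment: an individual born in
    interval i has a Poisson(ms[i]) number of offspring, all born in interval
    i + 1; from the final interval on the offspring mean is m_final.
    returns [q(0), ..., q(N), q(N+1)] by backward iteration."""
    # constant-environment tail: smallest fixed point of q = exp(m_final*(q-1))
    q = 0.0
    for _ in range(10_000):
        q = math.exp(m_final * (q - 1.0))
    qs = [0.0] * (len(ms) + 1)
    qs[-1] = q
    for i in range(len(ms) - 1, -1, -1):  # backward iteration
        qs[i] = math.exp(ms[i] * (qs[i + 1] - 1.0))
    return qs

# arbitrary illustrative values: low transmission early, supercritical tail
qs = extinction_profile([0.5, 1.0], 2.0)
```

as in the text, the early low-transmission intervals raise the extinction probability of a herd infected at t = 0 well above that of a herd infected after the environment has become constant.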
where the t_i are again the random times at which infections by the herd infected at time t occur; the empty sum is defined to be 0. in the same way as we deduced the formulae for q(i), we deduce formulae for E(G(i)) in the discrete approximation, writing Ḡ(i) for E(G(i)). for α > β, Ḡ(N+1) is known from the ordinary theory of branching processes (see e.g. [ ]): for φ(N+1) = 1, Ḡ(N+1) = α/(α−β). we now give the generating function for the numbers of infective and infected herds at a given moment; here we consider this generating function at time t = t₁, the time after which the parameter values are considered constant. remember that the age of an infective herd influences only the number of infective animals in that herd. from the proposition of the earlier section, we know that the expected direct offspring of a herd, produced after t = t₁, is independent of the age of that herd, given that the herd is not yet detected at that time. furthermore, the size of the offspring (infected after time t = t₁) of a herd that is infective at that time does not depend on the offspring of the other herds infective at time t = t₁. so, for the distribution of the number of infections after time t = t₁, only the distribution of the number of infective herds at that time matters. using the proposition and the theory of ordinary branching processes, we can compute everything we want to know. we will determine the distribution of X, the number of infective herds at time t = t₁. let X_i be the number of infective herds at time t₁ in the progeny of one particular herd infected in interval i (the herd itself being part of its own progeny); we denote X_0 by X. we also write Y_i for the number of infected herds in the progeny of a particular herd infected in interval i that are detected before time t = t₁. we write g̃_i(s₁, s₂; h) = E(s₁^{X_i} s₂^{Y_i} | h) for the generating function of the joint distribution of X_i and Y_i.
we again assume that a herd does not infect other herds in the interval in which it becomes infected itself. this gives a recursion for g̃_i in terms of the generating functions of later intervals, and for interval N+1 everything is known explicitly, so we can determine g̃_0(s₁, s₂) pointwise. note that we only need to compute g̃_0(s₁, s₂) at m+1 points to give a good approximation of the first m derivatives of this function at a certain point. with these derivatives we can compute the first m moments of the size at time t = t₁, or approximate P(X = n) for all n ≤ m. with this generating function we can also compute the probability of extinction and the expected final size of the epidemic, by the same reasoning as before: the progenies of all herds infective at time t = t₁ have to go extinct, which occurs with probability q^X, so the probability of extinction is Σ_{k≥0} P(X = k)(q(N+1))^k = g̃_0(q(N+1), 1). note that with these values of s₁ and s₂ the model is exactly the same as the model of the earlier section. for the expected final size we add the expected number of infected herds already detected to the expected size of the progeny of all the herds infective at time t = t₁ (again including the herds infective at that time). we can only use this property if, after time t = t₁, the infection and detection rates are proportional to each other, because otherwise we would need to know the ages of the infective herds at time t = t₁. in a non-varying environment we know, for q = 1, the "speed" at which P(Z(t) > 0) decreases as t → ∞, where Z(t) is the number of infective individuals at time t descending from one individual infected at time t = 0 [ ]. so we can give an upper bound for the probability that the disease is already extinct at a certain time. we do this by assuming that all infective herds at time t = t₁ were infected at that time, so that all herds at time t₁ have age τ = 0. with the notation s_t for the probability that the progeny of a herd infected at time t₁ does not survive until time t₁ + t, and Z(t) for the number of infected herds at time t, we have P(Z(t₁ + t) = 0) = g̃_0(s_t, 1).
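the claim that m+1 evaluations of the generating function approximate its first m derivatives can be illustrated with a one-sided finite difference at s = 1, which recovers the mean; the sketch below (illustrative only) uses a poisson pgf as a stand-in for g̃:

```python
import math

def pgf_mean(G, h=1e-6):
    """first moment of a distribution from its pgf: E[X] = G'(1), here
    approximated by a one-sided finite difference at s = 1."""
    return (G(1.0) - G(1.0 - h)) / h

lam = 3.0
poisson_pgf = lambda s: math.exp(lam * (s - 1.0))  # stand-in for g~
m1 = pgf_mean(poisson_pgf)  # should be close to lam
```

higher moments follow from higher-order differences in the same way, which is why a handful of pointwise evaluations of g̃_0 suffices in practice.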
by using this iterative method we can compute the expected final size of an outbreak very fast, but computing higher moments of this final size requires substantially more computational effort; by simulation it is possible to estimate these higher moments too (see [ ]). the lack-of-memory property is very important for our computations. we used it to write the formula for q(i) in a convenient form, with expectations in front of every q(l)^n, and the speed of the computations also depends heavily on the lack-of-memory property of the infection rate. in each interval the number of infections by one herd is poisson distributed, and for poisson-distributed numbers of infections we can easily integrate h out of the formula for q(i) (with the given distribution of H). in this section we use the dutch classical swine fever epidemic as an example, with the same data as klinkenberg et al. [ ]. however, klinkenberg et al. used simulations, with the parameters of every simulation (pseudo-)randomly chosen from the distributions of the parameters, while we use only point estimates of the parameters in our computations. we consider the following set of control measures: (a) a total transport prohibition; (b) killing of young piglets, in combination with a breeding ban; (c) vaccination of all piglets (not sows) at multiplier herds, followed by recurrent vaccination of newborn piglets; (d) single vaccination of all pigs at finishing herds; (e) vaccination of piglets on arrival at finishing herds. choosing a (combination of) measure(s) is called a control strategy or scenario. the effects of the different scenarios on the fraction of infective animals in a multiplier, in a finisher, and on the possibility of transport infections (φ_m(t), φ_f(t) and φ_tr(t), respectively) are given in a table adapted from [ ]. for example, the value 0 for φ_tr in strategy (a) means that infection by transport contacts is not possible.
as another example, φ_f(t) = t/… for t ≤ … in strategy (d) means that at time t a fraction t/… of the animals in a finisher is infective. note that the first strategies take place in a constant environment; for those strategies we can therefore compute properties of the spread of the disease directly. the probability of extinction is given in a table, and we also compute the expected final size of an epidemic; we only need to compute this for the strategies with almost-sure extinction. the expected number of infected multipliers when the initially infected herd was a finisher is denoted by G_fm; G_mm, G_mf and G_ff are defined similarly. the last column of the table is the expected number of infected herds when initially five multipliers and five finishers were infected. note that we cannot compare the expected final size with the results of klinkenberg et al. (in [ ]), because they use the median of simulations, not the mean. the results of klinkenberg et al., with the confidence intervals of the final size estimated by simulation, are also reproduced from [ ]. (table: the effect of different strategies on φ_f(t), φ_m(t) and φ_tr(t).) we used the generating function for the number of infected and infective herds at time t = t₁, the point in time after which the parameter values are assumed to be constant, to estimate the size at time t₁ (which varies between strategies). we estimate the first two moments of the size at time t₁ for the four pure strategies in a varying environment, (b), (c), (d) and (e). especially the results of (b) and (c) are of interest, because they give an idea of the variance of the final size.
next we give covariance matrices (for different strategies and different initially infected herds) of the number of infected multipliers not yet detected, the number of infected multipliers already detected, the number of infected finishers not yet detected and the number of infected finishers already detected, respectively. we denote by Var_i(j) the covariance matrix for an initially infected herd of type i under strategy j. by comparing the values in these matrices with the expected sizes, we know how much trust we may put in those expectations. up to now we used exact parameters; in reality we cannot estimate the parameters exactly. in order to get some insight into how the computed properties depend on the values of the different parameters, we varied one parameter while keeping the others constant. the results for strategies (b) and (d) are shown in the figures; note that the scales on the axes differ for the different varying parameters. on the x-axes we put the ratio between the value of the parameter and the point estimate of that parameter. we see that for these strategies the parameter r and the parameter describing the random effects at the start of the within-herd spread have relatively little influence on the computed quantities, while α and β do have significant influence. in a constant environment the only thing that matters is the ratio β/α. for varying environments we varied α while keeping β/α and β_tr/α constant; the results vary, but not much. because the expected final size, the probability of extinction and the generating function for the number of infective herds at a certain moment depend heavily on some of the parameters, which we cannot estimate exactly, we cannot really use the computed values in a quantitative way; but we can use them to compare different scenarios under the same parameter values.
we computed the probability of extinction for the progeny of one infective herd, but in reality there may be more than one infected herd at the moment the first measures are implemented, so the probability of extinction may be much smaller than the computed q. we assumed that different infected herds infect other herds independently of each other. if we simplify the model by assuming that all herds at time t = 0 have age 0, we can estimate the expected final size as the number of initially infected herds times the estimated final size for one infected herd, and the probability of extinction as the computed probability for one herd raised to the power of the number of initially infected herds. from previous outbreaks of fmd, csf and ai in the netherlands, one can see that, as a rule, several farms are already infected at the moment the first case is suspected or confirmed. our results are very much the same as those of klinkenberg et al., but we do not need simulations to estimate the properties, and our method is much faster than simulation. in [ ] it is suggested that the expected final size with initially five infected multipliers and five infected finishers is estimated by simulation; in reality the median of the simulations is used in that paper, which explains why our results differ on that point. note that strategy abc (the combination of measures a, b and c) gives an extremely high expected final size; klinkenberg et al. [ ] did not even report it. this is because, for the given parameters, the process is very near to a process with a positive probability of surviving, i.e. a positive probability of an infinite final size. using the expected size at t = t₁ and the variance of this size, we see that the expected final size is not very informative in some scenarios: a large variance may imply that a large set of "numbers of infected herds" has a non-negligible probability, even if we know the exact parameters.
the computed covariance matrices contain values of different orders of magnitude. consider for instance strategy b: the variance of the number of finishers and multipliers already detected is much higher than the variance of the number of finishers and multipliers still infective at time t = t₁. this is because the infection rate decreases under this strategy: at the start of the process one infective herd may infect several other herds with relatively high probability, while after some time infections become rare. this is also why the expected number of infective herds at t = t₁ is much smaller than the expected number of infected herds. for constant environments we are interested in the variance of the final size. note that in all strategies with almost-sure extinction in a constant environment only one type of herd can be infected, so we only have to deal with one type. we use the underlying galton-watson process to deduce the variance of the final size (see [ ]). if m is the expected number of direct infections by one infective herd, the expected final size is 1/(1−m) and the variance is m(1+m)/(1−m)³. note that for a large expected final size, the variance will be of the order of the square of the final size. due to this large variance we cannot give exact quantitative predictions about the final size of an epidemic; this also indicates a large intrinsic uncertainty in the problem. if the infection rates are increasing in time, the expected final size will decrease for increasing r, while for decreasing infection rates the expected final size will increase for increasing r. this is because r can be seen as a parameter describing the speed of the process in a constant environment: large values of r correspond to short generation lengths compared to small r. so, if the infection rate is increasing in time, the n-th generation for small values of r will experience a larger infection rate than the n-th generation for large r.
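the formulas 1/(1−m) and m(1+m)/(1−m)³ can be checked by simulating the total progeny of the galton-watson process with the geometric offspring law of this model (illustrative sketch; parameter values arbitrary):

```python
import random

def total_progeny(p, rng):
    """total size (ancestor included) of a subcritical galton-watson tree
    whose offspring law is geometric: each event is an offspring with
    probability 1 - p and the terminating detection with probability
    p = alpha/(alpha + beta)."""
    alive, total = 1, 1
    while alive:
        kids = 0
        while rng.random() > p:
            kids += 1
        alive += kids - 1
        total += kids
    return total

rng = random.Random(3)
alpha, beta = 2.0, 1.0                 # mean offspring m = beta/alpha = 0.5
p = alpha / (alpha + beta)
sizes = [total_progeny(p, rng) for _ in range(200_000)]
n = len(sizes)
mean_size = sum(sizes) / n             # theory: 1/(1 - m) = 2
var_size = sum((s - mean_size) ** 2 for s in sizes) / n
# theory: m*(1 + m)/(1 - m)**3 = 6
```

with m = 0.5 the standard deviation (about 2.4) exceeds the mean (2), already illustrating the large intrinsic uncertainty discussed above.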
the effect of decreasing r, the parameter describing the random effects at the start of the within-herd spread, is less clear from formulae and definitions. from the figures we can only see that r has relative small effect on the expected final size and the probability of extinction. . some parameters heavily influence the expected final size and the probability of extinction. we have already seen that in a constant environment r and r do not influence these quantities. for predictions the ratios β : β tr : α are most important. so the estimation effort is best devoted towards estimating these ratios. by using a stochastic model, we could estimate the probability of extinction and the expectation of the final size of an epidemic in a varying environment. we only need that the environment is not varying anymore after some given time. we use a branching process in varying environments, depending on the age of the particles. in the constant environment the probability of extinction and the final size are known. by using an iterative process, we computed this probability and size for the time the environment is still varying. by using generating functions, we can compute many important properties, like the moments of the final size, and a lower bound for the probability that the process has gone extinct after a given time. these generating functions are very useful especially if in a constant environment the expected number of infections after a certain age does not depend on that age. this only holds if the infection and detection rate are in direct proportion, for the non-varying environment. it is difficult to estimate the parameters for the computations. some of the properties computed in this model, like the probability of extinction heavily depend on the parameters α and β, describing the infection and detection rate. 
the model presented in this paper may be useful to compare the effects of different measures, but it is very dangerous to use this model for absolute quantitative predictions. we used the model to describe the spread of classical swine fever. it is worth investigating whether the same model, with other parameters, may be used to describe the spread of other animal diseases with culling at detection, or even human diseases that lead to strict quarantine of detected infected individuals (and suspected cases) and to contact and movement restrictions; emerging infections such as sars are possible examples of this.

references:
- quantification of the effects of control strategies on classical swine fever epidemics
- mathematical epidemiology of infectious diseases
- probability and random processes
- branching processes with biological applications
- the foot-and-mouth epidemic in great britain: pattern of spread and impact of interventions
- transmission intensity and impact of control policies on the foot and mouth epidemic in great-britain
- branching processes with deteriorating random environments
- the role of mathematical modelling in the control of the fmd epidemic in the uk
- dynamics of the uk foot and mouth epidemic - dispersal in a heterogeneous landscape
- modelling and prediction of classical swine fever epidemics
- quantification of the transmission of classical swine fever virus between herds during the - epidemic in the netherlands

key: cord- -zukjh hr
authors: feng, zhilan; glasser, john w.; hill, andrew n.
title: on the benefits of flattening the curve: a perspective
date: - -
journal: math biosci
doi: . /j.mbs. .
cord_uid: zukjh hr

the many variations on a graphic illustrating the impact of non-pharmaceutical measures to mitigate pandemic influenza that have appeared in recent news reports about covid-19 suggest a need to better explain the mechanism by which social distancing reduces the spread of infectious diseases.
and some reports understate one benefit of reducing the frequency or proximity of interpersonal encounters: a reduction in the total number of infections. in hopes that understanding will increase compliance, we describe how social distancing (a) reduces the peak incidence of infections, (b) delays the occurrence of this peak, and (c) reduces the total number of infections during epidemics. in view of the extraordinary efforts underway to identify existing medications that are active against sars-cov-2 and to develop new antiviral drugs, vaccines and antibody therapies, any of which may have community-level effects, we also describe how pharmaceutical interventions affect transmission.

highlights:
- social distancing refers to non-pharmaceutical measures to reduce the frequency or proximity of interpersonal encounters
- the impact of these measures on epidemic curves is commonly misrepresented, suggesting a lack of understanding of the underlying mechanisms
- as this may affect compliance with recommendations, we describe determinants of the magnitude and timing of peak incidence and the total number of infections
- we also describe possible population-level effects of pharmaceutical interventions

the most recent community mitigation guidelines for pandemic influenza (qualls et al.) include a graphic illustrating the goals of non-pharmaceutical interventions: slowing the rate at which new infections occur (incidence), reducing the peak number of infected people (prevalence) and the concomitant demands on healthcare facilities and personnel, and decreasing overall infections and deaths. antimicrobial drugs generally are administered solely for the benefit of individual patients, but may also reduce the magnitude or duration of infectiousness, affecting transmission to others. similarly, vaccines and monoclonal antibodies may reduce infectiousness as well as susceptibility.
accordingly, we include pharmaceutical interventions in our description of the mechanisms underlying efforts to flatten the curve. plots of daily numbers of new infections, called epidemic curves, are affected by several factors, including under-reporting. individual responses to infection by the same pathogen differ, and mild infections are less likely to be reported than those requiring medical attention; absent widespread testing, few asymptomatic infections would be reported. in the ongoing pandemic, people with few if any symptoms may be infectious, diminishing the effectiveness of control measures based solely on symptomatic infections. social distancing (increasing inter-personal distances, reducing the number of people attending meetings) involves everyone, reducing the risk of being infected if susceptible and of infecting others if infectious. assuming that infected people are infectious upon infection, other factors affecting epidemic curves are person-to-person contact rates (i.e., contacts per person per time period) and probabilities of infection on contact between infectious and susceptible people, or infectiousness. the sum over contacts by which susceptible people would be infected, which are products of these rates and probabilities (hethcote), estimates the basic reproduction number, or average number of secondary infections per infected person. the areas beneath epidemic curves represent total numbers of infections. the onset, shape and area beneath these curves all depend on the basic reproduction number, which must be greater than one for an epidemic to occur. the smaller this number, the longer the delay to peak incidence, the lower the peak, and the fewer the total number of infections (figure). both of the basic reproduction number's constituents may be modifiable, contact rates by social distancing and infectiousness by medications, whereupon the result is termed an effective reproduction number. figure: epidemic curves with varying contact rates.
the contact rates decrease from to per day (left to right), corresponding to reproduction numbers ranging from . to . . as these numbers decrease, the percent of the population infected decreases from % to %. thus, social distancing not only delays and diminishes the peak number of infections, it also reduces the total number of infections. while drugs are administered to cure infections or mitigate symptoms, they may also affect the magnitude or duration of infectiousness. clinical trials of remdesivir and other promising antiviral drugs are underway (https://clinicaltrials.gov/ct /results?cond=covid- ), but none has yet been proven effective against sars-cov- . reducing infectiousness is the mechanism underlying the "treatment as prevention" approach to reducing hiv infection rates. treatment of infected people with anti-retroviral drugs reduces their viral load, preventing their progression to aids. authorities believe that it also reduces their infectiousness, which would lower the effective reproduction number. this is the rationale for the us national initiative, "ending the hiv epidemic: a plan for america" (https://www.hiv.gov/federal-response/ending-the-hiv-epidemic/overview). similarly, unprecedented efforts to develop vaccines (le et al. ) or monoclonal antibodies (e.g., wang et al. ) against sars-cov- also are underway. should any prove safe and effective, they may reduce infectiousness as well as susceptibility to infection. but the only way to flatten epidemic curves without also affecting the areas beneath them - as some figures in the media have suggested - is to change the mean interval between being infected and infecting others, or generation time. the pathogenesis of infectious diseases typically is characterized by replication prior to dissemination (nash et al. ), so that people are not infectious for a while after being infected, called the latent period. suppose - for the sake of argument - that we could modify this period (figure ). figure .
epidemic curves with varying latency. intervals from infection to the onset of infectiousness or latent periods increase from to in -day increments (left to right), all corresponding to a basic reproduction number of . . note that the thick red curves are identical in both figures and that, in this one, while the curves differ in shape, the same total percent of the population, %, is infected. should some infected people not live long enough to cause the average number of secondary infections, the basic reproduction number and generation time would no longer be independent. be that as it may, the goal is not just to delay and reduce the peak number of infections, possibly below surge capacity as some figures in the media have suggested, it is also to reduce the total number of infections. social distancing accomplishes all three objectives: a) reducing the peak incidence of infections, b) delaying the occurrence of this peak, and c) reducing the total number of infections during epidemics. in the appendix model, a is the contact rate (contacts per person per time period), another parameter is the probability of infection on contact between a susceptible and an infectious person, and n is the total population size.
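the claim above - that a smaller reproduction number both lowers the peak and shrinks the area beneath the epidemic curve - can be checked numerically with the compartmental structure described in the appendix (susceptible, exposed, infectious, recovered, dead). the sketch below is illustrative only: every numeric parameter value (transmission rate, 3-day latent period, 7-day infectious period, 99% survival) is an assumption chosen for the example, not a figure taken from the paper.

```python
# Illustrative SEIRD simulation (forward Euler). All parameter values are
# assumed for demonstration purposes; they are not taken from the paper.

def simulate_seird(beta, days=300, dt=0.05, n=1_000_000,
                   sigma=1/3,   # 1/latent period (assumed 3 days)
                   gamma=1/7,   # 1/infectious period (assumed 7 days)
                   p=0.99):     # assumed proportion surviving
    s, e, i, r, d = n - 1.0, 0.0, 1.0, 0.0, 0.0
    peak_i = 0.0
    for _ in range(int(days / dt)):
        lam = beta * i / n                      # force of infection
        ds = -lam * s
        de = lam * s - sigma * e
        di = sigma * e - gamma * i
        dr = p * gamma * i
        dd = (1.0 - p) * gamma * i              # disease-induced deaths
        s, e, i, r, d = s + ds*dt, e + de*dt, i + di*dt, r + dr*dt, d + dd*dt
        peak_i = max(peak_i, i)
    total_infected = n - s
    return peak_i, total_infected

# a lower transmission rate gives both a lower peak and fewer total infections
peak_hi, total_hi = simulate_seird(beta=0.50)   # R0 = beta/gamma = 3.5
peak_lo, total_lo = simulate_seird(beta=0.25)   # R0 = 1.75
```

with these assumed values the run with the smaller transmission rate produces both a lower peak prevalence and a smaller attack rate, which is exactly the point the text makes about the areas beneath the curves.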
references:
- the mathematics of infectious diseases
- the covid- vaccine development landscape
- substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov- )
- pathogenesis of infectious disease
- community mitigation guidelines to prevent pandemic influenza - united states
- a human monoclonal antibody blocking sars-cov- infection

in the model that we used to generate these figures, the host population is partitioned into those who are susceptible to infection, s; who have been infected, but are not yet infectious, e; who are infectious, i; who have recovered and are immune, r; and who have died, d. in this system of ordinary differential equations, the force or hazard rate of infection, the rate of disease progression (its reciprocal is the latent period), the rate of recovery (its reciprocal is the infectious period), p (the proportion surviving; thus, disease-induced mortality is 1 - p), and a (the contact rate) are the parameters.

key: cord- -fzyamd v authors: peiro-garcia, alejandro; corominas, laura; coelho, alexandre; desena-decabo, lidia; torner-rubies, ferran; fontecha, cesar g. title: how the covid- pandemic is affecting paediatric orthopaedics practice: a preliminary report date: - - journal: j child orthop doi: . / - . . sha: doc_id: cord_uid: fzyamd v purpose: since the state of alarm was decreed in spain on march , the coronavirus disease (covid- ) pandemic has had an extraordinary impact in paediatric hospitals. this study shows the effect of the pandemic on our practice in paediatric orthopaedics in a referral third level paediatric hospital. methods: we performed a single-centre retrospective review of the official census from a third level paediatric hospital from march to april for the years , and . results: the patients seen in our clinic during this period in decreased in by % (p < . ) compared with and ; however, the number of telemedicine consultations increased by . % (p < . ).
the total number of patients attending the clinic (including onsite and virtual) was reduced by . % (p < . ). the total surgeries performed plummeted by % in this period in (p < . ) due to a reduction in elective cases of . % (p < . ). no significant decrease was found in the number of urgent surgical cases per day in (p = . ). finally, the number of orthopaedic patients admitted to our emergency department dropped by . % during the state of alarm (p < . ). conclusion: according to our results, the pandemic has significantly affected our daily practice by decreasing elective surgeries and onsite clinics, but other activities have increased. as we have implemented telemedicine and new technologies to adapt to this setback, we should take advantage of the situation to change our practice in the future to better allocate our health resources and to anticipate outbreaks. published without peer review. level of evidence: iv.

since december , a novel coronavirus disease (covid- ), caused by severe acute respiratory syndrome coronavirus (sars-cov- ), has emerged from wuhan, hubei province, china. paediatric clinical manifestations are atypical and relatively milder compared with those in adult patients. as covid- is having a major impact on all aspects of healthcare delivery worldwide, it has created a unique challenge for children's hospitals regarding patients' and health workers' safety and their role in containing the spread of covid- through the community. in this context, paediatric hospitals around the world are not on the 'frontline', and as a consequence, these hospitals have become centres for all paediatric pathology from other health areas, since adult and general hospitals needed to allocate all their health resources to the assistance of adults with covid- infection.
besides, changes to orthopaedic clinical practice have been largely guided by three main, overarching principles: a) clinical urgency; b) patient and healthcare worker protection; and c) extraordinary increase of healthcare resources. patients requiring urgent or early orthopaedic care are still being attended to at the earliest possible opportunity without any difference in routine workflows. other elective surgical cases have been postponed allowing hospitals to free up beds and respirators for the treatment of patients with confirmed or suspected covid- . on the other hand, outpatients have been individually screened by clinicians and separated into three levels: level : must be seen in person, clinical issue is urgent and physical exam essential; level : appropriate for a telephone or telemedicine consultation; level : visit should be rescheduled. other considerations taken into account have been to divide teams to avoid contact between the members of the team to decrease the risk of infection. finally, covid- infection is having an extraordinary impact on national health systems around the world, and specifically in monographic paediatric hospitals due to the considerations mentioned above. since the state of alarm was decreed in spain on march , a significant modification of our daily practice has been necessary. likewise, the level of patient flow to external outpatient clinics and the demand for care has changed considerably. this has been previously observed in major televised sport events [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] to the best of our knowledge, this is the first european study in a national and international referral paediatric centre evaluating the effect of the covid- pandemic on paediatric trauma and orthopaedic pathology. daily patient census figures from a third level paediatric hospital (hospital sant joan de deu barcelona) were reviewed retrospectively. 
census data from march to april , including our paediatric orthopaedics outpatient clinic, paediatric trauma emergency department (ed) and paediatric orthopaedic and trauma surgical cases were reviewed to compare the effects of the covid- outbreak. we reviewed both and to avoid the effect of easter holidays during this period in . for the main outcome, the total mean number of cases per day of the three years and three tiers were compared. as secondary outcomes, we also included timeframe of patient visits to the ed, level of triage, type of surgery (elective or urgent) and type of consultation (onsite or telemedicine (telephone or video call depending on patient requirements)). all data were analyzed using ibm spss software (ibm spss inc., armonk, new york). univariate statistical analysis consisted of a student two-tailed t-test to compare the outcomes of mean number of consultations (including onsite and telemedicine), mean number of surgical procedures (including elective and urgent) and emergencies between and and (including triage level). a p-value of < . was considered statistically significant. in all, % confidence intervals were compared to determine significant differences between both years. the number of patients who attended our orthopaedic clinic onsite, per day, has been reduced from a mean of . (sd . / ) patients in this period from to to . (sd . / ) patients in the same period of . this means a decrease of % (mean difference . patients). however, the number of virtual consultations such as phone calls, video conferences and meetings on a virtual platform increased substantially, from a mean of . ( . / . ) patients per day in to to . patients per day. this change represents an increase in telephone/online consultations of . % (p < . ) ( table ). if we consider these phone and onsite consultations, the mean number of patients attending per day in the period was . (sd . / ) patients compared with . (sd . / ) patients, resulting in a reduction of . 
% (p < . ). data referring to patients who attended the clinic are represented in figures and . as seen in figure , the number of clinics in the third week of was decreased due to the easter holidays. no patients appear in the fifth week of because there was only one day with clinics held ( april). in figure , we can see how the number of telemedicine consultations decreases after the first three weeks in because the easter holidays were in the fourth week ( april to april) and some of the clinics were previously rescheduled as during this period some doctors take vacation for easter holidays. if we compare the number of surgical cases during this period in the last two years, in to we performed on mean . (sd . / . ) surgeries per day, including a mean of . (sd . / . ) elective cases and a mean of . ( . / . ) urgent cases; while during the state of alarm, the mean number of surgical cases per day was . (sd . / . ), including . (sd . / . ) elective cases per day and . (sd . / . ) urgent cases per day. this means a reduction of % of the surgical cases per day, mostly due to a reduction of the elective cases of . % (p < . ). however, no significant decrease was found in the mean number of urgent surgical cases per day (p = . ) (table ). among the elective cases, there were three oncological cases, one prominent hardware removal, one symptomatic high-grade spondylolisthesis and one patient with an external fixation for limb deformity who fell and had a loss of correction. all these patients tested negative for covid- . this data is represented in figure .
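the comparisons above rely on a two-tailed student t-test of daily means between years. as a minimal sketch of that computation (the daily patient counts below are made-up illustration data, not the hospital's census), the pooled-variance t statistic can be computed directly; the p-value here uses a normal approximation, which is only adequate for reasonably large samples:

```python
from math import sqrt, erf

def two_sample_t(a, b):
    """Pooled-variance (Student) two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / sqrt(sp2 * (1 / na + 1 / nb))

def approx_p_two_tailed(t):
    """Two-tailed p-value via the normal approximation (large samples only)."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(t) / sqrt(2.0))))

# hypothetical daily patient counts: pre-lockdown years vs. lockdown period
pre_lockdown = [41, 38, 44, 40, 39, 42, 43]
lockdown = [12, 9, 14, 11, 10, 13, 12]
t_stat = two_sample_t(pre_lockdown, lockdown)
p_val = approx_p_two_tailed(t_stat)
```

in practice one would use the t distribution with n_a + n_b - 2 degrees of freedom (e.g., scipy.stats.ttest_ind, as spss does internally) rather than this normal approximation, which understates p-values for small samples.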
as seen in figure , we found a decrease of cases in the third week of due to easter holidays in our country in that year. the census also showed that the number of trauma and orthopaedic patients who attended our paediatric ed plummeted during the state of alarm. the severity of those patients' conditions was higher; however, this was not found to be significant (p = . ). as seen in the previous paragraph, of all the patients seen at the ed, required surgery ( . %), including five wounds, fractures and one case of septic arthritis that required surgical debridement. data from the emergencies attended are represented in figure . covid- has been spreading all over the world during the last few months of . in spain, the first case reported was a tourist visiting la gomera island on january. since then, the number of cases has increased exponentially, and the number of cases in our country now exceeds , with more than deaths by april, according to official data from the spanish government. reported a series of children with suspected or confirmed covid- . the authors found that % of virologically confirmed cases had asymptomatic infection, and this rate almost certainly underestimates the true rate of asymptomatic infection because many children who are asymptomatic are unlikely to be tested. among children who were symptomatic, % had dyspnea or hypoxemia (a substantially lower percentage than the rate reported for adults) and . % progressed to acute respiratory distress syndrome or multi-organ system dysfunction (a rate that is also lower than that seen in adults). preschool-aged children and infants were more likely than older children to have severe clinical manifestations. according to cruz and zeichner, many infectious diseases affect children differently from adults, and understanding those differences can yield important insights into disease pathogenesis, informing management and the development of therapeutics.
this will likely be true for covid- , just as it has been for other infectious diseases. in this uncertain situation, there has been a rush by scientific journals to publish updated data. however, data in the children's population is still scarce. preliminary examination of characteristics of covid- disease among children in the united states suggests that children do not always have a fever or cough as reported signs and symptoms. although most cases reported among children to date have not been severe, clinicians should maintain a high index of suspicion for covid- infection in children and monitor for progression of illness, particularly among infants and children with underlying conditions. from an orthopaedic point of view, we have developed a secondary role in this pandemic; nevertheless, covid- is having a major impact on all aspects of our healthcare delivery. according to chang liang et al, clinicians have been advised to prolong the duration between non-urgent follow-ups to avoid patient overcrowding in hospitals. in their case, all patients are contacted the day before the surgical procedure and are checked for any respiratory symptoms and any risk factors or recent travel history (within days) that might put them at risk for covid- . in place of conventional meetings, both faculty and residents can remotely log on for scheduled teaching sessions online using their laptops or handheld devices. in our practice and according to our hospital covid- protocol, all our elective surgical procedure patients are tested hours prior to the procedure. patients who underwent a surgical procedure were managed according to the protocols and recommendations of the journal of bone & joint surgery and the journal of the american academy of orthopaedic surgeons. furthermore, all health workers in our hospital are tested every ten days even in absence of symptoms. 
in the same vein, according to farrell et al, the covid- pandemic necessities have modified our paediatric orthopaedic practice to protect patients, families and healthcare workers to minimize viral transmission. general principles include limiting procedures to urgent cases such as traumatic injuries, and deferring outpatient visits during the acute phase of the pandemic. according to the authors, nonoperative methods should be considered where possible. for patients with developmental or chronic orthopaedic conditions, it may be possible to delay treatment for two to four months without substantial detrimental long-term impact. we have established this workflow in our daily practice and non-urgent patients have been rescheduled. as our data suggests, the number of virtual consultations (this includes patients for which no complementary exams or physical examination is required) have significantly increased, including telephone consultations, video conferences and virtual platform meetings. despite the number of onsite patients who visited the outpatient clinic decreasing, if we include telemedicine, the total number of patients seen in this period of only decreased around % from the previous years (p < . ). this suggests that telemedicine is a helpful tool for monitoring patients in the clinic without any risk for the patients and caregivers or the staff. from an educational point of view, residents, fellows and consultants can continue their training with video conferences, webinars and online courses. as such, we have implemented video conferencing into our daily rounds and remote access to our computers has been allowed in order to work from home following local ethics and laws. despite our daily workflow having been substantially modified, data from our census showed that we are still efficient in our outpatient workflow with around % of the activity seen in previous years, meaning that our activity has not been completely stopped. 
furthermore, other activities indirectly related to patient assistance have been reinforced during this period. these kinds of activities include an increase in research activities, the establishment and improvement of different protocols and clinical pathways and other office management and administration tasks. during this time, we have had periodic video conferences to plan the work for the next weeks according to daily changes in the pandemic. this has included the management of surgical waiting lists and outpatient clinic lists. although a pandemic is considered a health disaster and, as previously mentioned, has dramatically changed our daily work, it may establish a precedent on how we will work in the future, as the use of new technologies may be more convenient and more efficient for paediatric orthopaedic surgeons. in this context, telemedicine seems to be beneficial for patients from rural and distant areas by eliminating unnecessary travel to the hospital. as telemedicine decreases pressure on our outpatient clinics, this may help to allocate other onsite visits and reduce wait times. [ ] [ ] [ ] [ ] [ ] [ ] on the other hand, although elective surgical procedures have decreased dramatically, we have still operated on urgent cases and no significant differences in the mean number of urgent surgeries were found compared with previous years. the psychological, emotional and educational consequences of the confinement (since march ) are still unknown (school lessons have also switched to virtual education as recommended by the government) but it has also reduced the number of trauma cases derived from falls in the playground, sporting activities or school activities. the fact that the biggest percentage of trauma cases ( . 
%) at our ed were registered in the second shift (from pm to pm) could also suggest that children have followed their daily educational tasks and falls have happened after school time (distance learning at home in this case). finally, despite the number of trauma cases reducing by . % compared with and , children triaged according to the esi triage scale showed no significant differences in severity ( . in and versus . in ) (p = . ). of all the trauma cases admitted to the ed in this period of , patients required surgery ( . % of cases), while in and , only patients from required surgery ( . %). this suggests that despite the number of trauma and orthopaedic cases being lower in , these were more severe cases. this may also suggest that parents and caregivers are only bringing their children to the hospital for a major cause in adverse times such as this pandemic, while in our daily practice more seemingly banal trauma is attended. despite the fact that according to these results it may seem as if paediatric hospitals are not part of the fight against this pandemic, it must be said that all paediatric and obstetrics cases from the rest of the hospitals have been sent to our hospital to free beds for adult patients with covid- related problems. it was also established that, if necessary, adult trauma patients would be sent to our paediatric hospital as most of the adult third level hospitals were overcrowded and occupation of intensive care unit beds was around %. it is noteworthy that countless administration and management tasks have been done as part of indirect clinical duties as well as research and education. staff members were also offered the option to give assistance to other adult and field hospitals if necessary. among study limitations, it should be noted that this is only a retrospective observational study obtained from a third level paediatric hospital, comparing only one month since the state of alarm was established in spain.
the objective is to give a general view of what we are finding and how we are changing our practice in the paediatric orthopaedic field. another limitation is that we are still uncertain as to how this situation will affect our practice in the future, the effects on wait times or if it comes with hefty health resource utilization. for example, in scoliosis patients, time is crucial as a patient's deformities increase over time. previous data suggest that surgical procedures to address larger spinal deformities are associated with increased health resource utilization, such as increased operative times and blood loss. another limitation is that our staff team, formed of senior surgeons, was affected by covid- ; four members were put in quarantine. in the same vein, despite our hospital being a university hospital and receiving residents and fellows from other hospitals, these were required by law to return to their home hospitals to assist with covid- patients. this meant a reduction of residents and three fellows, which, added to the four consultants in quarantine, meant that we went from members to only . furthermore, these consultants were divided into teams as previously mentioned. this may also have affected the significant decrease in some clinical activities. from the authors' point of view, we have to take advantage of this difficult situation to change and improve our practice in the future by adopting telemedicine as part of our practice in order to better allocate our health resources, as this situation may be repeated in future years. as the covid- pandemic has interfered in our daily practice, we have found a decrease in the number of paediatric trauma patients admitted to our ed, the number of patients visiting our paediatric orthopaedic clinic onsite and the number of elective cases compared with other years. however, the percentage of urgent cases requiring surgery has increased.
we also have found an important increase in the number of patients attending by virtual means. this could support a change in our practice in the future by implementing telemedicine as part of our daily work. to the best of our knowledge, this study shows the effect of the covid- pandemic on paediatric orthopaedics practice. no benefits in any form have been received or will be received from a commercial party related directly or indirectly to the subject of this article. this article is distributed under the terms of the creative commons attribution-non commercial . international (cc by-nc . ) licence (https://creativecommons.org/ licenses/by-nc/ . /) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed. ethical approval: no human participants and/or animals are involved in this work. institutional review board/ethics committee approval was not required as this is an official census review. no data or information was reviewed from the patient´s charts. informed consent: informed consent was not required for this study. the authors wish to acknowledge mba institute for their contribution during the covid outbreak. ap-g reports royalties from spineart, outside the submitted work. the other authors declare no conflict of interest. ap-g: data collection, manuscript preparation. lc: manuscript preparation. 
references:
- world health organization: novel coronavirus ( -ncov), situation report -
- clinical course and risk factors for mortality of adult inpatients with covid- in wuhan, china: a retrospective cohort study
- clinical course and mortality risk of severe covid-
- collaborative multidisciplinary incident command at seattle children's hospital for rapid preparatory pediatric surgery countermeasures to the covid- pandemic
- the impact of a major televised sporting event on emergency department census
- ireland in the world cup: trauma orthopaedic workloads
- were attendances to accident and emergency departments in england and australia influenced by the rugby world cup final
- football, television and emergency services
- the impact of the america's cup on fremantle hospital
- major sport championship influence on ed sex census
- a major sporting event does not necessarily mean an increased workload for accident and emergency departments. euro group of accident and emergency departments
- the effect of sporting events on emergency department attendance rates in a district general hospital in northern ireland
- do major televised events affect pediatric emergency department attendances or delay presentation of surgical conditions?
- emergency nurses association. the emergency severity index
- epidemiology of covid- among children in china
- clinical characteristics of hospitalized patients with novel coronavirus-infected pneumonia in wuhan, china
- covid- in children: initial characterization of the pediatric disease
- coronavirus disease in children - united states
- novel coronavirus and orthopaedic surgery: early experiences from singapore
- preparing to perform trauma and orthopaedic surgery on patients with covid-
- peri-operative considerations in urgent surgical care of suspected and confirmed covid- orthopedic patients: operating rooms protocols and recommendations in the current covid- pandemic
- recommendations for the care of pediatric orthopedic patients during the covid pandemic
- effectiveness of an interactive virtual telerehabilitation system in patients after total knee arthroplasty: a randomized controlled trial
- internet-based outpatient telerehabilitation for patients following total knee arthroplasty: a randomized controlled trial
- enhanced reality showing long-lasting analgesia after total knee arthroplasty: prospective
- influence of structured telephone follow-up on patient compliance with rehabilitation after total knee arthroplasty
- the future of e-learning in medical education: current trend and future opportunity
- teleconferencing in medical education: a useful tool
- emergency severity index (esi): a triage tool for emergency department care, version . implementation handbook
- is larger scoliosis curve magnitude associated with increased perioperative health-care resource utilization?: a multicenter analysis of adolescent idiopathic scoliosis curves
- empirically derived maximal acceptable wait time for surgery to treat adolescent idiopathic scoliosis

key: cord- - fjr qr authors: perlman, yael; yechiali, uri title: reducing risk of infection - the covid- queueing game date: - - journal: saf sci doi: . /j.ssci. . 
sha: doc_id: cord_uid: fjr qr the covid- pandemic has forced numerous businesses such as department stores and supermarkets to limit the number of shoppers inside the store at any given time to minimize infection rates. we construct and analyze two models designed to optimize queue sizes and customer waiting times to ensure safety. in both models, customers arrive randomly at the store and, after receiving permission to enter, pass through two service phases: shopping and payment. each customer spends a random period of time shopping (first phase) and then proceeds to the payment area of the store (second phase) where cashiers are assigned to serve customers. we propose a novel approach by which to calculate the risk of a customer being infected while queueing outside the store, while shopping, and while checking out with a cashier. the risk is proportional to the second factorial moment of the number of customers occupying the space in each phase of the shopping route. we derive equilibrium strategies for a stackelberg game in which the authority acts as a leader who first chooses the maximum number of customers allowed inside the store to minimize the risk of infection. in the first model, store’ management chooses the number of cashiers to provide to minimize its operational costs and its customers’ implied waiting costs based on the number allowed in the store. in the second model, the store partitions its total space into two separate areas – one for shoppers and one for the cashiers and payers – to increase cashiers’ safety. our findings and analysis are useful and applicable for authorities and businesses alike in their efforts to protect both customers and employees while reducing associated costs. the covid- pandemic has upended nearly every type of business, but one of the more dramatic impacts early in the pandemic has been long queues of customers outside supermarkets. 
the need to reduce crowding and queueing has forced store managers worldwide to implement regulations such as "maximum shoppers at store" and "maximum number of customers in checkout areas." according to a recent report by bbc news ( ), supermarkets in the united kingdom are posting staff members at the doors who allow a maximum of shoppers inside the store at a time. israeli i news ( ) reports that all business owners in israel are required to place signs at store entrances specifying the maximum number of people allowed on the premises. as a result of these maximum headcount policies, sometimes-long queues are forming outside these businesses. evidence is growing that infection with covid- is associated with prolonged periods of exposure to infected individuals (cnn, ) . the longer one spends in a crowded environment with people who may be infected, the greater the risk of getting sick (barr, ) . therefore, greater numbers of people in stores and queues increase the likelihood of infection and make it difficult to ensure that social distancing restrictions are maintained (long et al., ) . customers waiting outside of stores can maintain adequate social distance but are still potentially exposed to infection (gupta et al., ) . it is also critical to keep workers safe during the pandemic, and greater efforts are needed to reduce the number of customers waiting at cashier areas to allow employees to maintain social distance from customers (see, e.g., government of the united kingdom ( )). thus, it is critical to control and manage queue sizes and customer waiting times to ensure the safety of customers and workers. systems for reducing the number customers in retail stores and reducing waiting times have been studied intensively in the literature using queueing theory. recently, several queueing models have been constructed to study the impact of the covid- pandemic (see, e.g., kaplan, ; alban et al., , long et al., . 
in this paper, we construct and analyze two models to optimize queueing when there is a limit on the number of customers allowed inside a store. customers arrive randomly at the store and, after receiving permission to enter, pass through two service phases: shopping and payment. each customer first spends a random period of time shopping (first phase) and then proceeds to the payment area of the store (second phase) where cashiers are assigned to serve customers. when the store is occupied by the maximum number of allowed customers, newly arriving customers wait outside in line until permitted to enter. in the first model, after completing shopping, the customer proceeds to the payment phase and either is served immediately by a free cashier or waits in a line forming in front of the cashiers. in the second model, we analyze reducing waiting time in the payment queue (and ensuring the safety of cashiers and customers) by allowing store management to set aside a separate waiting space with limited capacity adjacent to the cashiers. for each model, we derive the resulting stability condition and obtain closed-form expressions for mean queue sizes and mean waiting times. to address the issues of controlling and managing queue sizes to ensure the safety of customers and workers while reducing a store's associated costs, we employ a game-theoretic model within the queueing framework to investigate equilibrium strategies in terms of capacity and number of servers. in the game, the authority chooses a maximum number of customers allowed inside the store at a time to minimize the risk of transmission. we propose a novel measure by which to evaluate the risk of infection in this scenario. when l customers are present in the store, each customer can be infected by the other l − 1 customers. we address this safety problem by developing nonclassical multi-server queueing models involving two service phases, a limit on the number of customers in each phase, and internal blocking and delays.
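as a small illustration of this risk measure (our own sketch, not code from the paper), the quantity e[l(l − 1)] can be computed directly from an occupancy distribution:

```python
# sketch (assumed helper, not from the paper): the infection-risk measure is
# proportional to the second factorial moment e[l(l - 1)] of the number of
# customers l sharing a space.
def second_factorial_moment(pmf):
    """pmf: dict mapping occupancy level l -> p(l customers present)."""
    return sum(l * (l - 1) * p for l, p in pmf.items())

# example: the space is empty or holds two customers, each with probability
# 0.5; the risk term is 0.5 * 2 * 1 = 1.0
risk = second_factorial_moment({0: 0.5, 2: 0.5})
```

with a single customer the moment is zero, matching the intuition that a lone customer cannot be infected by others in the same space.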
we derive and interpret the corresponding stability conditions. second, we construct a measure to estimate customers' mean risk of infection. the risk is proportional to the second factorial moment of the number of customers occupying the space in each phase of the shopping route. third, within the queueing framework, we formulate a game-theoretic model to investigate equilibrium strategies in terms of capacity and number of servers. insights from our models are useful to government authorities and businesses as they determine a safe number of customers and employees while containing the associated costs of restrictions to businesses. in the m-model, customers arrive at a store such as a supermarket or department store according to a poisson process with rate λ. the number of cashiers in the store is c, each with a service duration exponentially distributed with mean 1/µ. the authorities have imposed a limit of m ≥ c on the maximum number of customers allowed inside the store at one time. a customer who arrives when there are m customers already in the store must wait outside in a line (an unlimited queue) until permitted to enter. after entering, each customer passes through two "service" phases – shopping and payment. in the first phase, the customer spends a random amount of time exponentially distributed with mean 1/ξ. upon completing shopping, the customer proceeds to the payment phase and either is served immediately by a free cashier or waits in a line forming in front of the cashiers. shopping times, payment times, and the arrival process are independent. this two-phase service process may appear to be a two-site tandem network (see, e.g., perlman and yechiali ( )), but that is not the case since the two service stages are dependent via the imposed upper limit on the total number of customers allowed inside the store in both phases.
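the m-model dynamics just described can be mimicked with a small discrete-event simulation; this is our illustrative sketch (function and parameter names are ours), not the paper's matrix-analytic solution:

```python
import heapq
import random

def simulate_m_model(lam, xi, mu, m, c, horizon, seed=0):
    """toy discrete-event simulation of the m-model: poisson(lam) arrivals,
    exp(xi) shopping, c cashiers each serving at rate mu, at most m customers
    inside; excess arrivals queue outside. returns the time-averaged number
    of customers inside the store."""
    rng = random.Random(seed)
    inside = 0          # shoppers + payers currently in the store (<= m)
    outside = 0         # customers queueing outside
    busy = 0            # busy cashiers
    pay_queue = 0       # customers waiting for a free cashier
    area_inside = 0.0   # time-integral of `inside`
    last = 0.0
    events = [(rng.expovariate(lam), 'arrival')]
    while events:
        t, kind = heapq.heappop(events)
        if t > horizon:
            break
        area_inside += inside * (t - last)
        last = t
        if kind == 'arrival':
            if inside < m:
                inside += 1
                heapq.heappush(events, (t + rng.expovariate(xi), 'shop_done'))
            else:
                outside += 1
            heapq.heappush(events, (t + rng.expovariate(lam), 'arrival'))
        elif kind == 'shop_done':
            if busy < c:                 # a cashier is free: start payment
                busy += 1
                heapq.heappush(events, (t + rng.expovariate(mu), 'pay_done'))
            else:                        # all cashiers busy: join payment line
                pay_queue += 1
        else:  # 'pay_done': a customer leaves the store
            inside -= 1
            if pay_queue:
                pay_queue -= 1
                heapq.heappush(events, (t + rng.expovariate(mu), 'pay_done'))
            else:
                busy -= 1
            if outside:                  # a freed slot admits one from outside
                outside -= 1
                inside += 1
                heapq.heappush(events, (t + rng.expovariate(xi), 'shop_done'))
    return area_inside / last if last > 0 else 0.0

mean_inside = simulate_m_model(lam=10, xi=2, mu=6, m=8, c=2, horizon=200)
```

note how the admission step only fires when a payer leaves, which is exactly the coupling between the two phases that breaks the tandem-network structure.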
in addition, in the current model, the time deterioration considered in perlman and yechiali ( ) is replaced by the increased risk of infection associated with customer crowding. let l1 denote the number of customers either in the shopping phase or waiting outside the store. note that l1 is unbounded (since we do not restrict the number of customers allowed to wait outside). let l2 denote the number of customers in the payment phase who are either being served by a cashier or are waiting in line for a free cashier. the system can be formulated as a two-dimensional quasi-birth-and-death (qbd) process whose state records the number of customers in each phase. figure presents a transition-rate diagram of the m-model system. the stationary probabilities satisfy πq = 0 and πe = 1, where q is the so-called infinitesimal generator matrix, 0 is a row vector of zeros, and e is a column vector of ones. the generator matrix q has a block-tridiagonal structure; all other elements equal zero, and the matrices a0, a1, and a2 are square matrices of order (m + 1). the matrix a = a0 + a1 + a2 defines the underlying process of the system, which is depicted in figure as a linear markov process with m + 1 states. in fact, the process defined by a can be considered as a machine repair problem in which c repairmen maintain m machines. the lifetime of each machine is exponentially distributed with parameter ξ and the repair time of a machine by a repairman is exponential with intensity μ. the state of the system is the number of unbroken machines. denote by π the stationary probability vector satisfying πa = 0 and πe = 1. the system stability condition (neuts, ) is that the mean upward drift be smaller than the mean downward drift, πa0 e < πa2 e. next, we explore the stability condition stated in the proposition and provide intuition for the results. in particular, the right-hand side of the stability condition approaches cμ as ξ approaches infinity, and the system then becomes erlang's delay queue: an m(λ)/m(μ)/c queue with arrival rate λ and c parallel servers who each serve at rate μ.
the stability condition is then λ < cμ. as μ approaches infinity, the mean total service time for each customer is 1/ξ, so the system becomes an erlang's delay queue with m parallel servers and the stability condition is λ < mξ. similarly, when m approaches infinity, the shopping phase becomes an m(λ)/m(ξ)/∞ queue with an output poisson process of rate λ; therefore, the payment phase becomes an m(λ)/m(μ)/c queue. with a single server (cashier), the right-hand side of the stability condition can be expressed in closed form in terms of the incomplete gamma function. as in neuts ( ), a rate matrix r exists such that the stationary probability vectors satisfy π(n+1) = π(n) r; r is the minimal non-negative solution of a matrix-quadratic equation and is usually calculated numerically via a successive substitution algorithm (see, e.g., harchol-balter ( )). see hanukov and yechiali ( ) for cases in which matrix r can be expressed explicitly. the balance equations together with the normalization condition then uniquely determine the boundary probability vectors. since the risk of infection increases as the number of customers in a store increases, we estimate the risk of infection as proportional to the second factorial moment e[l(l − 1)]. that is, when l customers are present in the store, each one can be infected by the other l − 1 customers. also, the risk of infection is lower in more-open areas and could depend on whether social distance can be and is maintained. therefore, in this section we calculate the combined risk of infection for customers along their waiting, shopping, and payment routes (phases). let l shop denote the number of shoppers and let l out denote the number of customers lining up outside the store, with l = l shop + l out. the risk to shoppers of being infected is considered proportional to e[l shop (l shop − 1)], and the risk to customers waiting outside the store is proportional to e[l out (l out − 1)]. four sets of auxiliary column vectors are defined to carry out these computations.
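the successive substitution scheme mentioned above can be sketched in the scalar (1 × 1) case, where the matrix-quadratic equation reduces to an ordinary quadratic; the block values below are a toy m/m/1 example of ours, not the paper's matrices:

```python
def solve_rate_scalar(a0, a1, a2, tol=1e-12, max_iter=100_000):
    """successive substitution for the minimal non-negative root r of
    a0 + r*a1 + r^2*a2 = 0 (the 1x1 case of the qbd rate-matrix equation;
    the paper's exact block convention may differ). iterates
    r <- -(a0 + r^2 * a2) / a1 starting from r = 0, which converges
    monotonically to the minimal root."""
    r = 0.0
    for _ in range(max_iter):
        r_next = -(a0 + r * r * a2) / a1
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r

# m/m/1 sanity check: arrival rate 1, service rate 2 give a0 = 1, a1 = -3,
# a2 = 2, whose minimal root is the utilization rho = 0.5
r = solve_rate_scalar(1.0, -3.0, 2.0)
```

in the matrix case the same iteration reads r ← −(a0 + r² a2) a1⁻¹ with matrix products and an inverse in place of the scalar division.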
then, when the system is stable, the mean rate of arrival at the shopping phase is λ. similarly, the mean rate of arrival at the payment phase is also λ. applying little's law, one can calculate the mean waiting times. the nm-model: maximum number of customers at the cashier area and shopping inside the store. we now model a setting in which the store sets aside a separate waiting space near the cashiers for at most n customers in the payment phase, and those customers wait in a single line in the designated area (if necessary) to be admitted by the next free cashier. as in the m-model, the authorities in this model have imposed a limit, m, on the total number of customers allowed inside the store. thus, in this setting, the store is divided into two separate areas: (i) the payment area with c parallel cashiers and waiting space of size n customers and (ii) the shopping area, in which the maximum number of customers allowed is k. thus, the maximum number of customers in the store is m = k + c + n. in this model, a customer first spends a random period of time in the shopping phase/area that is exponentially distributed with mean 1/ξ and then proceeds to the payment phase/area. when there are exactly c + n payers, this customer "orbits" in the shopping area for another exponentially distributed period of time with the same mean 1/ξ (see, e.g., avrachenkov et al., ; perel and yechiali, ). otherwise, when the number of customers in the cashier area is less than c + n, the customer enters the cashier area and becomes a payer. when at least one of the cashiers is not busy, the payer proceeds directly to a cashier and pays. as in the m-model, the payment (service) duration is exponentially distributed with mean 1/µ. all shopping, service, and orbit times are independent of each other and independent of the arrival process. let l shop denote the number of customers in the shopping area and let l out denote the number of customers lining up outside the store.
let l = l shop + l out . let l denote the number of customers in the cashier area. then, the system's state-space is .   ( , ) ( , ) : , , , ,...; , , , ,..., the transition-rate diagram for this system is depicted in figure . matrix a defines the underlying process of this system and defines a truncated erlang's model of poisson arrival with rate , c parallel exponential servers each with rate µ, and an additional k waiting buffer of size n. figure presents its transition-rate diagram. . corollary specifies the value of for some special cases. as in the m-model, the risk of customers being infected at each area is estimated as , respectively. the following proposition shows how these measures are calculated. (iv) similar to the proof of proposition . ■ again, if the system is stable, the mean rate of arrival to the shopping area is . similarly, λ the mean rate of arrival to the cashier area is also . applying little's law, the mean waiting λ time outside the store is . the mean waiting time at the shopping area is and the mean waiting time at the payer area is we next study the effect of the two models' parameters on the performance measures. let the arrival rate, service rate, and shopping rate, respectively, be customers λ , μ , and     per hour. set m = , n = , and c = so that the maximum number of customers in the shopping area is for these values, the right-hand side of the stability condition is . cashiers increases the service rate at the payment area. customers can proceed more rapidly through the payment phase, reducing the amount of time spent waiting in the designated area and reducing the number of customers who must orbit in the shopping area. the smallest value of n (the number of payers who can occupy the payment space) that sustains the stability condition is . when the value of n increases, as depicted in figure b, out e w several studies have adopted game-theoretic approaches when analyzing safety-related events (see winkler et al. 
( ) for a review). we construct and analyze a covid-19 queueing game between the authority, which aims to reduce the risk of infection while keeping customers and workers safe, and a business that wants to minimize its costs. specifically, the authority chooses the maximum number of customers allowed in the store, m. the objective of the authority is to minimize a weighted sum of the risk measures, where αi denotes the weight the authority assigns to each measure of risk. it is reasonable to assume that the authority assigns a smaller weight to queues forming outside a store; therefore, in this case, the outside-queue risk receives the smallest weighting. store management cannot exceed the limit set by the authority. the store's equilibrium strategy also depends on the values of the weights assigned by the authority to each measure of risk.

references:
- icu capacity management during the covid-19 pandemic using a stochastic process simulation
- a retrial system with two input streams and two orbit queues
- the covid-19 crisis and the need for suitable face masks for the general population
- keeping workers and customers safe during covid-19 in restaurants
- enabling and enforcing social distancing measures using smart city and its infrastructures: a covid-19 use case
- explicit solutions for continuous-time qbd processes by using relations between matrix geometric analysis and the probability generating functions method
- performance modeling and design of computer systems: queueing theory in action
- om forum – covid-19 scratch models to support local decisions
- pooling and balking: decisions on covid-19 (available at ssrn)
- the israeli queue with retrials
- on tandem stochastic networks with time-deteriorating product quality
- reporting near-miss safety events: impacts and decision-making analysis

in this paper, we construct and analyze two special nonclassical multi-server queueing models to control queueing problems generated by social distancing constraints associated with the covid-19 pandemic such as "maximum shoppers at store" and "maximum number of
customers in checkout area." in both models, a capacity constraint is imposed that limits the number of customers allowed inside the store at one time. the governing authority that imposes the limit acts as a stackelberg leader in choosing how many customers will be allowed. then, store management chooses the number of cashiers to employ to reduce its costs in terms of cashier salaries and the cost of dissatisfaction from customers waiting in line. in the second model, store management can also choose a maximum number of customers who can occupy a separate payment area in a queue formed in front of the cashiers. for each model, we derive and analyze the equilibrium strategies in terms of the store's customer capacity and the number of cashiers. our findings are useful and applicable for both government authorities establishing restrictions and for businesses.

key: cord- -vmsq hhz authors: rodriguez, jorge; acuna, juan m; uratani, joao m; paton, mauricio title: a mechanistic population balance model to evaluate the impact of interventions on infectious disease outbreaks: case for covid-19 date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: vmsq hhz infectious diseases, especially when new and highly contagious, can be devastating, producing epidemic outbreaks and pandemics. predicting the outcomes of such events in relation to possible interventions is crucial for societal and healthcare planning and forecasting of resource needs. deterministic and mechanistic models can capture the main known phenomena of epidemics while also allowing for a meaningful interpretation of results. in this work a deterministic mechanistic population balance model was developed. the model describes individuals in a population by infection stage and age group. the population is treated as a closed, well-mixed community with no migration. infection rates and clinical and epidemiological information govern the transitions between stages of the disease.
the present model provides a steppingstone to build upon, and its current low complexity retains accessibility for non-experts and policy makers to comprehend the variables and phenomena at play. the impact of specific interventions on the outbreak time course, number of cases, and outcome of fatalities was evaluated, including that of available critical care. data available from the covid-19 outbreak as of early april was used. key findings in our results indicate that (i) universal social isolation measures appear effective in reducing total fatalities only if they are strict and the number of daily social interactions is reduced to very low numbers; (ii) selective isolation of only the elderly (at higher fatality risk) appears almost as effective in reducing total fatalities but at a much lower economic damage; (iii) an increase in the number of critical care beds could save up to eight lives per extra bed in a million population with the current parameters used; (iv) the use of protective equipment (ppe) appears effective to dramatically reduce total fatalities when implemented extensively and in a high degree; (v) infection recognition through random testing of the population, accompanied by subsequent (self) isolation of infected aware individuals, can dramatically reduce the total fatalities but only if conducted extensively to almost the entire population and sustained over time; (vi) ending isolation measures while r values remain above . (with a safety factor) renders the isolation measures useless and total fatality numbers return to values as if nothing was ever done; (vii) ending the isolation measures for only the population under y/o at r values still above .
increases total fatalities but only around half as much as if isolation ends for everyone; (viii) a threshold value, equivalent to that for r , appears to exist for the daily fatality rate at which to end isolation measures; this is significant as the fatality rate is (unlike r ) very accurately known. any interpretation of these results for the covid-19 outbreak predictions and interventions should be considered only qualitatively at this stage due to the low confidence (lack of complete and valid data) in the parameter values available at the time of writing. any quantitative interpretation of the results must be accompanied by a critical discussion in terms of the model limitations and its frame of application.

understanding the potential spread of diseases using mathematical modelling approaches has a long history. deterministic epidemic models published in the early 20th century already demonstrated the importance of understanding the population-based dynamics as well as potential parameters of interest therein (kermack & mckendrick, ). numerous modelling approaches are available for the prediction of propagation of infectious diseases (may & anderson, ; capasso & wilson, ; hethcote, ; mccallum et al., ; ruan & wang, ; li et al., ; keeling & eames, ; grassly & fraser, ; keeling & rohani, ; balcan et al., ; britton, ; funk et al., ; gray et al., ; brauer et al., ; miller et al., ; siettos & russo, ; pastor-satorras et al., ). their outputs inform studies on health projections and play an important role in shaping policies related to public health (murray and lopez, a, b, c, and d; ferguson et al., ).
data availability has greatly increased in recent years, which has led to direct improvements in epidemiological models (colizza et al., ; riley, ; siettos & russo, ). these models provided a more comprehensive understanding of recent outbreaks of diseases such as ebola (gomes et al., ; who ebola response team, ) and zika (zhang et al., ). however, all modelling efforts are highly dependent on several elements: a comprehensive algorithm of clinical and public health true options and stages of events; the probability of such options given certain conditions of the system; identification of parameters that reflect such events and their probabilities (such as mortality by age, infectiousness by contacts, etc.); assumptions for parameters with insufficient data; and valid data for those parameters that allow the calibration and posterior validation of the forecasts (tizzoni et al., ). in viral pandemics in particular, one of those parameters, the direct estimation of infected subpopulation fractions, is not feasible using available epidemiological data (unless universal, highly sensitive testing is used, which is rarely possible to implement in these situations), particularly if very mild cases, asymptomatic infections or pre-symptomatic transmission are observed or expected. this was the case in the previous influenza a (h1n1) pandemic and it is the observation for the covid-19 pandemic (russel et al., ). thus, in many cases, modelling uses a combination of the best available data from historical events and datasets, parameter estimation and assumptions. then, data about these parameters are computed with statistical tools for the development of epidemic models (cooper et al., ; biggerstaff et al., ). the most challenging phase for the understanding of the potential spread of a disease is when novel disease outbreaks emerge in global populations (anderson & may, ), in which data availability is limited (e.g.
novelty of pathogen; delay of communication of case datasets from public health workers and facilities to researchers) or biased by external factors (e.g., limited availability of testing capacity; undefined or partially defined diagnostics for disease). with novel disease-specific epidemic models, the development of models with a sufficiently low level of complexity and meaningful parameters, which can be identified with data as the infection progresses and data become more available, is posited as a potential tool to inform public health policy and impact mitigation strategies (berezovskaya et al., ; hall et al., ; bettencourt et al., ; nishiura, ; wang & zhao, ; lee et al., ; nsoesie et al., ; chowell et al., ; rivers et al., ). the covid-19 outbreak and posterior pandemic has brought unprecedented attention to the limitations of these kinds of modelling approaches, with multiple epidemic models and disease spread forecasts being published as more data becomes available. these models have evaluated the ongoing course of the disease spread evolution, from the earlier dynamics of transmission from initial cases, to the potential of non-pharmaceutical interventions to limit the disease spread, such as: international travel restrictions (chinazzi et al., ), contact tracing and isolation of infected individuals at onset (hellewell et al., ), and different scales of social distancing and isolation (flaxman et al., ; prem et al., ). other statistical models tried to estimate fundamental characteristics (i.e. potential model parameters) of the disease, such as the incubation period and basic reproduction number, r0, as well as to assess short-term forecasts. given the inherent uncertainty associated with most of the parameters used, a stochastic approach is employed in the above models.
effective communication between health care and public health systems and science hubs is considered one of the biggest challenges in both health sciences and public health (zarcadoolas, ; squiers et al., ). in health care it is necessary not only to take effective measures but also to do so in a timely manner. this requires strategies for data sharing, generation of information and knowledge, and timely dissemination of such knowledge for effective implementation. the development of strategies for interaction under the general, and correct, assumption of low-literacy health communication paradigms is especially relevant (plimpton & root, ), and we have good evidence that health illiteracy greatly influences health behaviours that, in turn, are likely to play a role in determining the degree of effectiveness of such interventions. given the complexity and the expected short- and long-lasting impacts that these public health interventions should have when dealing with disease outbreaks and pandemics (reluga, ; fenichel et al., ), sufficiently complex but user-accessible modelling tools should provide researchers, public health authorities, and the general public with useful information to act in moments of clear and wide uncertainty. in order to work properly they require access to up-to-date data, in this case on the covid-19 spread (dong et al., ). additionally, simple and interactive models can contribute to the understanding by broader audiences of what to expect on the propagation of infectious diseases and how specific interventions may help. this increased awareness of the disease behaviour and potential course in time by the public and policy makers can directly and positively impact the outcome of epidemic outbreaks (funk et al., ). population balance models are widely used in disciplines such as chemical engineering to describe the evolution of a population of particles (henze et al., ; ramkrishna and singh, ; yang, ; gonzález-peñas et al., ).
these types of models describe the variation over time of so-called state variables as functions of state transition equations governed by transport processes, chemical reactions or any type of change rate from one state to another. such models allow for the description of the underlying processes in a mechanistic manner, maintaining therefore a direct interpretation of the model behaviour. if the state transition rates are defined in a mechanistic manner and with meaningful parameters, such models can describe a process in a way that is interpretable into reality, and they open the possibility not only of prediction but also of hypothesis generation when data deviate from model predictions. the present work attempts to provide a deterministic population balance-based model with a minimal, but clinically and public-health robust, set of mechanistic and interpretable parameters and variables. the model aims at improving the understanding of the major phenomena involved and of the impacts on the system's resources and needs of several possible interventions. the model's level of complexity is targeted such that it retains mechanistic meaning of all variables and parameters, captures the major phenomena at play, and specifically allows for accessibility of non-experts and policy makers to comprehend the variables at play. in this way expert advice and decision making can be brought closer together to help guide interventions for immediate and longer-term needs. the model presented is based on balances of individuals transitioning between infection stages and segregated by age group. all individuals are placed in a common single domain or closed community (e.g. a well-mixed city or town); no geographical clustering nor separation of any type is considered, and neither is any form of migration in or out of the community. big cities with ample use of public transportation are thought to be the settings best described by the model.
the model also provides a direct estimation of the r (reproduction number or reproductive rate) (delamater et al., ) under different circumstances of individual characteristics (such as personal protection or awareness) as well as under population-based interventions (such as imposed social isolation). r is a dynamic number, often quoted erroneously as a constant for a specific microorganism or disease. the ability to estimate the r for different times of the outbreak (given the interventions), outbreak settings and interventions is considered to be a valuable model characteristic. r is predicted to change over time with interventions that do not produce immune subjects (such as isolation or use of personal protection equipment (ppe), as opposed to vaccination). however, in many instances over the course of an outbreak, r is consistently estimated as a constant, frequently overestimating and not allowing correct estimations of the course of events. the model solves the dynamic variables or states below. every individual belongs, in addition to their age group (which she/he never leaves), to only one of the possible states that correspond to stages of the infection, namely: healthy non-susceptible (hn); healthy susceptible (h); pre-symptomatic (ps); symptomatic (s); hospitalised (sh); critical (sc) (with and without available intensive care); deceased (d); and recovered immune (r). definitions of the model states are shown in table . each variable is a state vector with the number of individuals in that stage per age group. age groups are defined per decade, from 0-9 until 80+ year olds. nine age groups are defined in the model; each state is therefore a vector of dimensions 9 × 1, and the total set of states is a matrix of dimensions 9 × 8. the transitions between these states are governed by rates of infection and transition as defined in table . note that vector variables and parameters are represented in bold font and scalar ones in regular font.
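the state bookkeeping described above can be set up as follows (a minimal sketch with invented seed numbers; the paper's tables define the actual states and values):

```python
# one row per age group, one column per infection stage (names from the text)
STAGES = ['hn', 'h', 'ps', 's', 'sh', 'sc', 'd', 'r']
AGE_GROUPS = ['0-9', '10-19', '20-29', '30-39', '40-49',
              '50-59', '60-69', '70-79', '80+']

X = [[0.0] * len(STAGES) for _ in AGE_GROUPS]   # the 9 x 8 state matrix
for row in X:
    row[STAGES.index('h')] = 100_000.0          # illustrative: 100k susceptible per group
X[4][STAGES.index('ps')] = 10.0                 # seed ten pre-symptomatic 40-49 y/o

total_population = sum(sum(row) for row in X)
```

keeping the whole population in one matrix makes the balance property easy to check: every transition moves individuals between columns of the same row, so the row sums (and the grand total, minus nothing since d is a state) are conserved.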
a schematic representation of the modelling approach, with the population groups considered for the infection stages, the rates of infection and transition between groups, and the possible interactions between population groups, is shown in figure . two main interventions are described in the model that are currently being used to slow the spread of the covid-19 disease outbreak: (i) the degree of social isolation of the individuals in the population, in terms of the average number of random interactions individuals have per day with others that are also interacting, and (ii) the level of personal protection and awareness that individuals have to protect themselves and others against contagion or spread during interactions. these interventions can be stratified by age group. table describes the key parameters that define the interventions. the degree of isolation is described by a parameter (nih) (a vector per age group) corresponding to a representative average number of daily interactions that healthy susceptible individuals have with others. different nih values can be assigned per age group to describe the impact of diverse isolation strategies selective to age group, such as e.g. selective isolation of the elderly and/or young. the level of use of ppe and awareness is described by the parameters (lpah) for healthy and (lpaps and lpas) for infectious individuals (all vectors per age group). values of the lpa parameters can vary between 0 and 1, with 1 corresponding to the use of complete protective measures and zero to the most reckless opposite situation. an additional reduction factor is defined for the decreased social interactivity of infectious individuals, both for symptomatic (rfis), to describe e.g. self- or imposed isolation of s individuals, and for pre-symptomatic (rfips), to describe e.g. awareness of their infection if extensive random testing of the population is implemented.
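a toy sketch of how these intervention parameters could enter the transmission terms; the multiplicative combination below is our assumption for illustration, and the numeric values are invented (the paper's equations define the actual form):

```python
def effective_daily_contacts(nih, rfi=1.0):
    """daily infectious contacts after isolation: nih is the average number
    of daily interactions, rfi an optional reduction factor applied to
    infectious individuals' interactivity (rfis / rfips in the text)."""
    return nih * rfi

def per_contact_infection_prob(p_base, lpa_h, lpa_inf):
    """per-interaction infection probability scaled down by the protection
    and awareness levels (lpa in [0, 1]; 1 = complete protection) of the
    healthy (lpa_h) and infectious (lpa_inf) parties. the multiplicative
    combination of the two lpa terms is assumed here."""
    return p_base * (1.0 - lpa_h) * (1.0 - lpa_inf)

# example: a baseline 10% per-contact probability with both sides at lpa = 0.5
p = per_contact_infection_prob(0.10, 0.5, 0.5)   # 0.025
```

under this sketch, halving contacts and halving both protection gaps reduces transmission much faster than either intervention alone, which is the qualitative point the text makes about combining isolation and ppe.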
the infection of healthy susceptible individuals (h) is modelled as occurring only through interaction between them and other infected individuals, either pre-symptomatic (ps) or symptomatic (s). infected hospitalised (sh) and critical (sc) individuals are assumed not to be available for infectious interactions, and neither are the deceased (d). two rates of infection of healthy susceptible individuals (in number of infections per day) are defined, one for each of the two possible infecting groups (ps and s). the rates of infection are vectors per age group, computed as the product of (i) the fraction of interactions with ps (or s) individuals among the total interactions (fips or fis), (ii) the probability of contagion in an interaction with a ps (or s) individual (pi_ps or pi_s) (per age group), (iii) the average number of daily interactions that h individuals have (nih) and (iv) the number of h individuals themselves (per age group) (see eqs .a-b). note that point operators between vectors indicate an element-by-element operation. the probabilities of infection per interaction are calculated as per eqs .c-d. the average rates of transition between states are defined such that available epidemiological and clinical data can be used, such as the proportion of individuals that transition or recover (see table ) and the average times reported at each stage before transition or recovery (see table ). table . epidemiological parameters (all vectors per age group).
fhn_t: fraction of population non-susceptible to infection (#hs/#h; calculated, not an input parameter)
fs_ps: fraction of ps that will become s (1 - fr_ps) (#s/#ps)
fsh_s: fraction of s that will become sh (1 - fr_s) (#sh/#s)
fsc_sh: fraction of sh that will become sc (1 - fr_sh) (#sc/#sh)
fd_sc: fraction of cared sc that will die into d (1 - fr_sc) (#d/#scic)
fr_ps: fraction of ps that will recover into r (#r/#ps)
fr_s: fraction of s that will recover into r (#r/#s)
fr_sh: fraction of sh that will recover into r (#r/#sh)
fr_sc: fraction of sc with critical care that will recover into r (#r/#scic)
the rates of individuals transitioning between stages (in number of individuals per day) are described in eqs .a-e. all rates are vectors per age group. in order to describe the possible shortage of critical care resources, critical individuals are distributed between those with available intensive care (nsc_ic) and those without (nsc_nc). at each simulation time step, nsc_ic and nsc_nc are computed via an allocation function of critical care resources over the total nsc per age group. the function allocates resources with priority to the lower age groups until the maximum number of intensive care units is reached. all critical individuals with no available intensive care (nsc_nc) are assumed to become deceased after td_nc. the rate of transition from critical to deceased is therefore the sum of that of those with available care (rd_scic) plus that of those without (rd_scnc), as per eqs .e-g. the rates of individuals recovering from the different infected stages (in number of individuals per day) are described in eqs .a-d (all rates vectors per age group). the state transitions governed by these rates are represented in matrix form in figure a .
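the infection-rate products (eqs .a-b) and the critical-care allocation described above can be sketched as follows; all names, the greedy allocation loop and the toy numbers are assumed illustrations consistent with the text, not the paper's code:

```python
import numpy as np

# eqs .a-b as described in the text: infections/day from ps and s contacts,
# computed element-wise over per-age-group vectors.
def infection_rates(n_h, nih, fips, fis, pi_ps, pi_s):
    r_inf_ps = fips * pi_ps * nih * n_h  # infections/day caused by ps contacts
    r_inf_s = fis * pi_s * nih * n_h     # infections/day caused by s contacts
    return r_inf_ps, r_inf_s

# allocation of intensive care beds with priority to the lower age groups,
# as described for nsc_ic / nsc_nc.
def allocate_icu(n_sc, icu_beds):
    n_sc = np.asarray(n_sc, dtype=float)
    nsc_ic = np.zeros_like(n_sc)
    remaining = float(icu_beds)
    for a in range(len(n_sc)):  # age groups ordered young to old
        take = min(n_sc[a], remaining)
        nsc_ic[a] = take
        remaining -= take
    nsc_nc = n_sc - nsc_ic  # assumed to die after td_nc without care
    return nsc_ic, nsc_nc

# toy example with two age groups
r_ps, r_s = infection_rates(
    n_h=np.array([1000.0, 2000.0]), nih=np.array([10.0, 5.0]),
    fips=np.array([0.01, 0.02]), fis=np.array([0.005, 0.01]),
    pi_ps=np.array([0.1, 0.1]), pi_s=np.array([0.2, 0.2]))
ic, nc = allocate_icu([10, 20, 30], icu_beds=25)
```

with 25 beds and 10, 20 and 30 critical cases in ascending age groups, the youngest group is fully served, the middle group partially, and the oldest not at all, which is exactly the priority rule stated in the text.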
since the model produces instantaneous values of its outputs over time based on the parameters used, the simulated reproduction number r0 must be considered an instantaneous estimation (delamater et al.). the model structure allows several parameters to influence r0, including the duration of the infectious stages as known so far for the virus; the potential infection of others by those infected; the probabilities of infection per social interaction; and other parameters such as social isolation and the use of ppe. elements such as the number of recovered immune individuals should not directly affect r0, as the reproduction number refers only to the potential infection of susceptible individuals by infected individuals. the dynamic reproduction number r0 during the outbreak (delamater et al.) is computed over time from the model state variables according to eq. . under this approach, infectious individuals can only infect others while they are in the pre-symptomatic (ps) and symptomatic (s) stages. although it is known that post-symptomatic recovered individuals may be infectious for some period of time, this has not been considered in the model at this time due to lack of data. hospitalised and critical individuals are assumed to be well isolated and not able to infect others. the provided dynamic output of the reproduction number can be used to guide and interpret the impact of interventions in terms of r0. modelled infected individuals can take only three possible infectious paths, namely: (i) ps → r; (ii) ps → s → r; and (iii) ps → s → sh. these paths are made of combinations of four possible infectious stage intervals in which infected individuals spend time and infect at their corresponding rate (see table ). the model presented is deterministic and based on population balances of individuals classified by their stage of infection and age group only; no other differentiation within those groups is captured by this version of the model.
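the paper's eq. for the dynamic reproduction number is not reproduced in the extracted text; the sketch below is one simple proxy consistent with the description (only ps and s individuals infect, over the durations of those two stages), and both the formula and all names are assumptions rather than the paper's equation:

```python
def instantaneous_r(new_infections_per_day, n_ps, n_s, t_ps, t_s):
    """Proxy for an instantaneous reproduction number: the current
    per-capita infection rate of the infectious pool times an average
    infectious duration weighted by the ps and s pool sizes."""
    infectious = n_ps + n_s
    if infectious == 0:
        return 0.0
    avg_duration = (n_ps * t_ps + n_s * t_s) / infectious
    return (new_infections_per_day / infectious) * avg_duration
```

any such estimate falls as interventions reduce the infection rate, which is the qualitative behaviour the scenarios below rely on.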
this characteristic makes the model applicable to single, densely populated clusters. the model has low complexity and requires a small number of mechanistically meaningful parameters, most of which can be directly estimated from epidemiological and clinical data. the model, however, carries limitations in its predictive capabilities due to the fact that all variables and parameters refer to representative averages for each stage and age group population. this may limit the model's representation of the non-linear interactions in the real system, and therefore, at this stage, any interpretation of results for prediction purposes should be critically discussed against these limitations. a case study based on a scenario of propagation of the covid-19 pandemic, using data available as of april 2020, is presented below. the results obtained are intended to be interpreted qualitatively and to be contextualised to the specific setting characteristics. they are intended to serve as a demonstration of the model's potential if applied with higher-confidence parameter values. a number of selected scenarios aimed at illustrating the impact of different interventions were simulated. conclusions should be taken qualitatively at this stage given the low confidence in some parameter values. default reference epidemiological and clinical parameter values were obtained from different information sources on the covid-19 outbreak as available in early april 2020. details of values and sources are provided in the appendix tables a -a , with an indication of the level of confidence. a population with an age distribution matching that of the region of madrid (spain) was used (ine spain). default reference intervention parameters were selected arbitrarily for a situation assimilated to that previous to the outbreak and without any specific intervention (see values and rationale in appendix table a ).
the dynamic simulation results of the default outbreak scenario with no intervention are shown in appendix figure a . all scenarios are simulated for a fixed number of days and evaluated in terms of (i) the final total number of fatalities at outbreak termination and (ii) the final number of fatalities per age group. in addition, the scenarios are also presented in terms of dynamic profiles over time for (iii) the number of active cases; (iv) the reproduction number; (v) the number of critical cases; and (vi) the number of fatalities. in this scenario, the impact of different imposed degrees of universal social isolation was evaluated. the parameter that describes this intervention is the average number of daily social interactions that healthy susceptible individuals have (nih). as indicated above, evidence suggests that during viral infections that behave like covid-19, the number of personal contacts increases the likelihood of infection linearly. in this scenario, the isolation measures are applied equally across all age groups, with the same nih values applied to all. figure illustrates the model predictions for this scenario, in terms of the output variables indicated and in the absence of any other interventions. as can be observed in figure (top left), the overall risk of dying from the virus increases as the average number of daily social interactions (nih) increases. however, it seems to plateau beyond a certain number of interactions per day, suggesting that a critical value of nih may exist for the intervention to succeed at lowering the final number of deaths. once age is placed in the equation, mortality behaves similarly only for the older age groups (figure (top right)). interestingly, nih does not appear to significantly modify mortality beyond a single interaction per day.
this suggests that, for the younger age groups, interactions would have to be reduced to complete social distancing and isolation in order to decrease mortality, and that, based only on social interactions, most of the mortality reduction achieved by partial social isolation will be among the older age groups. the number of fatalities, as well as the speed at which the saturation in fatalities occurs, appears clearly and directly related to social isolation (figure (bottom right)). the model is capable of capturing this thanks to its description of the saturation of healthcare capacity and the withdrawal of critical care over capacity. the middle and bottom graphs in figure show the impact of nih on the time course of several variables. figure (middle left) supports the now globally popular "flatten the curve" concept: if interactions are not modified, the number of cases grows rapidly, exponentially and explosively. the impact of imposed social isolation selective to the elderly age groups is evaluated in this scenario. the parameter that describes this intervention is the average number of daily social interactions with other people (nih) that healthy susceptible individuals within the elderly age groups have. in this scenario, the isolation measures are applied selectively only to the elderly. figure illustrates the model predictions for this scenario, in terms of the output variables indicated, in the absence of any other interventions. as shown in figure (top left), the selective social isolation of the elderly has a potentially very significant impact on final total fatalities, at a level almost comparable to the previous scenario of universal isolation. this is a result with potentially significant consequences, as it indicates that a sustained isolation selective only to the elderly, and not to the other age groups, could alleviate the economic damage at the cost of only a small increase in total fatalities.
the decrease in social interactions in schools and colleges through isolation of the young may, however, have an impact on the overall multiplier of infections from youngsters to adults. the impact of selective imposed social isolation of the younger age groups is evaluated in this scenario. the parameter that describes this intervention is the average number of daily social interactions with other people (nih) that healthy susceptible individuals of the younger age groups have. in this scenario, isolation measures are applied only to the youngsters. figure shows the results for this scenario, for the output variables indicated, in the absence of other interventions. the young population has been observed to be quite resistant to the disease; theoretically at least, young, unaffected lungs tolerate and defend better against the viral load. the isolation of the young produces no effect on the overall final fatality rate, but produces a moderate impact on the mortality of the elderly at low values of nih. as can be seen in figure , social isolation of the young has little impact, producing almost identical curves for all levels of social isolation. it is thought, however, that the decrease in social interactions in schools and colleges through isolation of the young may have a large impact on the overall multiplier of infections from youngsters to adults. this emergent aspect of the disease spread behaviour and containment efforts is captured in our results, even though the present model does not incorporate geographical features and does not explicitly describe location-specific population interactions (such as the synthetic location-specific contact patterns in prem et al.). the impact of selective imposed social isolation of both the younger and the elderly age groups is evaluated in this scenario.
the parameter that describes this intervention is the average number of daily social interactions with other people (nih) that healthy susceptible individuals of the younger and the elderly age groups have. in this scenario, the isolation measures are applied selectively only to the youngsters and the elderly. figure illustrates the model predictions for this scenario, in terms of the output variables indicated, in the absence of any other interventions. many of the early interventions during the covid-19 outbreak started by protecting the elderly and isolating the young (no schools, colleges or universities for students), decreasing the number of interactions of the two subpopulations substantially. isolating these population groups together yields results similar to isolating the elderly alone, with no significant added value from isolating the young in addition to the elderly, as shown in figure . the impact of the availability of intensive care beds is evaluated in this scenario. the parameter that describes this intervention is the number of available intensive care beds per million population. figure illustrates the model predictions for this scenario, in terms of the output variables indicated, in the absence of any other interventions. figure (top left) shows the enormous impact that an increase in critical care resources can have in decreasing total fatalities: the higher the availability of critical beds, the lower the fatality rate. the trend applies until there is no shortage of ic beds and the remaining fatalities are the unavoidable ones. this intervention avoids those deaths that are preventable by the availability of ventilators (mainly) and critical care support. with the current parameter values, in a population of one million it appears that a significant number of lives could be saved per additional intensive care bed. the impact of increased use of ppe and behavioural awareness is evaluated in this scenario.
the parameter that describes this intervention is a factor increasing the default values (see table a ) of the lpa parameters of the healthy and infected population groups (lpah, lpaps and lpas). increases in these parameters decrease the probability of infection per interaction (see eqs. ) and subsequently the rates of infection (eqs. ). figure illustrates the model predictions for this scenario, in terms of the output variables indicated, in the absence of any other interventions (parameter values in table a ). as shown in figure , the extensive use of ppe appears to have a potentially major impact on total outbreak fatalities at the highest levels of protection. there is an inverse relationship between the level of protection and the overall fatality of the disease. the peak number of cases is reached earlier and is higher at low levels of personal protection; the infectability and r0 follow the same pattern. the peak number of critical cases is also decreased and delayed in time. the impact of increasing the number of tests applied to the whole population is evaluated in this scenario. widespread testing will also reach both infected pre-symptomatic and symptomatic individuals. the parameter that describes this intervention is a reduction factor in social interactions due to knowledge of infection by ps and s individuals (rfips and rfis). reduced values of rfips and rfis decrease the fraction of interactions with ps and s individuals among the total (see eqs .a-b) and therefore the rates of infection by these two groups (eqs .a-b). the impact of applying an isolation reduction factor to the default rfi values is evaluated. figure illustrates the model predictions for this scenario, in terms of the output variables indicated, in the absence of any other interventions.
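the direction of the effect described above (higher lpa, lower probability of infection per interaction) can be illustrated with a simple multiplicative form; this is an assumed stand-in for the paper's eqs., not a reproduction of them:

```python
def contagion_probability(pi_base, lpa_susceptible, lpa_infectious):
    """Assumed multiplicative form: protection on either side of an
    interaction (lpa in [0, 1], 1 = complete protection) scales down a
    base contagion probability."""
    return pi_base * (1.0 - lpa_susceptible) * (1.0 - lpa_infectious)
```

under this form, full protection on either side drives the probability to zero, and intermediate levels combine multiplicatively, which matches the inverse relationship between protection level and fatality reported in this scenario.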
the increased awareness of the infected achieved by testing has a great impact, and it is a great differentiator between subgroups (from no awareness to high awareness) in the number of cases, critical cases and total number of fatalities over time. the peaks are significantly decreased by awareness. it is worth noting how the number of fatalities can be brought almost to zero by complete awareness of infection and subsequent isolation. universal testing and isolation, if possible, could be one of the great modifiers of the outcome of the outbreak. the static interventions above were evaluated in terms of a sustained action on a parameter at different levels and its impact on the outbreak outputs. in outbreaks, aside from the immediate management of needs and resources, the time to return to normal becomes of great concern. in this second section, dynamic interventions are evaluated, specifically in terms of ending social isolation measures once different threshold values of r0 (ever-changing due to interventions to manage infectability) or of the fatality rate are reached. the model's dynamic calculation of r0 allows for the evaluation of the use of this variable as a criterion for the relaxation (or application) of interventions. these dynamic scenarios are considered of potential interest, as governments and local authorities must evaluate and decide when to relax the social distancing and isolation mitigation measures, whether it can be done totally or gradually by subgroups, and what potential impact ending social isolation will have on the further behaviour of the disease spread. the impact of ending social isolation upon reaching different threshold values of r0 (as a function of all interventions to decrease infectibility) is evaluated in this scenario.
this intervention is implemented by starting with initial social isolation in place, i.e. a reduced value of the average number of daily social interactions (nih), and returning it to its default "do nothing" value once the threshold r0 value is reached. figure shows the model predictions for this scenario, in terms of the output variables indicated. the results in figure (top left) clearly indicate that a withdrawal of isolation measures while r0 values remain above one will lead to little impact of the isolation on the total fatalities. it is also observed that, when isolation is ended even at low threshold r0 values, increases in the crude number of new fatalities and a peak in critical cases occur after a period of time. these are always accompanied by a sudden spike in r0 for a short period before its collapse. a complete end of isolation may prove not to be the best course of action until r0 has reached levels much lower than one. the impact of ending social isolation for all except the elderly, upon reaching different threshold values of r0, is evaluated in this scenario. this intervention is implemented by starting with initial social isolation in place, with a reduced value of the average number of daily social interactions (nih), and, once the given threshold r0 value is reached, returning it to its default "do nothing" value for all age groups except the elderly. figure shows the model predictions for this scenario, in terms of the output variables indicated. the results in figure (top left) show again that an impact on fatalities will occur if isolation ends at values of r0 above one. the impact of ending social isolation at any r0 value is in this case smaller, as the elderly remain isolated; this is in line with the results obtained in the earlier scenario of selective isolation of the elderly shown in figure . the decrease in total fatalities among the elderly observed when isolation ends at increasing threshold r0 values above one is somewhat unexpected.
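the trigger logic of these r0-threshold scenarios can be sketched as a small policy function; the names and the one-way release (isolation is never re-imposed once ended) are assumptions consistent with the scenario description:

```python
def nih_policy(r_current, r_threshold, nih_isolated, nih_default, released):
    """Return the nih value to use this time step and the updated release
    flag: isolation ends permanently once r drops below the threshold."""
    if released or r_current < r_threshold:
        return nih_default, True
    return nih_isolated, False
```

calling this at every simulation step keeps nih at its isolated value until the estimated r0 first falls below the chosen threshold, after which the default "do nothing" value applies for the rest of the run.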
the impact of ending social isolation upon reaching different threshold values of the daily fatality rate (after it has passed its maximum) is evaluated in this scenario. the daily fatality rate is selected instead of e.g. the number of cases because it can be assessed much more exactly (it is incontrovertible, as opposed to the number of cases). the decrease in the fatality rate is usually reached after the decrease in the number of cases ("over the peak"), as shown by most epidemiological curves for covid-19 published so far. this intervention is implemented by starting with initial social isolation in place, with a reduced value of the average number of daily social interactions (nih), and returning it to its default "do nothing" value when, after the rate has passed its maximum, the given threshold value of the rate is reached. figure shows the results for this scenario for the output variables indicated. figure : impact of ending social isolation (nih back to its default) once the fatality rate, after surpassing its maximum, reaches different threshold values, on the final total number of fatalities (top left); the final total number of fatalities per age group (top right); as well as the time course profiles of the total active cases (middle left); the reproduction number r0 (middle right); the number of critical cases (bottom left); and the number of fatalities (bottom right). numbers are percentages of the total population. in this scenario, all social isolation is ended once the fatality rate reaches a threshold after it has started declining. as shown in figure , there appears to be a very narrow threshold below which the isolation measures can be withdrawn with low impact on total fatality. if measures are ended just before the threshold is reached, the overall fatality rate and the fatality rate for elders rise sharply. for values below that threshold, a further decrease in total fatalities can still be obtained if lower fatality-rate thresholds are used to end isolation.
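analogously, the fatality-rate trigger can be sketched; the peak detection below (current value strictly below the running maximum and at or below the threshold) is an assumed simplification of the "after it has passed its maximum" condition:

```python
def isolation_can_end(daily_fatality_rates, threshold):
    """True once the daily fatality rate has passed its maximum and has
    fallen back to or below the given threshold."""
    if len(daily_fatality_rates) < 2:
        return False
    current = daily_fatality_rates[-1]
    return current < max(daily_fatality_rates) and current <= threshold
```

with a noisy real series, a smoothed rate or a minimum number of declining days would be needed before declaring the peak passed; the sketch keeps only the two conditions stated in the text.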
the impact of ending social isolation for all except the elderly, upon reaching different threshold values of the fatality rate after it has passed its maximum, is evaluated in this scenario. this intervention is implemented by starting with initial social isolation in place, with a reduced value of the average number of daily social interactions (nih), and returning it to its default "do nothing" value, for all age groups except the elderly, once the given threshold value of the rate is reached after the fatality rate has passed its maximum. figure shows the model predictions for this scenario, in terms of the output variables indicated. in this scenario, all social isolation except that of the elderly ends once the fatality rate reaches a threshold while already declining. as shown in figure , and analogous to the previous scenario, there appears to be a very narrow threshold below which the isolation measures can be withdrawn with low impact on total fatality. also, if measures are ended just before the threshold is reached, the overall fatality rate similarly rises sharply. for values below that threshold, however, no further decrease in total fatalities is predicted here when lower fatality-rate thresholds are used to end isolation. these results are also consistent with the idea that isolation of the age groups more vulnerable to the disease should be maintained. the model requires parameter calibration against valid data from representative populated cities. data from cities in which the population is typically very interconnected socially in public areas and public transport is widely used are particularly suited for the calibration of this model. the model in its current version would benefit from more detailed descriptions and sub-models of some of the intervention-relevant parameters, such as the levels of social interaction and personal protection measures.
the model's modularity and fast computation allow for easy scale-up into multiple population nuclei that could be simulated in parallel with degrees of interconnectivity among them. separate independent copies of the model can be run in parallel, e.g. one for each city in a region or country, and migration terms can be added between cities. interventions can then be defined to include e.g. travel restrictions between those cities at different levels. the mechanistic nature of the model also makes it very suitable for the evaluation of advanced optimisation and optimum control strategies. its capacity for describing complex interactions also makes it of potentially great use for developing advanced artificial intelligence (ai) algorithms to aid and provide advice to authorities during decision making. ai algorithms could be trained by evaluating very large numbers of scenarios combining static and dynamic interventions of different types against total fatalities and economic damage.

rationale for the default lpa and rfi values (appendix table a ): no reduction factor (rfips = 1) of their social interactivity with respect to healthy individuals is applied to pre-symptomatic infected individuals, as they are ignorant of their condition; symptomatic infected individuals are expected to reduce their social interactivity with respect to healthy ones as they feel sick (rfis < 1); the default level of personal protection and awareness (lpa) in children and youngsters is taken as smaller than that of adults; adult symptomatic individuals are expected to adopt a higher level of personal protection and awareness (lpas) so as not to spread any general disease to others, irrespective of the knowledge of their specific condition.

references (titles as recovered from the extracted text):
- infectious diseases of humans: dynamics and control
- modeling the spatial spread of infectious diseases: the global epidemic and mobility computational model
- a day-by-day breakdown of coronavirus symptoms shows how the disease, covid-19, goes from bad to worse. business insider
- a simple epidemic model with surprising dynamics
- real time bayesian estimation of the epidemic potential of emerging infectious diseases
- estimates of the reproduction number for seasonal, pandemic, and zoonotic influenza: a systematic review of the literature
- mathematical models in population biology and epidemiology
- stochastic epidemic models: a survey
- analysis of a reaction-diffusion system modelling man-environment-man epidemics
- the effect of travel restrictions on the spread of the novel coronavirus
- real-time forecasting of epidemic trajectories using computational dynamic ensembles
- mathematical models to characterize early epidemic growth: a review
- the role of the airline transportation network in the prediction and predictability of global epidemics
- delaying the international spread of pandemic influenza
- complexity of the basic reproduction number (r0)
- an interactive web-based dashboard to track covid-19 in real time. the lancet infectious diseases
- adaptive human behavior in epidemiological models
- impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand
- strategies for mitigating an influenza pandemic
- estimating the number of infections and the impact of non-pharmaceutical interventions on covid-19 in european countries
- a microbial population dynamics model for the acetone-butanol-ethanol fermentation process. authorea
- mathematical models of infectious disease transmission
- a stochastic differential equation sis epidemic model
- real-time epidemic forecasting for pandemic influenza
- feasibility of controlling covid-19 outbreaks by isolation of cases and contacts. the lancet global health
- activated sludge models asm1, asm2, asm2d and asm3
- the mathematics of infectious diseases
- clinical characteristics of asymptomatic infections with covid-19 screened among close contacts in nanjing
- networks and epidemic models
- modeling infectious diseases in humans and animals
- containing papers of a mathematical and physical character
- early dynamics of transmission and control of covid-19: a mathematical modelling study
- the incubation period of coronavirus disease (covid-19) from publicly reported confirmed cases: estimation and application
- modelling during an emergency: the h1n1 influenza pandemic
- epidemiological models for mutating pathogens
- the reproduction number of covid-19 is higher compared to sars coronavirus
- population biology of infectious diseases: part ii
- how should pathogen transmission be modelled
- edge-based compartmental modelling for infectious disease spread
- actualización: enfermedad por sars-cov-2 [update: sars-cov-2 disease]
- mortality by cause for eight regions of the world: global burden of disease study
- regional patterns of disability-free life expectancy and disability-adjusted life expectancy: global burden of disease study
- global mortality, disability, and the contribution of risk factors: global burden of disease study
- alternative projections of mortality and disability by cause: global burden of disease study
- the structure and function of complex networks
- estimation of the asymptomatic ratio of novel coronavirus infections (covid-19)
- real-time forecasting of an epidemic using a discrete time stochastic model: a case study of pandemic influenza (h1n1-2009)
- did modeling overestimate the transmission potential of pandemic (h1n1-2009)?
- sample size estimation for post-epidemic seroepidemiological studies
- a systematic review of studies on forecasting the dynamics of influenza outbreaks
- epidemic processes in complex networks
- materials and strategies that work in low literacy health communication
- the effect of control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan, china: a modelling study
- population balance modeling: current status and future prospects
- game theory of social distancing in response to an epidemic
- large-scale spatial-transmission models of infectious disease
- using "outbreak science" to strengthen the use of models during epidemics
- real-time forecasts of the covid-19 epidemic in china
- dynamical behavior of an epidemic model with a nonlinear incidence rate
- using a delay-adjusted case fatality ratio to estimate under-reporting
- mathematical modeling of infectious disease dynamics
- the health literacy skills framework
- real-time numerical forecast of global epidemic spreading: case study of a/h1n1pdm
- basic reproduction numbers for reaction-diffusion epidemic models
- ebola virus disease in west africa: the first months of the epidemic and forward projections
- nature-inspired optimization algorithms
- the simplicity complex: exploring simplified health messages in a complex world. health promotion international
- spread of zika virus in the americas
- clinical course and risk factors for mortality of adult inpatients with covid-19 in wuhan

all authors wish to thank khalifa university and the government of abu dhabi for the funding and support.

the impact of specific interventions on the outbreak time course, number of cases and outcome of fatalities was evaluated, using data available from the covid-19 outbreak as of early april 2020. our preliminary results for the scenarios above and the parameter values used indicate that:
1. universal social isolation measures may be effective in reducing total fatalities only if they are strict and the average number of daily social interactions is reduced to very low numbers.
2. selective isolation of only the age groups most vulnerable to the disease appears almost as effective in reducing total fatalities, but at a much lower economic damage. the comparison between the impacts on the final total number of fatalities of social isolation applied to all versus selectively by age (figure ) shows that the isolation of the elderly alone can achieve an impact equivalent to that of isolating all.
3. an increase in the number of critical care beds could save significant numbers of lives. using our current parameter values, for a one-million population, an estimated number of fatalities could be avoided per extra available critical care unit.
4. the use of protective equipment (ppe) appears capable of very significantly reducing total fatalities if implemented extensively and to a high degree.
5. extensive random testing of the population, leading to infection recognition and subsequent immediate (self) isolation of the infected individuals, can dramatically reduce total fatalities, but only if implemented to almost the entire population and sustained over time.
6. ending isolation measures while r0 is above one (with a safety factor) appears to render the previous isolation measures useless, as the fatality rate eventually reaches values close to those of the do-nothing scenario.
7. ending isolation measures only for the younger population while r0 values are still above one increases total fatalities, but only around half as much as if isolation is ended for everyone.
8. a threshold value for the daily fatality rate (equivalent to r0 below one) appears to exist for the feasible end of isolation measures. daily fatality rates are known very accurately, unlike r0, and could be used as criteria for intervention.
in figure the impacts on the total final number of fatalities of the withdrawal of social isolation from threshold values for r and for the daily fatality rate per million people are shown, comparing the cases when withdrawal is done universally or restricted only to those under years old. it is important to note that any interpretation of the above results for the covid outbreak interventions must be considered only qualitatively at this stage, due to the low confidence (lack of complete and valid data) in the parameter values available at the time of writing. any quantitative interpretation of the results must be accompanied by a critical discussion in terms of the model limitations and its frame of application. next immediate steps involve the sensitivity analysis of the parameters with the lowest confidence. a roadmap for model expansion and broader implementation is discussed below. the matlab® source code and excel file containing all parameter values used, as well as a non-age-segregated version of the model, are available at https://github.com/envbioprom/covid_model

appendix i. epidemiological and clinical parameters per age group: covid case study. table a . default epidemiological and clinical parameters per age group used in the covid outbreak case study simulations presented.

appendix ii. data sources for the epidemiological and clinical parameters: covid case study. table a . data sources and level of confidence assigned to the epidemiological and clinical parameters from table a for the covid outbreak case study. the model simulation of the outbreak time course under the default parameters and no intervention is presented in figure a .

key: cord- -ve krq authors: stebler, rosa; carmo, luís p.; heim, dagmar; naegeli, hanspeter; eichler, klaus; muentener, cedric r. title: extrapolating antibiotic sales to number of treated animals: treatments in pigs and calves in switzerland, – date: - - journal: front vet sci doi: . /fvets. .
sha: doc_id: cord_uid: ve krq to evaluate the contribution of antimicrobial use in human and veterinary medicine to the emergence and spread of resistant bacteria, the use of these substances has to be accurately monitored in each setting. currently, various initiatives collect sales data of veterinary antimicrobials, thereby providing an overview of quantities on the market. however, sales data collected at the level of wholesalers or marketing authorization holders are of limited use to associate with the prevalence of bacterial resistances at species level. we converted sales data to the number of potential treatments of calves and pigs in switzerland for the years to using animal course doses (acd). for each authorized product, the number of potential therapies was derived from the sales at wholesaler's level and the acd in mg per kg. for products registered for use in multiple species, a percentage of the sales was attributed to each authorized species according to their biomass distribution. we estimated a total of , , therapies for pigs and , , for calves in . using the number of slaughtered animals for that year as denominator, we calculated a treatment intensity of . therapies per pig and . per calf. between and , sales of veterinary antimicrobials decreased by %. the calculated number of potential therapies decreased by % for pigs and % for calves. an analysis of treatment intensity at antimicrobial class level showed a decrease of % for colistin used in pigs, and of % for macrolides used in both pigs and calves. whereas the use of rd and th generation cephalosporins in calves decreased by . %, usage of fluoroquinolones increased by . % in the same period. corresponding values for pigs were − . and + . %. this is the first extrapolation of antimicrobial usage at product level for pigs and calves in switzerland. 
it shows that calves were more frequently treated than pigs, with a decreasing trend for both the number of therapies and the use of colistin, macrolides and cephalosporins rd and th generations. nonetheless, we calculated an increase in the usage of fluoroquinolones. altogether, this study's outcomes allow for trend analysis and can be used to assess the relationship between antimicrobial use and resistance at the national level.

keywords: antibiotics, antimicrobial consumption, course dose, pigs, calves

introduction

use of antimicrobials contributes to the emergence and spread of resistant bacteria in both humans and animals. as early as in the 's, concerns arose in relation to therapeutic, preventive and growth-promoting treatments in food-producing animals. the fact that most antibiotic classes are administered to treat infections in both humans and animals was one of the major concerns ( , ). monitoring antimicrobial usage is therefore a prerequisite to assess the impact of antibiotic treatments on the selection and spread of bacterial resistances. in order to achieve that goal, a number of programs monitoring sales and/or usage of antimicrobials have been established at both national level, for example in switzerland [arch-vet; ( )] and denmark ( ), and international level [esvac project of the european medicines agency; ( )]. these programs do not only aim at the identification of trends in sales and usage of antimicrobial classes but should also allow establishing a link with changes observed in resistance monitoring programs, thereby providing a basis for risk assessment and evaluation of regulatory interventions ( ).
in order to assess the association between antimicrobial use and resistance, it is crucial to obtain consumption data at species or, when possible, production-type level; several species- and production-type-specific factors can affect the relationship between use and resistance. those factors include age at treatment, age and weight at slaughter, products available per species or production type, and especially production structures ( ) ( ) ( ). antimicrobial sales data are defined as the minimal standard for monitoring programs by the world organization for animal health [office international des epizooties, oie; ( )]. they can be collected at either the manufacturer, wholesaler or pharmacy level, depending on the national distribution routes of the products. sales data are useful to evaluate long-term trends but do not include information about dose, route of administration, indication or duration of therapy. however, in the context of resistance epidemiology, only data about actual use of antimicrobials, collected either at prescription or patient level, might deliver the information necessary to establish and evaluate implemented measures. such data can currently be collected in only a few countries with advanced collection systems, such as denmark ( ) and the netherlands ( ), among other european countries. the aacting network maintains a list of the various collection systems already in place (www.aacting.org). the collection of data at animal level is the ultimate goal of antimicrobial monitoring systems and, until this is available in all participating countries, alternatives using normalization of sales data by the total weight of the food producing animal population as a denominator have been developed. one such denominator is the population correction unit of the esvac project ( ).
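as a rough sketch of this kind of normalization, an esvac-style population correction unit (pcu) sums the biomass of the animal population (animals × standard weight) and divides the mass of active ingredient sold by it. the animal counts, weights and sales volume below are illustrative placeholders, not the swiss figures used in the study.

```python
# Sketch of sales normalization by a population correction unit (PCU),
# in the spirit of the ESVAC approach:
#   PCU (kg) = sum over categories of animals x standard weight at treatment.
# All counts, weights and the sales volume are illustrative placeholders.

standard_weight_kg = {"fattening_pig": 65.0, "calf": 80.0, "dairy_cow": 425.0}
animals = {"fattening_pig": 2_500_000, "calf": 250_000, "dairy_cow": 550_000}

pcu_kg = sum(animals[sp] * standard_weight_kg[sp] for sp in animals)

total_sales_mg = 30_000 * 1e6   # 30 tonnes of active ingredient (placeholder)
mg_per_pcu = total_sales_mg / pcu_kg
print(f"PCU: {pcu_kg:,.0f} kg; sales: {mg_per_pcu:.1f} mg/PCU")
```

as the text notes, such an indicator tracks long-term trends but carries no information on dose, duration or route of administration.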
other institutions ( ) and countries, including canada ( ) and switzerland [arch-vet; ( )], have implemented similar methods in their surveillance systems. as usage of antimicrobials is strongly dependent on population structure and on the repartition between high- and low-using species, normalization by weight may provide information on long-term trends but, at the same time, higher usage in one species will be "diluted" by lower-usage species/production types (like dairy cows) with a large contribution to the overall livestock biomass ( ). it is therefore important to measure antibiotic consumption as near as possible to the end users, i.e., to obtain information on species, dosage, duration and, whenever possible, indication. the extrapolation of sales data using course doses is an interim measure until data collection at animal level is in place. course dose indicators have been proposed, such as the animal course dose (acd) by the french agency for food, environmental, and occupational health & safety ( ) or the defined course dose (dcdvet) from the ema ( ). an advantage of acd is its product-specific calculation, therefore better representing national specificities than dcdvet units. acds are established for each product using data from the summary of product characteristics (spc) and contain the necessary detail on both dose (and therefore potency) and duration of use. the main aim of this study was to provide for the first time an extrapolation of the available national sales data to the number of treated animals in switzerland. we chose to specifically investigate the treatment of pigs and calves because these are mainly reared and treated in groups via oral application. due to the lack of detailed data about the repartition of sales, we made assumptions regarding weight at treatment and repartition of sales data between species using a previously published repartition method.
we then defined acds for each product containing antimicrobials authorized in switzerland for use in either pigs or calves and combined this information with national antibiotic sales data to extrapolate the number of potentially treated animals during the years to . veterinary antibiotic sales data for the years to were obtained from the federal office of food safety under a confidentiality agreement. since , sales data are collected in switzerland from marketing authorization holders based on article of the ordinance of veterinary medicines ( ). marketing authorization holders are required to deliver data on every product containing antimicrobials that was sold during a calendar year. products subject to data collection are defined by their atcvet codes ( ) as listed in the esvac project ( ). additionally, data on antibiotic products not considered by the esvac project, like sprays or products to treat sensory organs, are also collected. data obtained from the federal office of food safety for this study contained the quantity of active antimicrobial ingredient sold in kilograms for each product and year under investigation. the amount of antimicrobials sold in products authorized for a single species was directly assigned to that target species. for each product authorized for more than one species, a repartition had to be determined. we used two distinct methods: the first was used for premixes, which are legally defined in switzerland as "veterinary medicinal products used to treat groups of animals and incorporated into either water or feed" [ordinance on authorizations for medicinal products, art. ; ( )]. for all of these products, periodic safety update reports (psurs) containing data on species repartition, submitted to swissmedic, the swiss agency for therapeutic products, during the years to , were used. as premixes represented only products from a total of under investigation but between . % ( ) and .
% ( ) of the total sales, another repartition method had to be used for oral solutions, oral powders and injectables. this repartition was done according to biomass repartition as described by carmo et al. ( ). briefly, for each product authorized for one or more target species, each target species was assigned a percentage of the total sales (in kg) representing the proportion of its biomass in the total biomass of all authorized species for the product. for the present study, food producing animal population numbers were obtained from the federal office of statistics (www.bfs.admin.ch), the number of dogs from the anis database (identitas ag, bern, www.anis.ch) and the number of cats from the swiss association of pet food producers (verband für heimtiernahrung, bern, www.vhn.ch). in analogy with calculations of the population correction unit (pcu) of esvac ( ), the number of slaughtered animals was used for fattening pigs and calves, whereas data for dairy cows, sows, sheep, goats, horses, dogs, and cats represent live animals. throughout the text and in the tables, "pigs" refers to fattening pigs. supplementary table lists the number of animals and the weights used for the biomass repartition. the most likely weight at treatment was sourced from the esvac report ( , ). as heavy animals with a rather low treatment intensity, like dairy cows, skew the biomass repartition, we chose to only include them in the calculation when they were either explicitly listed as authorized species ("dairy cows") or when a withdrawal time for milk was given in the spc of the product. for pigs, we did not include the production stage of piglets or weaners.
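the biomass-based repartition described above can be sketched as follows; the species list, animal counts and weights are illustrative placeholders, not the swiss figures, and the product is hypothetical.

```python
# Sketch of the biomass-based repartition used for multi-species products:
# each authorized species receives the share of a product's yearly sales
# equal to its share of the summed biomass of all authorized species.
# Counts, weights and the example product are illustrative placeholders.

def repartition_by_biomass(sold_kg, authorized, biomass_kg):
    """Split sold_kg across the authorized species by biomass share."""
    total = sum(biomass_kg[sp] for sp in authorized)
    return {sp: sold_kg * biomass_kg[sp] / total for sp in authorized}

biomass_kg = {                       # number of animals x likely weight (kg)
    "fattening_pig": 2_500_000 * 65,
    "calf": 250_000 * 80,
    "dairy_cow": 550_000 * 425,
}

# A hypothetical oral powder authorized for pigs and calves only:
split = repartition_by_biomass(500.0, ["fattening_pig", "calf"], biomass_kg)
print({sp: round(kg, 1) for sp, kg in split.items()})
```

the sketch also makes the bias discussed later visible: had the hypothetical product been authorized for dairy cows as well, their large biomass share would have absorbed most of the sales.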
using the number of animals in different production stages presents some challenges, the most prominent one for pigs being the lack of available data on the repartition of use between piglets and, e.g., fattening pigs. only a few antimicrobials are primarily used in piglets or weaners, colistin being one example. for almost all other products authorized for pigs, no data are available to stratify antimicrobial consumption across age classes using sales data. repartition data will only be available once reporting of all treatments with antimicrobials in switzerland is made mandatory at the end of the year . for this reason, and because sales data include the use of antimicrobials by all age categories of the species for the years under investigation, we used the number of slaughtered pigs as denominator for the therapeutic intensity in this species. finally, for injectable products authorized without indication of the production stage ("bovines" including dairy cows and "pigs" representing slaughtered pigs and sows), we used raw data provided by experts for the study by carmo et al. ( ) to determine whether use would take place in the particular production stage under consideration.

the animal course dose (acd) was calculated for each active pharmaceutical ingredient contained in each product authorized during the years under investigation. data were collected from the authorized summary of product characteristics ( ) and entered into an ms excel sheet containing: name of the product, authorization number, list of authorized species, active ingredient(s), dose and duration. doses given in international units were converted to mg using conversion factors listed in the esvac report ( ). whenever the recommended dose was a range, the highest recommended dose and longest duration were chosen to reflect the minimal number of animals potentially treated. moreover, when different doses were authorized for different indications, the most likely indication was chosen. this was the case for products presenting both a prophylactic and a metaphylactic indication with different doses and durations. acds were defined per kg, and the acd per animal was obtained by multiplication with the likely weight at treatment. to take swiss specificities into account, the weight at treatment for pigs was taken from a previous study by schnetzer et al. ( ) and the weight for calves was based on expert opinion (prof. m. kaske, zurich, personal communication). therapeutic intensity reflects the number of acds per slaughtered animal (pig or calf) per year. for combination products, the number of acds was calculated separately for each active pharmaceutical ingredient; therefore, a single treatment with a combination containing antimicrobials results in acds. acd and therapeutic intensity were calculated using the following equations:

number of acds = total quantity of active ingredient sold in one year (mg) / [daily dose (mg/kg) × duration of treatment (days) × weight at treatment (kg)]

therapeutic intensity in species x = number of acds in species x / total number of animals for species x

from the year to , sales of antibiotics for use in food producing animals decreased by . % (table ). in the same time, the percentage represented by premixes decreased from . to . %. therefore, measured in kg, antimicrobials sold as premixes made up the largest part of yearly sales of antimicrobials in veterinary medicine. as a consequence, pigs and calves are the most pertinent food producing species to be investigated for use and trend detection. in tonnage sold for use in these species, the decrease over the years under investigation is comparable: . % in pigs and . % in calves. however, normalizing these numbers to the respective biomass of the produced (slaughtered) population reveals a much higher use per kg of biomass for calves ( . mg/kg biomass in ) than for pigs ( . mg/kg biomass). the difference between both species even increased from .
-fold higher for calves in to . in . normalizing sales data to either the overall biomass of food producing animals or to the biomass of a particular species yields a crude estimate of antimicrobial use, as it takes neither dose nor duration into account. we therefore calculated the number of course doses (acds) per product and species. a summary of the results is presented in table . the total number of acds was approximately . times higher in pigs and decreased by . % over the years under investigation, whereas the decrease for calves was . %. normalization to the number of slaughtered animals showed a much slower decrease of . % for calves between and compared to . % in pigs. as a result, the difference between both species grew from . -fold in the year to . -fold in the final year under investigation. not all antibiotics have the same potential impact on resistance selection and the same consequences for the treatment of both humans and animals. moreover, different products are authorized for distinct conditions in pigs or calves. the repartition of the number of acds per class of antimicrobials was therefore calculated separately for each species for the year .
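the course-dose arithmetic defined in the methods — yearly sales divided by the quantity one full course requires for one animal, then normalized by the number of slaughtered animals — can be sketched as follows; the product figures and animal numbers are hypothetical, not from the study.

```python
# Sketch of the ACD arithmetic from the methods section:
#   ACDs = sold (mg) / [dose (mg/kg) x duration (days) x weight (kg)]
#   therapeutic intensity = ACDs / animals slaughtered.
# The product figures and animal counts below are hypothetical.

def number_of_acds(sold_mg, dose_mg_per_kg, duration_days, weight_kg):
    """Sales divided by the quantity one full course needs for one animal."""
    return sold_mg / (dose_mg_per_kg * duration_days * weight_kg)

def therapeutic_intensity(acds, animals_slaughtered):
    """Average number of course doses per animal produced."""
    return acds / animals_slaughtered

# Hypothetical oral product for fattening pigs:
sold_mg = 800 * 1e6                 # 800 kg of active ingredient sold per year
acds = number_of_acds(sold_mg, dose_mg_per_kg=10, duration_days=5,
                      weight_kg=65)  # assumed likely weight at treatment
print(f"potential therapies: {acds:,.0f}")
print(f"intensity: {therapeutic_intensity(acds, 2_500_000):.2f} per pig")
```

per the methods, a combination product would run this calculation once per active ingredient, so a single administered course of a two-ingredient product counts as two acds.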
for calves, (amino)penicillins were the class with the highest number of course doses per animal, followed by macrolides and aminoglycosides. the total number of potential acds per animal for injectable products in the year was . for calves and . for pigs. finally, the evolution of the number of potential acds per animal for hpcias is presented in table . for macrolides used in pigs, a decrease of . % for products sold as premixes was attenuated by a corresponding increase of . % for injectables. this pattern was even more evident in calves, where a reduction of . % for premixes was almost completely compensated by an increase of . % in injectables. with respect to the other two classes of hpcias, sales of products containing fluoroquinolones remained stable for pigs (− . %) and an increase of . % was observed for the number of potential acds per animal in calves. courses with cephalosporins of the third and fourth generations showed a comparable decrease in pigs (− . %) and calves (− . %).

this is the first study at national level using the acd concept applied to sales of antimicrobials with the objective of extrapolating the number of potentially treated pigs and calves in switzerland. sales of antimicrobials for veterinary medicine have been published at national level since . so far, these data represent the only available source of exhaustive antimicrobial consumption data at national level. sales figures may allow for the recognition of trends, but the lack of information on potency, dose, duration of treatment and repartition per species strongly limits their usefulness. the acd indicator may therefore help to bridge that gap. calculation of acds and repartition of quantities for products authorized for more than one species would not be possible without making assumptions, which might influence the results. the first assumption relates to the weight of the animals.
the standard weight has an impact on both the calculation of the species repartition and the acd indicator itself. the impact of using different weights is a topic beyond the scope of this study and its effect on the calculations has been studied elsewhere ( ) ( ) ( ). in this study we used weights at treatment as close as possible to the swiss reality. this should provide the best fitting results, and also guarantees future reproducibility of the method and comparison of results, as these weights are likely to be used when quantifying swiss antimicrobial consumption at both national and international level. this approach is comparable to the one chosen by the esvac project. the method used to stratify antimicrobial consumption by the production types included in the study has some potential bias. as it is based on the total biomass of each animal category, the resulting estimates are highly dependent on the animal demographics and the average animal weights used. this might not always be a representative surrogate of the product repartition by each category. as a reliable repartition is generated by data collected on actual usage, and such data are currently not available in switzerland, we chose an alternative that was applicable at product level, would deliver reproducible results over the years, and would be as accurate as possible. carmo et al. ( ) have compared three different methods to determine the species repartition of antimicrobials. the longitudinal study extrapolation method (based on field data) was not applicable at single product level due to the requirement for minimum, mode, and maximum starting values. the biomass distribution was shown to be the method providing the closest results to the extrapolation based on field data, thereby increasing our confidence in the pertinence of the approach we applied. the two main drawbacks of this method are the dependence on defined average weights and country-specific animal demographics.
however, the method, limited by the data available in the current swiss context, provides a first insight into antimicrobial consumption patterns in different species/production types. in the future, the data collection system is-abv (description available under http://www.aacting.org/matrix/is-abv/?lid= ) will provide further insights into these patterns, as well as a basis for comparison with the results from the method and its potential biases. to make our extrapolations as comparable as possible with other projects, we used the same standard weights as in the esvac project ( ). it must also be noted that the denominators of the indicators presented were based on the number of slaughtered animals only. the weights used for the calculation of the biomass were likely weights at treatment as defined in the esvac project ( ). the use of such a calculation might hinder direct comparisons with other studies and should be taken into consideration when benchmarking these results. when using the biomass as a denominator, the result should be interpreted as an indicator of the amount of active ingredient used per kg of animal produced. likewise, the therapeutic intensity indicates the average number of acds per animal produced/slaughtered. both a high proportion of heavier animals like cows or, alternatively, a high treatment intensity in a species of lower biomass are examples of how animal demographics can bias the results of the stratification approach based on the biomass. the repartition across species is mainly influenced by national production structures. in switzerland, dairy production is an important agricultural sector and therefore dairy cows make up a high proportion of the food producing animal sector ( ). cows represented % of the total biomass in the year and this high proportion leads to an underestimation of the repartition of sales for pigs or calves.
this primarily affects the repartition of aminoglycosides and third-generation cephalosporins, which are antimicrobials frequently used in the treatment of dairy cows. the calculated numbers of acds per animal for these classes presented in table are, therefore, an underestimation. within the same species, biomass repartition could have been used to estimate the use of antimicrobials in different production stages of pigs. however, using piglets, weaners and fattening pigs produced during the year introduces the bias of counting a significant but undefined proportion of the animals two or three times. as sales data were only available for one full year, we therefore chose to base our repartition, as well as the denominator for the treatment intensity, on the number of pigs slaughtered during the same year. this indicator is used in this study as a surrogate for all pig production stages. as the numbers of acds represent an extrapolation of usage data based on sales figures, they follow the latter closely. the downward trend in sales is mirrored by the treatment numbers of both calves and pigs. however, differences become evident as soon as additional factors like application route are taken into account. the repartition for pigs in the year shows that % of the active ingredients were used parenterally when based on quantity, whereas they represented % of the treatments when using acds. the main reason for this difference lies in the potency of the active ingredients: antimicrobials are administered parenterally at a lower dose because no active ingredient is lost to the lower bioavailability associated with oral application. another possible reason is the use of more than one acd for parenterally applied combination products, as of injectable products investigated were combinations of two active ingredients.
although this approach can be disputed as it shows a higher number of "treatments," we think that the use of acds is better suited to test for associations between antimicrobial use and resistance. converting sales of antimicrobials to the number of treatments per animal allows detection of trends that would not be obvious when only assessing the quantity of active ingredients sold. macrolides used to treat calves provide a good example: our results show a clear shift from oral application in the form of premixes toward an increased use of injectables. one possible explanation is the increasing availability of macrolide antibiotics with a long duration of action, e.g., tulathromycin, tildipirosin, and gamithromycin. such active compounds combine the ease of a single application with a long duration of action. moreover, for parenteral applications, both the time to maximal concentration and the maintenance of active levels are not influenced by the appetite of the animals, therefore guaranteeing adequate treatment of sick animals with reduced appetite. on the negative side, studies about macrolides used in human medicine convincingly showed a higher level of resistance selection for longer-acting molecules ( ). our results show a strong difference in the extrapolated usage of antimicrobials between pigs and calves. this cannot be explained by a single factor, as the administration of antimicrobials is driven by medical, economic and also psychosocial factors. crowding effects, stress during transport of very young, not yet immunocompetent animals, partially inadequate colostrum feeding and a less than ideal stable climate are among the factors favoring respiratory problems in calves ( , ). in the swine industry, some of the abovementioned factors also exist, but the structure and management of pig production limits the risks.
management practices like all-in-all-out, including disinfection between batches, or integrated production from piglet to finisher can strongly help to reduce antimicrobial usage. in pigs, there are two main periods at risk for treatment with antimicrobials: the first at weaning with around kg body weight and the second at around to kg body weight ( , ). in pigs, diarrhea is one of the leading indications for treatment. this is a very unspecific symptom with many different causes, including not only bacterial but also dietary or viral origins. in this context, the availability of vaccines against both circovirus and lawsonia intracellularis infections in the years to contributed to the reduction of diarrheal symptoms and, hence, of the rather indiscriminate use of antibiotics to treat such symptoms. for calves, respiratory diseases are much more multifactorial, and the introduction of various vaccines (against bovine respiratory syncytial, parainfluenza or coronavirus) seems not to have had the same positive effect as in the pig industry. several factors hinder a proper comparison of our results with previously published data. to the best of our knowledge, this is the first time that the acd indicator is used at national level in switzerland. as a matter of fact, its use is not currently widespread in other countries, with the exception of france where it was developed. however, the comparison with french data is difficult. no publication presents the french antimicrobial consumption using acds per animal and year as an indicator. the french indicator for exposure to antimicrobials is alea [animal level of exposure to antimicrobials; ( )]. it is obtained by dividing the effectively treated biomass by the total biomass of the same species. the global alea calculated for the year in france was . and represented a decrease of . % compared to . another difficulty is the use of different production categories and standard weights at treatment.
for pigs, the french system uses weights up to kg for a specific category of sows, and the average for the pig population is set at kg. this is . times higher than the standard weight at treatment of kg identified in previous swiss studies and used here. the differences in the standard weights at treatment also explain the discrepancies in the antimicrobial consumption for france published, for the same year, in the esvac report ( mg/pcu) and in the anses report ( mg/kg). due to the differences in weights and categories, and the difficulties in making assumptions and extrapolations, we decided not to compare our figures to the french ones. our data can only be compared with countries where calves are reared for the production of veal meat. besides france and belgium (for which we could not find adequate data for comparison), this production system also exists in the netherlands. the available report for the year ( ) uses indicators differing from the ones in the present study but still shows a higher treatment intensity in calves compared to pigs. this is in line with the present study, where antimicrobial use was . -fold higher in calves than in pigs. both examples clearly illustrate the need to harmonize methodologies at the international level in order to discuss data collected in different countries. such discussions currently take place within the aacting network (www.aacting.org). this first study of the number of treatments of pigs and calves extrapolated from yearly sales shows both similarities and differences between the two species under consideration. whereas the sales by species and the number of extrapolated treatments both decreased in a similar way, the gap in the number of treatments per animal between pigs and calves varied over the years under investigation. given that the applied method is based on the extrapolation of sales figures, a similar decrease at species level was to be expected. 
however, the use of course doses allows us to further investigate trends in the patterns of antimicrobial treatments. in our study, this was very clear for the class of macrolides, for which the decreases in oral use were partly (pigs) or completely (calves) compensated by the application of long-acting injectables. we therefore recommend the use of extrapolated treatment numbers when no exhaustive collection of usage data is in place. the concept of acds can also complement the collection of antimicrobial consumption data at species level, allowing their validation against sales data. all datasets generated for this study are included in the manuscript/supplementary files. rs did all the calculations presented in this work. lc helped with the repartition of sales between species, expert advice and biomass distribution. raw sales data and advice regarding their use were provided by dh. the study was designed by cm and supervised by ke and hn. joint committee on the use of antibiotics in animal husbandry and veterinary medicine world health organization. the medical impact of the use of antimicrobials in food animals arch-vet : bericht über den vertrieb von antibiotika in der veterinärmedizin in der schweiz use of antimicrobial agents and occurrence of antimicrobial resistance in bacteria from food animals, food and humans in denmark european surveillance of veterinary antimicrobial consumption. 
sales of veterinary antimicrobial agents in european joint fao/oie/who expert workshop on non-human antimicrobial usage and antimicrobial resistance: scientific assessment the antimicrobial resistome in relation to antimicrobial use and biosecurity in pig farming, a metagenome-wide association study in nine european countries association between antimicrobial usage, biosecurity measures as well as farm performance in german farrow-to-finish farms factors associated with high antimicrobial use in young calves on dutch dairy farms: a casecontrol study antimicrobial resistance: monitoring the quantities of antimicrobials used in animal husbandry monitoring of antimicrobial resistance and antibiotic usage in animals in the netherlands in trends in the sales of veterinary antimicrobial agents in nine european countries annual report on antimicrobial agents intended for use in animals canadian antimicrobial resistance surveillance system comparison of the sales of veterinary antibacterial agents between european countries anses. suivi des ventes de médicaments vétérinaires contenant des antibiotiques en france en . anses -agence nationale du médicament vétérinaire european surveillance of veterinary antimicrobial consumption. ema/ / : defined daily doses for animals (dddvet) and defined course doses for animals (dcdvet) tamv. verordnung über die tierarzneimittel who collaborating centre for drug statistics methodology norwegian institute of public health ambv. verordnung über die bewilligungen im arzneimittelbereich approaches for quantifying antimicrobial consumption per animal species based on national sales data: a swiss example validation of the exposure assessment for veterinary medicinal products expert opinion on livestock antimicrobial usage indications and patterns in denmark, portugal and switzerland institut für veterinärpharmakologie und -toxikologie. 
swiss veterinary drug compendium therapieintensität beim einsatz von fütterungsarzneimitteln bei schweinen [calculation of therapeutic intensity for pigs treated using medicated feed critically important antimicrobials for human medicine - th rev prospective study on quantitative and qualitative antimicrobial and antiinflammatory drug use in white veal calves comparing antimicrobial exposure based on sales data antimicrobial use in pigs, broilers and veal calves in belgium. vlaams diergeneeskundig tijdschrift effect of azithromycin and clarithromycin therapy on pharyngeal carriage of macrolide-resistant streptococci in healthy volunteers: a randomised, double-blind, placebo-controlled study risk factors for death and unwanted early slaughter in swiss veal calves kept at a specific animal welfare standard antimicrobial drug use and risk factors associated with treatment incidence and mortality in swiss veal calves reared under improved welfare conditions berechnung der therapieintensität bei ferkeln und mastschweinen beim einsatz von antibiotika in fütterungsarzneimitteln monitoring of antimicrobial resistance and antibiotic usage in animals in the netherlands in the supplementary material for this article can be found online at: https://www.frontiersin.org/articles/ . /fvets. . /full#supplementary-material conflict of interest: the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.copyright © stebler, carmo, heim, naegeli, eichler and muentener. this is an open-access article distributed under the terms of the creative commons attribution license (cc by). the use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. 
no use, distribution or reproduction is permitted which does not comply with these terms. key: cord- - iwzsp authors: ng, travis; chong, terence; du, xin title: the value of superstitions date: - - journal: j econ psychol doi: . /j.joep. . . sha: doc_id: cord_uid: iwzsp this paper estimates the value of superstitions by studying the auctions of vehicle license plates. we show that the value of superstitions is economically significant, which justifies their persistence in human civilization. we also document the changes in the value of superstitions across different types of plates, across different policy regimes, and across different macroeconomic environments. interestingly, some of the changes are rather consistent with economic intuition. the questions we address differ from those in the three papers. instead of documenting the underlying pricing mechanism of license plates, we take the price of a plate as an equilibrium outcome that reflects its social value. we are interested in determining how significant the value of superstitions is. in addition, beyond looking at the value of superstitions from a static point of view, we look at it from a dynamic perspective. in particular, we ask whether the value changes across different types of plates, different policy regimes, and different macroeconomic environments. two key features of our empirical analysis enable us to address these new and interesting questions. first, our data cover two distinct periods: before and after a major policy change in , namely the introduction of an entirely different type of license plates. this provides us with a unique opportunity to study the change in the value of superstitions in response to the policy change. we also link the data to the stock market index and examine whether the value of superstitions varies with the macroeconomic environment. second, we estimate plates of different types separately, rather than jointly as in woo and kwok ( ), and woo et al.
( ). in addition to being a strategy to disentangle the effect of superstitions from the effect of conspicuous consumption, a point discussed in section . , our estimation strategy allows us to ask how the value of superstitions changes across different types of plates, a question addressed in section . . . it is important to note that although we use a different estimation strategy, our estimated coefficients are consistent with those in woo and kwok ( ), and woo et al. ( ), suggesting that their qualitative results are robust. along the same lines, foster and kokko ( ) presented a simple model in which the false causalities people develop survive natural selection, thereby becoming superstitions. for instance, mr. dilip rangnekar and miss elizabeth young from the communications department of otis elevator company, the world's largest elevator manufacturer, confirmed to us on november , that roughly % of their elevators around the world do not have the th-floor button. many accessories containing an evil eye are sold in europe. kramer and block ( ) give more examples of superstition and marketing practices. woo and kwok ( ) use the data of , license plates from to , chong and du ( ) to the best of our knowledge, this paper is the first in economics to study how the value of superstitions responds to policy changes, and how the changes per se square with economic intuition. it is also the first to document that the value of superstitions may vary across different macroeconomic environments. this study takes advantage of one of the few datasets that make it possible to link superstitions with social value: the auction data of vehicle license plates in hong kong. over % of the hong kong population is of chinese descent, and most understand cantonese. 
as the largest ethnic group in the world, the chinese would have to be quite superstitious to justify the following observations: the $ round-trip deal from new york to beijing by continental airlines in , the beijing olympics opening ceremony at / / at p.m., the missing th, th, and th floors in many apartments in hong kong, and the peak of cardiac mortality of chinese americans and japanese americans (for whom " " is unlucky) on the th of the month, a striking pattern absent in white americans. the hong kong government started auctioning license plates in . the government is the only institution that sells plates through an open auction. table summarizes the main features of the different types of plates available for auction. only traditional plates were available before september . they consist of either no letter prefix or a two-letter prefix, followed by a number between and (e.g., ab , lb , and ). the law further groups traditional plates into two mutually exclusive types: ordinary and special. we now briefly explain their key distinctions, namely assignment and transferability. the government automatically assigns to mary, for example, an ordinary plate upon registration of her vehicle (usually by sequence). if mary does not like the plate, she can return it to the government and bid for a plate she likes in an auction. mary can go to an auction, but there is no guarantee that she will find a plate she likes there. alternatively, she can reserve one unassigned plate (either ordinary or special) in advance and go to the particular auction to bid on the plate she has reserved. of course, there is still no guarantee that she will win that plate. mary can also buy a plate from someone else. the government does not assign special plates, which generally have more appealing numbers. if mary wants a special plate from the government, she can only get it through the auction. owners can legally transfer their ordinary plates. 
if mary likes peter's ordinary plate, they can trade with each other. in contrast, if mary likes john's special plate, even if john would like to sell it to her, they cannot legally make a deal. phillips et al. ( ) examine mortality data in the us and find such a striking pattern. in , the legislation whereby vehicle plates could be sold by auction was introduced. the proceeds of the auctions go to a charity fund called the lotteries fund. specifically, it is the right to put a specific number on a license plate that is sold, not the plate per se. the buyer can put the specific number on any plate he likes. the earliest plates contain numbers only. when the number of vehicles had used up all the plates containing numbers only, the government added roman-letter prefixes such as "hk" and "xx". after "hk" and "xx" were exhausted, license plates starting with "aa", "ab", "ac", and so on were used. as stated in schedule of the road traffic (registration and licensing of vehicles) regulations (cap. sub.leg. e), a plate is special if it satisfies one or more of the following criteria: ( ) no letter prefix, ( ) number below , ( ) number in hundreds or thousands, ( ) symmetric mark, ( ) sequential mark, ( ) two pairs, ( ) alternate pairs, ( ) mark with identical numbers. this non-transferability restriction is intended to curb speculation. however, some "clever" practices can get around this restriction. this can have implications for our estimation strategy, as we will argue in section . . in march , the hong kong government first proposed to introduce personalized plates. the motivation was the fiscal budget deficit, a result of sars in and the economic downturn from to . as the sales of traditional plates went to a charity fund instead, the government proposed selling personalized plates to raise government revenue. 
subject to certain restrictions, personalized plates allow vehicle owners to personalize their plate numbers with up to eight digits (e.g., love u, www, relax, etc.). the first auction of personalized plates was held in september . the hong kong government sells plates by english oral ascending auctions. ordinary plates have a reserve price of hkd$ (hkd$ . = usd$ ). reserve prices vary for special plates and are set by the government. the auctioneer can raise the minimum bid increment during the auction. auctions are usually held during weekends and chinese lunar new year holidays. there is no fixed schedule, and therefore the number of auctions in a given month can vary. on average, each auction sells more than a hundred different plates sequentially. auction theory suggests that in an english oral ascending auction, if there is no binding minimum bid increment or binding reserve price, and the valuations of bidders are independent, then in equilibrium the winner pays an amount equal to the valuation of the second-highest bidder. the winning bid is thus the social opportunity cost. controlling for other factors, if superstitions determine the prices, the auction data allow us to link superstitions with social value. ideally, if plates were all non-transferable, as is supposedly the case for special plates, then the assumption of independent bidders' valuations seems plausible. as mentioned, however, there are ways to transfer supposedly non-transferable special plates "cleverly". ordinary plates are not subject to any transferability restriction. there are companies bidding for plates in auctions, aiming at trading them for profit. their valuations of a plate, therefore, depend on their estimates of its future price. these bidders' valuations have a common value component. this leads to the possibility of the winner's curse: the winning bidder is the most optimistic one, who over-estimates the future price. 
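the theoretical result cited above, that an english ascending auction with independent private values ends at (roughly) the second-highest valuation, can be illustrated with a minimal simulation. the function and the hypothetical valuations below are our own illustrative sketch, not the paper's model:

```python
import random

def english_auction_price(valuations, increment=1.0):
    """Simulate an English (oral ascending) auction with a fixed increment.

    Each bidder drops out once the standing price exceeds their valuation,
    so the winner pays approximately the second-highest valuation (within
    one increment), as standard auction theory predicts.
    """
    price = 0.0
    active = list(valuations)
    # keep raising the price while more than one bidder would still bid higher
    while len([v for v in active if v >= price + increment]) > 1:
        price += increment
        active = [v for v in active if v >= price]
    return price

random.seed(0)
vals = [random.uniform(100, 1000) for _ in range(5)]
# the simulated price lands within one increment of the 2nd-highest valuation
print(english_auction_price(vals), sorted(vals)[-2])
```

a binding reserve price, a variable increment, or a common value component (as with the plate-trading companies) breaks this clean mapping from winning bid to second-highest valuation, which is exactly the paper's reason for caution.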
the common value component makes it non-trivial to map winning bids to social cost. these concerns can constrain our empirical strategies. garratt & troger ( ) give a theoretical foundation for auction equilibrium in the presence of speculators. we obtained our data from the hong kong transport department. they contain auctions of traditional plates from january to january . the data span two very different periods. up to august , only traditional plates were available. the introduction of personalized plates in september marked a transition in this market. there were , traditional plates available for auction, of which , were sold. our dataset does not include the results of the auctions of personalized plates since september . table gives the breakdown of the number of observations by year. we observe the plate number, the auction date, whether or not the plate was successfully sold in the auction, and if so, the winning bid. we do not, however, observe the reserve prices of the special plates (which are made known at the auction house right before the auction begins), the number of bids, the bidders' identities, the bid increments, or the sequence of the auction. in addition, we do not observe whether the plates sold in the auctions are for personal use or for trade. table presents the real prices of plates by type and by year. we deflate the nominal prices by the consumer price index (cpi) of the auction month to adjust for inflation. a license plate literally serves no purpose other than to make it legal for a car to use the road. the plate number does not change this legal function. huge variations in the winning bids of plates, therefore, must reflect preferences over the plate numbers per se. we hypothesize that superstitions play a role in explaining this price variation; that is, having controlled for other factors, superstitions still explain part of the price variation. 
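the inflation adjustment described above (deflating each nominal winning bid by the cpi of its auction month) can be sketched as follows; the cpi values and the base of 100 are illustrative assumptions:

```python
def real_price(nominal_price, cpi_auction_month, cpi_base=100.0):
    """Deflate a nominal winning bid by the CPI of the auction month,
    expressing it at base-period prices."""
    return nominal_price * cpi_base / cpi_auction_month

# hypothetical: a HKD 120,000 bid in a month when the CPI stood at 120
print(real_price(120_000, 120.0))  # 100000.0
```

working in real prices is what makes winning bids comparable across the two very different periods the data span.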
the particular type of superstition that we focus on is the belief that a number that rhymes with something good (bad) brings good (bad) luck to the owner. it is a false belief because the likelihood of getting involved in a car accident depends on one's driving habits, not on one's plate number. such superstitions can change the allocation of economic resources. for instance, if those with "unlucky" plates drive their cars unnecessarily slowly, that is an economic cost. if such superstitions carry significant economic value, their prices must reflect it. different numbers rhyme differently in cantonese, but there is a universal consensus in hong kong that " " is good and " " is bad. the number " " rhymes similarly to the word "prosper" or "prosperity". thus, the superstition is that the number " " brings prosperity. the number " " rhymes similarly to the word "die" or "death". thus, the superstition is that the number " " increases the odds of dying. given that these superstitions have persisted for a long time, they must carry a significant economic value. consistent with this logic, the number " " should carry a significant premium on a license plate, while the number " " should carry a significant discount. as such, we aim at empirically estimating the premium and the discount. the estimation involves two inherent difficulties. first, license plates are publicly visible; they are conspicuous goods (also known as veblen goods). people buy an expensive conspicuous good to signal high income and achieve greater social status. the plates that most people believe to be expensive serve such a purpose. they are sold at a higher price because everyone expects them to be, and this is a self-fulfilling equilibrium. this effect, however, has nothing to do with superstitions. second, plates with different numbers are visually differentiated. some number patterns are generally regarded as visually more appealing than others. 
differences in visual appeal, therefore, have to be controlled for. to disentangle the conspicuous effect from the effect of superstitions, we use two strategies. first, we perform the estimation on plates with different numbers of digits separately. the idea is to exploit the fact that most people expect plates with fewer digits to command higher prices. table reflects this expectation. however, not many can tell the price difference among plates with the same number of digits. for instance, most expect "lb " to be more expensive than "lb " and "dr " to be more expensive than "dr ". not many can tell whether "lb " or "lb " is more expensive. second, in each estimation, we control for plate numbers that most people would expect to be expensive, in particular those with no prefix, or with the prefixes "hk" and "xx". most cannot tell the price difference between "dr " and "ag ", but they expect "hk " and "xx " to be more expensive than "dr " and "ag ", respectively. to disentangle differences in visual appeal from the effect of superstitions, we control for a variety of different combination patterns. table summarizes the variables we control for. in particular, we assume that people do not systematically prefer any particular letter or number in terms of visual appeal. for example, people would not value "ab" systematically higher than "jb", or "kk" higher than "jj". however, people may value "kk" systematically higher than "jb", as the same-letter prefix is visually more appealing. in addition, people would not systematically value " " higher than " ", but they would value " " higher than " " because of the sequential numbers. 
[table : variable definitions. num j — the count of the number "j" on the plate, for each digit j; numxyz — whether the three numbers are ordered in an "x", "y", "z" pattern (y/n).] while the structural estimation of auctions is the usual strategy for empirical auction studies, in addition to data limitations, the presence of (i) a common value component, (ii) a potentially binding reserve price, (iii) a potentially binding minimum bid increment, and (iv) sequential auctions of many plates substantially complicates the use of a structural estimation approach. the theory has yet to provide a mapping between winning bids and the bidders' valuations under all these constraints. we therefore abstract away from structural estimation and instead use hedonic pricing estimation. the hedonic pricing method studies how the price of a commodity relates to its attributes. court ( ) first introduced the methodology, and lancaster ( ) and rosen ( ) further developed it. woo & kwok ( ), chong & du ( ), and woo et al. ( ) employ hedonic estimations as well. mcdonald & slawson ( ) use hedonic estimation to study the effect of a seller's reputation in internet auctions. as noted in bajari & hortacsu ( ), however, using hedonic estimations requires somewhat stringent assumptions to interpret the "implicit prices" as buyer valuations. interpreting the results of hedonic estimations therefore calls for caution. the regression model, estimated by plate type, takes the following form: ln(real price) = a + b(letter prefix characteristics) + c(number patterns) + d(number counts of each digit) + k(year-month dummies) + error. the notations b, c, d, and k are vectors. the year-month dummies capture the macroeconomic environment that systematically affects the winning bids within the month. although we do not observe the number and identities of the bidders, we believe that their composition in any particular auction would influence the winning bids in that auction in certain ways. 
to take this into account, the model assumes that the error terms within an auction date are correlated in some unknown way, but that plates auctioned on different dates do not have correlated errors. we therefore calculate standard errors clustered by auction date. in addition, to account for heteroscedasticity, white-corrected standard errors are calculated. our estimation differs from those in woo & kwok ( ), and woo et al. ( ). in particular, the two papers estimate the plates with different digits jointly. in addition to disentangling the effects of conspicuous consumption from the effects of superstitions, estimating separately has three other advantages. first, it avoids ambiguous interpretation of some estimated coefficients. second, estimating separately allows the discount of a " " and the premium of an " " to vary across plates with different numbers of digits. as we will see in section . . , the effect is not constant in general. third, building on the two papers and obtaining consistent results under the different estimation specifications in our paper helps strengthen the results in the two papers. table shows the main results of the regressions for the different types of plates. in particular, we find that the number " " is associated with significantly higher winning bids, while the number " " is associated with significantly lower winning bids. controlling for other factors, an ordinary -digit plate with one extra " " was sold . % higher on average, while an ordinary -digit plate with one extra " " was sold % lower on average (both relative to the number " "). the corresponding estimates for ordinary -digit plates are . % and . %, respectively. these figures mean that, on average, for an ordinary -digit plate, replacing the number " " with the number " " would allow the plate to be sold at roughly usd$ more. on the other hand, replacing the number " " with the number " " would discount the plate by roughly usd$ . 
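a minimal sketch of the hedonic regression with standard errors clustered by auction date, using statsmodels on synthetic data. the variable names, coefficient magnitudes, and data below are illustrative assumptions, not the authors' dataset or estimates:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400

# synthetic plates: counts of a "lucky" and an "unlucky" digit, a prefix
# dummy, and an auction-date identifier for clustering
df = pd.DataFrame({
    "n_lucky": rng.integers(0, 3, n),       # count of the lucky digit
    "n_unlucky": rng.integers(0, 3, n),     # count of the unlucky digit
    "same_letters": rng.integers(0, 2, n),  # visually appealing prefix dummy
    "auction_date": rng.integers(0, 40, n),
})
# assumed data-generating process: +15% per lucky digit, -25% per unlucky one
df["ln_price"] = (9 + 0.15 * df.n_lucky - 0.25 * df.n_unlucky
                  + 0.3 * df.same_letters + rng.normal(0, 0.2, n))

# hedonic regression; errors clustered by auction date, as in the paper
fit = smf.ols("ln_price ~ n_lucky + n_unlucky + same_letters", df).fit(
    cov_type="cluster", cov_kwds={"groups": df["auction_date"]})
print(fit.params.round(2))
```

in a log-price specification like this one, a coefficient of 0.15 on the lucky-digit count corresponds to roughly a 15% premium per extra lucky digit, which is how the percentage premiums and discounts quoted in the text are read off the estimates.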
if we make the same replacements in an ordinary -digit plate, an " " adds usd$ to the price, while a " " reduces the price tag by usd$ . these numbers are significant even in real economic terms. on page of the article, the conditions are as follows: (i) no common value component, (ii) no asymmetric information among bidders about the marginal values of the observed product characteristics, (iii) no minimum bids or reserve prices, (iv) all bidders are ex ante symmetric, (v) all product characteristics are observable, and (vi) entry is exogenous and a dummy variable for the number of bidders is included in the regression. in another specification (available upon request), we use year dummies only, but we include the month-end stock market index (i.e., the hang seng index) to proxy for the macroeconomic variations. the results are very similar. estimating separately avoids the inherent difficulty of interpreting the estimated coefficients of dummies that are possible in some but not all types of plates. for instance, in a joint estimation of plates of all digits, the interpretation of the estimated coefficient of the dummy for aabb (two pairs in parallel) is ambiguous. the dummy is equal to if a -digit plate has this aabb pattern (e.g., jk ). however, it is equal to either if a -digit plate does not have this pattern or if it is impossible to have this pattern because the plate is not -digit. suppose the estimated coefficient is, say, %; then it is difficult to tell whether it means a % increase in the selling price with this pattern, or whether the possibility of having this pattern, together with the fact that the plate does have this pattern, commands a % premium. to be precise, woo & kwok ( ), and woo et al. 
( ) alleviate this concern by using the share of a number in the plate instead of the count of the number in their specification, which imposes a much milder restriction: that the effect of an increase in the share of a number on the price is constant across plates with different numbers of digits. the amounts in united states dollars are at real prices. we obtain similar results for special plates of all digits. in short, consistent with our hypothesis, an " " does carry a significant premium, while a " " does carry a significant discount. our results on the numbers and the patterns are also consistent with those in woo & kwok ( ), and woo et al. ( ), suggesting that their qualitative results remain robust under alternative econometric specifications and with data from a more extended period of time. our results also allow us to address one interesting question: to what extent can the value of superstitions, based on beliefs that are inherently irrational, be explained by economic intuition? we present three pieces of evidence in the next three sections suggesting that the responses of the value of superstitions to changes square well with economic intuition. section . . looks at how the value of superstitions varies with the macroeconomic environment. we first look into the change in the premium and the discount of an " " and a " ", respectively, across different types of plates. an analogy can explain the relevant economic intuition. suppose that rather than leaving next year's health to randomness, one could buy a healthy year from god. how large a share of her wealth would a person be willing to pay? the sooner she expects to die, the larger the share she is willing to give up. a year of good health weighs more in a shorter life. analogously, one's willingness to pay to acquire an " " (or to get rid of a " ") would increase, on average, across plates with fewer digits. the increase is due to the bigger share of a single number in plates with fewer digits. 
the results in table allow us to examine whether the estimates are consistent with this logic. the discounts on a " " are statistically significant across all types of plates. more interestingly, the size increases as we move from ordinary -digit plates to ordinary -digit plates, with the discount rising from % to . %. estimates for special plates exhibit a similar pattern: a larger discount when moving from -digit, to -digit, to -digit, and to -digit plates, with figures of . %, . %, . %, and . %, respectively. we find exactly the same pattern for the premium on an " ". an unlucky " " is bad, but it is worse on a -digit plate than on a -digit plate. on the other hand, a lucky " " is good, but it is best on a -digit plate. such a pattern is not generally true for other numbers. we also use the introduction of personalized plates as a natural experiment. after the introduction of personalized plates in , we ask whether the sizes of the premium on an " " and the discount on a " " changed, and if so, in which direction. although based on irrational beliefs, any particular superstition must have a demand function. economic intuition suggests that demand responds to exogenous change: the demand for a superstition should shift down as substitutes are introduced. expressing oneself by means of a personalized plate rather than a "lucky" plate did not become an option until . formally, in the regressions, we hypothesize that the coefficients of " " and " " differ before and after . precisely when the effect of personalized plates started is hard to say. we believe it is more reasonable to think that it started at the beginning of , when the bill was finally passed and people started reserving their personalized plates for auctions, rather than when the first auction for personalized plates was held. table shows the results. 
the sizes of the premium on an " " and the discount on a " " were universally reduced after for all the -digit (including ordinary and special) and -digit ordinary plates. the results of the wald tests on these three types of plates indicate that the estimated coefficients of a " " and an " " differ significantly before and after , suggesting that the introduction of personalized plates may have changed the value people place on superstitions. for instance, for an ordinary -digit plate on average, an " " carries a . % premium before but only . % after . the corresponding figures for an ordinary -digit plate are . % before and % after . the significance of the differences in the estimated coefficients is marginal for -, -, and -digit special plates. except for these three types of plates, which comprise a little less than % of the data, we see a universal pattern of changes in the premium and discount before and after . for which types of plates should the value of superstitions be most responsive to the introduction of personalized plates? consumer theory suggests that the degree of substitutability is larger in a pair of substitutes that sell in similar price ranges than in a pair that sell in very different price ranges. for instance, introducing a new cadillac model should affect the demand for the mercedes e-class more than that for the honda civic. although we do not have data on the price of personalized plates, we can infer their price range, as the government set their reserve price uniformly at $ . this is the minimum that anyone who bids for any personalized plate would have to pay. the average price therefore should be at least higher than $ (denominated by cpi, the amount is roughly ). note that our estimated coefficients imply that the effect of an increase in the share of a number on the price is not constant across plates of different digits. for instance, if the effect were the same, the estimated coefficient of num in column would have been − % ( / ) = . 
% instead of . % as in our estimation. table shows that the average real price for ordinary -digit plates was around $ before , and that for ordinary -digit plates was roughly $ . it is therefore reasonable to expect that personalized plates should be more appealing to people who would be more inclined to buy ordinary -digit plates rather than ordinary -digit plates. the degree of substitutability between ordinary -digit plates and personalized plates should be higher. we therefore expect that the impact of the introduction of personalized plates on the value of superstitions is larger on ordinary -digit plates than on ordinary -digit plates. table reports consistent results. the change we look into is the premium on replacing a '' " with an '' " in ordinary plates. on average, such a number replacement adds . % to the price of ordinary -digit plates before but only . % after . the percentage drop was . %. the corresponding figures are . % for the price of ordinary -digit plates before and . % after . the percentage drop was . %. ordinary -digit plates, priced in a range similar to that of personalized plates, respond to the introduction of personalized plates with a bigger magnitude than ordinary -digit plates. this section addresses the dynamics of the value of superstitions. we ask whether people value superstitions differently over time and across different macroeconomic environments. to address this question, we run the following regression on different types of plates:

ln(real price) = a + b(letter prefix characteristics) + c(number patterns) + d(number counts of '' " to '' ") + h((number counts of '' " to '' ") × (market condition)) + k(year-month dummies) + error.

the notations b, c, d, h, and k are vectors. we use the natural log of the hong kong hang seng index at the end of the month the plate was auctioned to proxy for the macroeconomic environment. 
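the interacted hedonic specification above can be illustrated with a small ordinary-least-squares sketch. everything below is hypothetical: the coefficient values, the variable names, and the simplified regressor set (digit counts plus one interaction with the log market index) are made up for illustration, and a tiny pure-python solver stands in for the econometric software actually used.

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved with Gaussian elimination. Fine for the handful of regressors
    in this sketch; a real analysis would use a numerics library."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for col in range(k):  # forward elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    beta = [0.0] * k  # back substitution
    for r in range(k - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c] for c in range(r + 1, k))) / xtx[r][r]
    return beta

# Hypothetical data: log price as a function of counts of the lucky and
# unlucky digits, with the unlucky-digit effect interacted with the log
# market index (all coefficients are invented for illustration).
random.seed(0)
rows, target = [], []
a, d_lucky, d_unlucky, h_unlucky = 8.0, 0.10, -0.15, 0.02
for _ in range(500):
    c_lucky = random.randint(0, 3)      # count of the lucky digit
    c_unlucky = random.randint(0, 3)    # count of the unlucky digit
    ln_idx = random.uniform(8.5, 10.5)  # log of the market index
    rows.append([1.0, c_lucky, c_unlucky, c_unlucky * ln_idx])
    target.append(a + d_lucky * c_lucky + d_unlucky * c_unlucky
                  + h_unlucky * c_unlucky * ln_idx)

beta = ols(rows, target)  # recovers [a, d_lucky, d_unlucky, h_unlucky]
```

with noise-free synthetic data the fitted coefficients match the generating ones, which makes the mechanics of the interaction term easy to inspect.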
the premium for each number ''j" now is d_j + h_j × (market condition). this specification allows the premium or discount of superstitions to vary across different macroeconomic environments. if we expect people to be more superstitious in bad times, they would discount a '' " even more in bad times. likewise, they would place a higher premium on an '' " in bad times. in terms of the estimated coefficients, the h on a '' " should be significantly positive, and the h on an '' " should be significantly negative. columns to of table show the results for both special and ordinary -digit and -digit plates, which comprise . % of our total observations. the results suggest that the size of the discount on a '' " is negatively associated with the market index: people tend to discount a '' " even more in bad times. for instance, the discount on a '' " in -digit ordinary plates is on average equal to − . % + . % ln(stock market index). a % drop in the stock index adds an extra . % to the size of the discount. a '' " is bad, but it is even worse in bad times. the results also suggest that an '' " is associated with a significant premium. for ordinary and special -digit plates, the premium does not change under different market conditions. for ordinary and special -digit plates, however, the premium tends to be even larger during bad times. for instance, the premium on an '' " in a -digit ordinary plate is on average equal to . % − . % ln(stock market index). a one percent drop in the stock market index adds an extra . % to the size of the premium. it is interesting to note that the premium on an '' " tends to be more stable than the discount on a '' ". for -digit ordinary plates, the premium on an '' " does not vary with the market condition, while the discount on a '' " does. for ordinary and special -digit plates, the discount on a '' " varies with the market condition in a greater proportion than the premium on an '' ". this is an interesting empirical pattern that the literature on superstitions has not documented. 
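as a worked example of how the interacted coefficients translate into a state-dependent discount, the sketch below evaluates d_j + h_j ln(market index) at two hypothetical index levels. the coefficient values and index levels are made up for illustration only.

```python
import math

def number_discount(d_j, h_j, market_index):
    """Discount (in log points) on a plate number under the interacted
    specification: d_j + h_j * ln(market index). All coefficient values
    passed in below are hypothetical, chosen only to show the direction
    of the effect."""
    return d_j + h_j * math.log(market_index)

# Made-up coefficients for an 'unlucky' digit: a baseline discount plus a
# positive interaction term, so the discount deepens when the index falls.
d_unlucky, h_unlucky = -0.60, 0.05
good_times = number_discount(d_unlucky, h_unlucky, 25000)  # high index
bad_times = number_discount(d_unlucky, h_unlucky, 20000)   # index ~20% lower
# bad_times is more negative than good_times: the discount is larger
# in magnitude when the market index is low.
```

the marginal effect of the market is just h_j times the change in the log index, which is why a fixed percentage drop in the index adds a fixed number of log points to the discount.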
we are not able to come up with a good explanation for these asymmetric effects. we do not, however, find a similar pattern in -digit and -digit plates, which comprise a little less than % of our data. note: robust standard errors in brackets (clustered by auction date). -, not applicable; no obs, no observation. a hypothesis: whether coefficients of num and num differ significantly between pre- and post- estimations. * p < . . ** p < . . *** p < . . one may argue that the significance of our results may well be attributed to number preferences per se rather than to superstitions; that is, people in hong kong simply systematically prefer '' " and dislike '' ", and such preferences have nothing to do with superstitions. to address this issue, we control for number preferences by controlling for the numbers that appear on a plate. we estimate the following equation on -digit ordinary plates:

ln(real price) = a + b(letter prefix characteristics) + d(ordered combinations) + k(year-month dummies) + error.

the bottom panel of table gives the definition of the variables we use. the idea is to exploit the fact that, even for the same set of numbers, what they rhyme similarly to depends on how they are ordered. numbers ordered in such a way that they rhyme similarly to some good (bad) phrase would carry a significant premium (discount), which has nothing to do with number preferences anymore. we estimate this equation on -digit ordinary plates separately for four sets of number combinations: ( ) '' ", '' ", and '' "; ( ) '' ", '' ", and '' "; ( ) '' ", '' ", and '' "; and ( ) '' ", '' ", and '' ". the most preferred plate number is '' ", which rhymes similarly to ''proper all the way". controlling for the numbers, if superstitions have value, then '' " would be associated with higher winning bids. column of table shows consistent results: a '' " plate is significantly more expensive than an '' " plate. 
on average, it was sold for close to three times the price of a '' " plate, because '' " is not associated with any phrase that makes sense in cantonese. the numbers '' ", '' ", '' ", and '' " rhyme similarly to ''i will be wealthy my entire life." columns and of table show that they are all associated with higher winning bids. the numbers '' " and '' " rhyme similarly to ''this life is prosperous" and ''business is profitable", respectively. column of table shows that these numbers are again associated with higher winning bids. the results suggest that number preferences alone cannot explain the price variations: it is what a plate number rhymes similarly to that determines its price. thus, plate numbers that allude to good phrases are associated with higher winning bids. to check whether our results also hold in the secondary market, we collect listed prices for license plates from a secondary-market seller. a caveat is that we only collect listed prices and not the actual transacted prices. we managed to collect eight days of listed prices of car plates spanning june to august . altogether we have , plates with their listed prices across the eight days; ( %) of them are -digit ordinary plates and ( %) are ordinary -digit plates. we estimate the same specifications as in the previous sections. table gives the estimation results. columns and of table illustrate that the results in table are robust. first, we continue to find that the ''lucky" number '' " carries a significant premium, while the ''unlucky" number '' " carries a significant discount. controlling for other factors, an ordinary -digit plate with one extra '' " was listed at . % higher on average, while one with an extra '' " was listed at . % lower. the corresponding estimates for an ordinary -digit plate are . % and . %, respectively. second, consistent with the results in section . . 
, an '' " commands a higher premium on a -digit plate than on a -digit plate. similarly, a '' " is discounted more on a -digit plate than on a -digit plate. columns - of table check whether the results in section . . are robust. the estimates suggest that after the introduction of personalized plates in , the discounts on a '' " in both ordinary -digit and -digit plates drop. note: robust standard errors in brackets (clustered by date). -, not applicable; no obs, no observation. * p < . . ** p < . . *** p < . . the number '' " rhymes similarly to ''all the way" (as an adverb); thus, '' " rhymes similarly to ''prosper all the way". both '' " and '' " are not associated with any phrase that makes sense in cantonese. however, '' ", '' ", and '' " are all variations, in terms of what they mean, of ''road to prosperity". again, '' ", '' ", '' ", and '' " do not rhyme similarly to any phrase that makes sense in cantonese. we rely on the seller's website as well as the internet archive (web.archive.org) to collect the data. we have contacted the seller to obtain more historical data and the transacted prices, but the seller refused to provide us with their transacted prices. the eight days are june , ; december , ; august , ; october , ; february , ; april , ; august , ; and august , . of the , listed prices we collected, surprisingly, . % (or ) of them are special plates. we deleted these plates because, as mentioned, transfers of special plates are prohibited by law. we therefore suspect that their prices reflect not only the values of the plates but also the service fees charged to help ''get around" the non-transferability rule. in another specification, we successfully match , ordinary -digit plates and ordinary -digit plates with the auction data (i.e., % of the data). we run the same specification as in columns and of table , with the auction price as an extra control variable. table shows the estimation results. 
they suggest that, controlling for all the variables, the past auction price of a plate continues to be a significant price determinant (statistically significant at the % level). a higher auction price, everything else being equal, leads to a higher listed price in the second-hand market. however, there is still a significant premium on an '' ". for a '' ", the discount is statistically significant for ordinary -digit plates but not for -digit plates. the premium on an '' " in ordinary -digit plates drops too. however, the premium on an '' " in ordinary -digit plates is roughly the same across the two periods. except for the premium on an '' " in -digit plates, the results are consistent with those in table . section . . argues that the introduction of personalized plates affects the premiums and discounts on numbers in ordinary -digit plates more than those in ordinary -digit plates, as personalized plates are closer substitutes to ordinary -digit plates than to ordinary -digit ones. we continue to observe this in table . the change we look into is the premium on replacing a '' " with an '' " on a plate. on average, such a number replacement adds . % to the price of ordinary -digit plates before but only . % after , which is a percentage drop of . %. the corresponding figures are . % for the price of ordinary -digit plates before and . % after . the percentage drop was . %. ordinary -digit plates respond to the introduction of personalized plates with a bigger magnitude than ordinary -digit plates. columns and re-estimate specification ( ). for ordinary -digit plates, we continue to find that people discount a '' " even more in bad times and value an '' " more in bad times. this is consistent with the results in section . . . the estimated coefficients for ordinary -digit plates have the expected signs, but they are statistically insignificant. 
overall, we find that the listed prices of secondary-market plate sellers exhibit very similar statistical patterns to the auction data, which lends further support to our main results. this paper estimates the value of a particular type of superstition: that a ''lucky" (''unlucky") number can bring good (bad) luck. we have shown that the value of superstitions can be economically significant. we believe that the results are consistent with the fact that superstitions persist over time. although we may not be the first to document the value of superstitions, we are the first in the economics literature to address the question of how such value changes over time and in response to other policy changes. interestingly, we find that the value of superstitions changes in ways that are consistent with economic intuition. the dataset we obtain and the exogenous policy change provide us with a unique opportunity to address this issue. we have also shown that some results are consistent with the view that people tend to be more superstitious in bad times. our results suggest that people tend to discount a bad number even more in bad times. however, people place a higher premium on a good number in bad times only for a small subset of plates. we conjecture that the value people attach to superstitions changes across different macroeconomic environments, but that the changes are asymmetric between positive and negative superstitions. by positive superstitions, we mean the false belief that some logically unrelated items or actions bring good luck; negative superstitions are the opposite case. we are not aware of any theory that distinguishes positive superstitions from negative superstitions. to conclude, while our empirical analysis documents the value of superstitions and how it changes over time and in response to exogenous changes, it also calls for explicit modeling of different types of superstitions in order to understand the empirical findings. 
we hope that our study will motivate theoretical research on this particular issue. the data on reserve prices of special plates are not publicly available. for other data, the transport department claimed that they did not keep historical records. the authors would like to thank the editor and the two anonymous referees for their helpful comments. we would also like to thank jiahua che and tat-kei lai for their helpful comments, and victor kong for his able research assistance. we appreciate the help of the transport department of the hong kong sar government in providing us the data. 
key: cord- -hk bzqm authors: cintia, paolo; fadda, daniele; giannotti, fosca; pappalardo, luca; rossetti, giulio; pedreschi, dino; rinzivillo, salvo; bonato, pietro; fabbri, francesco; penone, francesco; savarese, marcello; checchi, daniele; chiaromonte, francesca; vineis, paolo; guzzetta, giorgio; riccardo, flavia; marziano, valentina; poletti, piero; trentini, filippo; bella, antonino; andrianou, xanthi; manso, martina del; fabiani, massimo; bellino, stefania; boros, stefano; urdiales, alberto mateo; vescio, maria fenicia; brusaferro, silvio; rezza, giovanni; pezzotti, patrizio; ajelli, marco; merler, stefano title: the relationship between human mobility and viral transmissibility during the covid- epidemics in italy date: - - journal: nan doi: nan sha: doc_id: cord_uid: hk bzqm we describe in this report our studies to understand the relationship between human mobility and the spreading of covid- , as an aid to manage the restart of the social and economic activities after the lockdown and to monitor the epidemics in the coming weeks and months. we compare the evolution (from january to may ) of the daily mobility flows in italy, measured by means of nation-wide mobile phone data, and the evolution of transmissibility, measured by the net reproduction number, i.e., the mean number of secondary infections generated by one primary infector in the presence of control interventions and human behavioural adaptations. we find a striking relationship between the negative variation of mobility flows and the net reproduction number, in all italian regions, between march th and march th, when the country entered the lockdown. this observation allows us to quantify the time needed to "switch off" the country's mobility (one week) and the time required to bring the net reproduction number below (one week). a reasonably simple regression model provides evidence that the net reproduction number is correlated with a region's incoming, outgoing and internal mobility. 
we also find a strong relationship between the number of days above the epidemic threshold before the mobility flows reduce significantly as an effect of lockdowns, and the total number of confirmed sars-cov- infections per k inhabitants, thus indirectly showing the effectiveness of the lockdown and the other non-pharmaceutical interventions in the containment of the contagion. our study demonstrates the value of "big" mobility data for the monitoring of key epidemic indicators, to inform choices as the epidemic unfolds in the coming months. understanding the relationship between human mobility patterns and the spreading of covid- is crucial to the restart of social and economic activities, limited or put in "stand-by" during the national lockdown to contain the diffusion of the epidemics, and to the monitoring of the risk of a resurgence during the current phase , or lockdown exit. recent analyses document that, following the national lockdown of march th, the mobility fluxes in italy have significantly decreased by % or more, everywhere in the country, as studied in our previous report [ ] and in [ , ]. in this report we study the relation between human mobility and sars-cov- transmissibility before, during and after the national lockdown. we compare the flows of people between and within italian regions with the net reproduction number r t , i.e., the mean number of secondary infections generated by one primary infector in the presence of control interventions and human behavioural adaptations. to pursue this goal, we use mobile phone data at the national scale to reconstruct the self-, in- and out-flows of italian regions before and during the national lockdown (initiated on march th, ), after the closure of non-essential productive and economic activities (march th, ), and after the partial restart of economic activities and within-region movements (the "phase ", from may th, ). 
in this report, we address the following analytical questions:
• how does the net reproduction number vary in relation to the variation of mobility flows?
• what differences, if any, do we observe across the italian regions?
• can we relate the delay in limiting human mobility to the rate of positive covid- cases across the population?
the answers to these questions are highlighted in the next sections. an interactive, dynamically updated version of this report is available at http://sobigdata.eu/covid_report/#/report
in this report, we rely on mobile phone data, which have proven to be a useful data source to track the time evolution of human mobility [ , , ], and thus a tool for monitoring the effectiveness of control measures such as movement restrictions and physical distancing [ , , ]. specifically, the raw data used in this report are the result of normal service operations performed by the mobile operator windtre: cdrs (call detail records) and xdrs (extended detail records). in both cases, the fundamental geographical unit is the "phone cell", defined as the area covered by a single antenna, i.e., the device that captures mobile radio signals and keeps the user connected with the network. multiple antennas are usually mounted on the same tower, each covering a different direction. the position of the tower (expressed as latitude and longitude) and the direction of the antenna allow inferring the extension of the corresponding phone cell. the position of caller and callee is approximated by the corresponding antenna serving the call, whose extension is relatively small in urban contexts (in the order of m x m) and much larger in rural areas (in the order of km x km or more). based on this configuration, cdrs describe the location of mobile phone users during call activities and xdrs their location during data transmission for internet access. 
the information content provided by standard cdrs and xdrs is the following. in both cdrs and xdrs, the identity of the users is replaced by artificial identifiers. the correspondence between such identifiers and the real identities of the users is known only to the mobile phone operator, who might use it in case of necessity. this pseudonymization procedure is a first important step (mentioned in article ( ) and article ( ) of the gdpr, the eu general data protection regulation) to provide anonymity [ , , ], and the data are then turned into totally anonymous form for subsequent use. for the analyses in this report, we used aggregated data computed by the mobile operator covering the period january th, to may th, . for each phone call, a tuple ⟨n_o, n_i, t, a_s, a_e, d⟩ is recorded, where n_o and n_i are pseudo-anonymous identifiers of the "caller" and the "callee", respectively; t is a timestamp saying when the call was placed; a_s and a_e are the identifiers of the towers/antennas to which the caller was connected at the start and end of the call; finally, d is the call duration (e.g., in minutes). xdrs are similar to cdrs, except that the communication is only between the antenna and the connected mobile phone, and an amount k of kilobytes is downloaded in the process. the format of an xdr is, therefore, a tuple ⟨n, t, a, k⟩. (windtre is one of the main mobile phone operators in italy, covering around % of the residential "human" mobile market.)
origin-destination matrices - for a better matching with the available covid- data (number of positive cases and net reproduction number), we aggregated the municipality-to-municipality origin-destination matrices (ods) into province-to-province or region-to-region ods, in which each node represents an italian province or region. 
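the od-matrix pipeline described here and detailed next (aggregating trips into out-, in- and self-flows, suppressing small flows, rescaling by operator-provided coefficients, and measuring network density) can be sketched in a few lines. the region names, trip counts, cutoff value and scaling coefficients below are all illustrative, not the actual data or parameters.

```python
from collections import defaultdict

def aggregate_od(trips, min_flow=0, coeffs=None):
    """Aggregate (origin, destination, count) trip records into an OD
    matrix, separating self-flows from between-region flows. The
    minimum-flow cutoff and the operator-provided scaling coefficients
    mirror the steps described in the text; names are illustrative."""
    coeffs = coeffs or {}
    od = defaultdict(float)
    for origin, dest, count in trips:
        # Rescale by the operator's coefficient (hypothetical values).
        od[(origin, dest)] += count * coeffs.get(origin, 1.0)
    # Suppress small flows, as done for privacy reasons.
    od = {k: v for k, v in od.items() if v >= min_flow}
    out_flow = defaultdict(float)
    in_flow = defaultdict(float)
    self_flow = defaultdict(float)
    for (o, d), v in od.items():
        if o == d:
            self_flow[o] += v  # internal mobility of the region
        else:
            out_flow[o] += v
            in_flow[d] += v
    return od, out_flow, in_flow, self_flow

def network_density(od, n_regions):
    """Proportion of the possible directed region-to-region connections
    (self-loops excluded) that carry a non-null flow."""
    links = sum(1 for (o, d) in od if o != d)
    return links / (n_regions * (n_regions - 1))

# Toy trips between three regions (counts are made up).
trips = [("lombardy", "veneto", 50), ("lombardy", "lombardy", 400),
         ("veneto", "lombardy", 30), ("veneto", "lazio", 2)]
od, out_f, in_f, self_f = aggregate_od(trips, min_flow=10)
density = network_density(od, n_regions=3)
```

on this toy input the small veneto-to-lazio flow is suppressed by the cutoff, so only two of the six possible directed links survive, which is exactly what the density measure captures.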
in particular, for each day, we compute both the out-flows, indicating the total number of people moving from a province/region to any other province/region, and the in-flows, indicating the total number of people moving to a province/region from any other province/region. the trips between municipalities of the same province/region are aggregated into a self-flow, which indicates the province/region's internal mobility. for privacy reasons, we eliminate all out-, in- and self-flows with values lower than . as they are calculated by the operator, we store the daily municipality-to-municipality od matrices and the daily region-to-region ones in a relational dbms and access them through calls to a dedicated api. we normalize the self-, in- and out-flows by multiplying them by coefficients provided by the mobile phone operator, which indicate an estimation of the market share for every municipality. after this transformation, we have an estimation of the real size of the mobility flow between each origin and destination municipality. for ease of readability, figures , and visualize the od matrix of flows between italian regions on february th (before the initiation of the national lockdown on march th), march th (during the lockdown), and may th (during phase ), respectively. we find that the od matrix becomes significantly more sparse during the lockdown and the phase , denoting a drastic reduction of the routes between italian regions. numerically, we estimate this sparsity through the network density, i.e., the proportion of the potential connections in a network that are actual connections. we find that the network density halves during the lockdown: it decreases from d feb = . to d mar = . , indicating that the lockdown erases half of the possible connections between regions compared to the previous period. network density remains almost unchanged between the lockdown and the phase (d may = . 
), presumably because movements between regions are still forbidden by law except for specific circumstances (e.g., commuting for work). these results clearly highlight the drastic change in the structure of the human mobility network between regions in the two periods taken into consideration. indeed, most of the regional out-flows go towards adjacent regions. for example, before the lockdown, most of lombardy's out-flow is directed towards veneto, piedmont and emilia-romagna (adjacent regions), and the rest of the out-flows distribute more or less uniformly across all other regions, both in the north and the south (figure ). in contrast, during the lockdown, the number of these more modest out-flows decreases substantially, and most of them disappear altogether (figure ). [figure captions: the width of the arrows is proportional to the flow between the two regions; the densities of the flow networks are d feb = . , d mar = . , and d may = . ; numbers in parenthesis indicate the out-flow (left) and the in-flow (right).] flows between italian regions - another important aspect of the mobility of a region or a province, complementary to the volumes of incoming and outgoing flows, is the diversification of the provenance and the destination of people. 
specifically, we define the in-flow diversity of a province a as the normalized shannon entropy of the in-flows to the province [ ]:

$E_{in}(a) = -\frac{1}{\log N} \sum_{x \in P_{in}} p(x) \log p(x)$

where p_in is the set of provinces with non-null flow to province a, p(x) is the probability that the in-flow to province a comes from province x, and log(n) is a normalization factor, n being the number of italian provinces. the out-flow diversity of province a is computed similarly as:

$E_{out}(a) = -\frac{1}{\log N} \sum_{x \in P_{out}} p(x) \log p(x)$

where p_out is the set of provinces with non-null flow from province a, and p(x) is the probability that the out-flow from province a goes to province x. mobility diversity during the pre-lockdown and lockdown periods has been studied in our first report [ ]. 
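the diversity indicator defined above is straightforward to compute: probabilities over the non-null flows, entropy, then division by log(n) so the result lies in [0, 1]. the in-flows below are invented for illustration.

```python
import math

def flow_diversity(flows, n_provinces):
    """Normalized Shannon entropy of a province's in- (or out-) flows:
    probabilities over non-null flows, divided by log(N) so the value
    lies in [0, 1]. Province names and flow values are illustrative."""
    values = [f for f in flows.values() if f > 0]
    total = sum(values)
    probs = [f / total for f in values]
    return -sum(p * math.log(p) for p in probs) / math.log(n_provinces)

# Hypothetical in-flows to one province from three others. The text
# normalizes by the number of Italian provinces; 107 is used here as an
# assumed N, but any N > 1 works for the formula.
in_flows = {"milano": 120.0, "torino": 60.0, "bologna": 20.0}
diversity = flow_diversity(in_flows, n_provinces=107)
```

a province receiving comparable flows from many provinces scores close to 1, while one fed by a single dominant origin scores close to 0, which is what makes the indicator complementary to the raw flow volumes.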
we compare the evolution of the out-, in- and self-flows with the evolution of the daily disease transmissibility in italian regions, measured in terms of the net reproduction number r t . the net reproduction number represents the mean number of secondary infections generated by one primary infector, in the presence of control interventions and human behavioural adaptations. when r t decreases below the epidemic threshold of , the number of new infections begins to decline. the estimates of r t were computed from the daily time series of new cases by date of symptom onset. 
Case-based surveillance data used for estimating R_t were collected by regional health authorities and collated by the Istituto Superiore di Sanità using a secure online platform, according to a progressively harmonized track-record. Data include, among other information, the place of residence, the date of symptom onset and the date of first hospital admission for laboratory-confirmed COVID-19 cases [ ]. The distribution of the net reproduction number R_t was estimated by applying a well-established statistical method [ ] [ ] [ ], which is based on the knowledge of the distribution of the generation time and on the time series of cases. In particular, the posterior distribution of R_t for any time point t was estimated by applying Metropolis-Hastings MCMC sampling to a likelihood function defined as follows:

$\mathcal{L} = \prod_{t} P\Big(C(t);\ R_t \sum_{s=1}^{t} \varphi(s)\, C(t-s)\Big)$

where P(κ;λ) is the probability mass function of a Poisson distribution (i.e., the probability of observing κ events if these events occur with rate λ); C(t) is the daily number of new cases having symptom onset at time t; R_t is the net reproduction number at time t to be estimated; φ(s) is the probability distribution density of the generation time evaluated at time s. As a proxy for the distribution of the generation time, we used the distribution of the serial interval, estimated from the analysis of contact tracing data in Lombardy [ ], i.e., a gamma function with shape . and rate . , having a mean of . days. This estimate is within the range of other available estimates for SARS-CoV-2 infections, i.e. between and . days [ ] [ ] [ ].
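A minimal sketch of evaluating this likelihood for a candidate R_t on one day follows. The incidence series and the discretized generation-time distribution are invented placeholders; in the paper such a likelihood is plugged into Metropolis-Hastings MCMC to obtain the posterior of R_t:

```python
import math

def poisson_pmf(k, lam):
    """P(k; lam): probability of observing k events at rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def log_likelihood(r_t, cases, phi, t):
    """Log-likelihood of a candidate R_t for day t.

    cases: daily counts C(0..t); phi: discretized generation-time
    distribution, phi[s] = probability of a generation interval of
    s days (phi[0] unused).  The expected incidence at t is R_t times
    the past incidence weighted by phi.
    """
    lam = r_t * sum(phi[s] * cases[t - s]
                    for s in range(1, min(t, len(phi) - 1) + 1))
    return math.log(poisson_pmf(cases[t], lam))

# Toy, growing incidence and a short made-up generation-time distribution.
cases = [5, 8, 12, 20, 30, 45]
phi = [0.0, 0.2, 0.5, 0.3]
ll_high = log_likelihood(2.0, cases, phi, t=5)
ll_low = log_likelihood(0.5, cases, phi, t=5)
```

For this growing toy series, a candidate R_t of 2.0 fits the observed count better than 0.5, so its log-likelihood is higher; an MCMC sampler would concentrate posterior mass accordingly.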
The number of COVID-19 positive cases is provided by Protezione Civile, the Italian public institution in charge of monitoring the COVID-19 emergency.
They collect data from every Italian administrative region and make them available on a public GitHub repository [ ]. For each region, we focus on the number of new positive cases per day. Specifically, given a day g, we compute the average of the values over the four days before and the four days after g.

The relationship between human mobility and viral transmissibility during the COVID-19 epidemics in Italy

Figures and show the evolution of the mobility self-flows (blue curves), the net reproduction number (orange curves) and the number of positive cases (grey curves) for the northern regions and central-southern regions, respectively. These curves reveal numerous interesting insights. All regions show a net decrease of the self-flow soon after the first national lockdown (March th [ ]). The flows stabilized at the new, reduced volume after about one week. Subsequent restriction ordinances, such as the closing of non-essential economic activities on March th [ ], had a minor impact on the reduction of self-flows. For almost all regions, we find an increase of the self-flow since the start of phase on May th. This behaviour is particularly pronounced for Emilia-Romagna, Toscana, Puglia, and Lazio. Further investigation is needed to understand why these regions had such a marked increase. Interestingly, we also find a slight increase in the self-flows approaching May th, the start of "phase", during which a wider range of movement within regions has been allowed by the government. We interpret this result as a progressive, although slight, relaxation of compliance with the mobility limitations imposed by the lockdown. The case of Molise is different and particularly compelling: it is indeed the only region for which the self-flow decreases since May th (figure).
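The smoothing described above — the average over the four days before and the four days after a given day g, i.e., a nine-day centered window, here assumed to include day g itself — can be sketched as:

```python
def centered_average(series, g, half_window=4):
    """Average of series over the half_window days before and after
    day g, including g itself (a (2*half_window + 1)-day centered
    window), truncated at the boundaries of the series."""
    lo = max(0, g - half_window)
    hi = min(len(series), g + half_window + 1)
    window = series[lo:hi]
    return sum(window) / len(window)

# Invented daily case counts for one region.
daily_cases = [10, 12, 9, 14, 20, 18, 25, 30, 28, 33, 40]
smoothed = [centered_average(daily_cases, g) for g in range(len(daily_cases))]
```

Near the start and end of the series the window is truncated, so the first and last points are averaged over fewer days.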
The reason for this decrease may be due to news media coverage about a funeral on April th, attended by a large number of people, which resulted in a large local outbreak. This may have induced part of the Molise population to self-restrict movements during the following days.

https://www.ilfattoquotidiano.it/ / / /coronavirus-nuovo-focolaio-in-molise- -contagi-a-campobasso-legati-a-un-funerale-della-comunita-rom/ /

Although the date when R_t = 1 for the first time varies from region to region, for all regions R_t decreases concurrently with the net decrease of self-flows due to the beginning of the national lockdown (figures and ), highlighting the importance of the government intervention. Note that R_t starts taking values lower than 1 since March th, when the self-flows stabilize at the new, reduced volume. From that moment on, self-flows remain stable. Still, R_t continues decreasing. This may be due to other ordinances by local and national governments related to the wearing of masks and gloves in public areas, social distancing and the ban on gatherings -- and possibly to other factors to be further investigated.

Evolution between January th and May th of self-flow (blue curve), net reproduction number R_t (orange curve) and moving average of the number of confirmed SARS-CoV-2 infections (grey curve) in the northern regions of Italy.
For each day, we plot the average of the R_t values of the three days before and after that day. The orange-shaded area indicates R_t > 1. The value of the grey curve for a given day is computed as the average of the number of confirmed SARS-CoV-2 infections over the four days before and the four days after that day. The vertical dashed lines indicate the beginning of the national lockdown (LD, March th), the closing of non-essential economic activities (CNA, March th) and the partial restarting of economic activities and within-region movements ("PH", May th). The area in white indicates the period before mobility reduction (MR) in that region. Note that the beginning of MR does not necessarily coincide with the national lockdown (e.g., see Lombardia).

Evolution between January th and May th of self-flows (blue curve), net reproduction numbers (orange curve) and moving average of the number of confirmed SARS-CoV-2 infections (grey curve) in the central-southern regions of Italy. For each day, we plot the average of the R_t values of the three days before and after that day. The orange-shaded area indicates R_t > 1. The value of the grey curve for a given day is computed as the average of the number of confirmed SARS-CoV-2 infections over the four days before and the four days after that day. The vertical dashed lines indicate the beginning of the national lockdown (LD, March th), the closing of non-essential economic activities (CNA, March th) and the partial restarting of economic activities and within-region movements ("PH", May th). The area in white indicates the period before mobility reduction (MR) in that region. Note that the beginning of MR does not necessarily coincide with the national lockdown.

Can we estimate the value of R_t of Italian regions (or provinces) from the mobility of their population?
This is a complex question, which we address here only preliminarily -- focusing on in- and out-flow diversity, switching from the regional level used for the analysis to the level of provinces. A first cut regression model for estimating the daily R_t as a function of mobility diversity is specified as follows:

$R_{it} = \alpha_i + \beta_1\, indiversity_{it} + \beta_2\, outdiversity_{it} + \delta_t + \varepsilon_{it}$
where α_i indicates the fixed effect of province i (to control for the non-observable heterogeneity between the provinces), indiversity_it and outdiversity_it are the daily in- and out-flow diversities E_in(i) and E_out(i) for province i on day t, δ_t indicates the fixed effect of day t, ε_it are stochastic error residuals, and R_it (the outcome variable) is the net reproduction number estimated for day t and province i. We considered the Italian provinces for which a sufficient number of symptomatic cases had been recorded for a reliable computation of the estimate. As an outcome of regressing the daily R_t on the in- and out-flow diversities of the same day, we find that outdiversity_it contributes to reduce the R_t and indiversity_it to increase it, but these effects are not statistically significant. The picture changes substantially if we introduce time lags, e.g., through the model:
$R_{it} = \alpha_i + \sum_{j=1}^{7} \left(\beta_{1j}\, indiversity_{i,t-j} + \beta_{2j}\, outdiversity_{i,t-j}\right) + \delta_t + \varepsilon_{it}$
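The panel regression with province fixed effects and a week of daily lags can be sketched with ordinary least squares on synthetic data. Everything below is simulated for illustration (three hypothetical provinces, made-up diversity series, and day fixed effects omitted for brevity); the real analysis uses the estimated province-level R_t:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy panel: 3 hypothetical provinces, 60 days of diversity values.
n_prov, n_days, lags = 3, 60, 7
indiv = rng.uniform(0.2, 0.9, size=(n_prov, n_days))
outdiv = rng.uniform(0.2, 0.9, size=(n_prov, n_days))
alpha = np.array([1.0, 1.2, 0.8])  # province fixed effects

# Simulated outcome: R_t rises with lagged in-flow diversity and falls
# with lagged out-flow diversity (the signs reported in the text).
r = np.empty((n_prov, n_days))
for i in range(n_prov):
    for t in range(n_days):
        win_in = indiv[i, max(0, t - lags):t].mean() if t > 0 else 0.0
        win_out = outdiv[i, max(0, t - lags):t].mean() if t > 0 else 0.0
        r[i, t] = alpha[i] + 0.5 * win_in - 0.5 * win_out + rng.normal(0, 0.01)

# OLS design: province dummies plus 7 daily lags of each diversity measure.
rows, y = [], []
for i in range(n_prov):
    for t in range(lags, n_days):
        dummies = [1.0 if i == k else 0.0 for k in range(n_prov)]
        rows.append(dummies
                    + list(indiv[i, t - lags:t])
                    + list(outdiv[i, t - lags:t]))
        y.append(r[i, t])

coef, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
in_effect = coef[n_prov:n_prov + lags].sum()   # cumulative in-diversity effect
out_effect = coef[n_prov + lags:].sum()        # cumulative out-diversity effect
```

On this synthetic panel the estimated cumulative effect of lagged in-flow diversity is positive and that of lagged out-flow diversity negative, mirroring the direction of the effects described in the text.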
in which the regressors cover the entire week before the measurement of R_t. In table (column ), we consider the fixed effects of province only, while in column we add the fixed effect for day. For j = , we find that indiversity_{it-j} contributes to increase the R_it and outdiversity_{it-j} to reduce it, with statistical significance at %. If we increase the lagging period past one week (j ≥ ), we find a stronger statistical significance (table , columns and ). For out-flow diversity, the closer the day is to the date of the R_t (j < ), the stronger the impact on contagion. Conversely, for in-flow diversity, the further away the day is from the date of the R_t (j ≥ ), the stronger the impact on contagion. Figure reports the temporal profile of the coefficients estimated in column .

Robust standard errors in brackets -- *** p < ., ** p < ., * p < . Robust errors clustered by province; country fixed effects included.

Table: influence of the variables on disease transmissibility, measured as the net reproduction number R_t.

With entropic measures we cannot control for movements within each province.
If we replace mobility diversity with the in-flows, out-flows and self-flows of each province (number of individuals registered as either changing provinces or moving within the province), we find similar statistical significance for mobility in the previous week, confirming that the outflow of people contributes to contagion reduction, while the arrival of people from outside the province or internal mobility raises it. While a more detailed modeling of the association between contagion and mobility is certainly needed, and may lead to additional insights, these preliminary results -- together with the analysis of self-flows presented in the previous section -- provide clear evidence for a critical role of human mobility in the spatio-temporal unfolding of the epidemics. We go one step further in our analysis of the relationships between mobility flows and contagion by computing two quantities for each region: ) the delay in mobility reduction, i.e., the number of days in which R_t > 1 before the mobility flows of a region decrease by at least % w.r.t. the usual (pre-epidemics) weekly mobility, observed over January and the first two weeks of February, and ) the total number of reported SARS-CoV-2 infections per k inhabitants in the region (as of May th). The date of mobility reduction below % for each region is indicated as the MR black vertical line in the time series of figures and . Figure shows a scatter plot of the two quantities for all regions, where the size of the circle of each region is proportional to the total number of reported infections in the observation period; the positive correlation between the two quantities is robust (Pearson coefficient = . , p < . , r = . ), suggesting that larger delays could have induced heavier spreading of the virus. This is strong evidence that timely lockdowns are instrumental for better containment of the contagion. Two further considerations follow.
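The Pearson correlation between lockdown delay and cumulative incidence can be computed as follows. The per-region delays and incidence values below are invented placeholders, not the paper's figures:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-region values: days of delay in mobility reduction
# after R_t first exceeded 1, and cumulative cases per 100k inhabitants.
delays = [2, 4, 5, 9, 12, 13, 15]
incidence = [40, 55, 50, 180, 300, 260, 410]
r_coef = pearson(delays, incidence)
```

With these toy values the coefficient is strongly positive, the qualitative pattern the scatter plot illustrates: regions that reduced mobility later accumulated more infections per inhabitant.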
The mobility reduction in Lombardy started around days after the first day in which R_t > 1, leading to the highest number of positive cases per inhabitant in Italy. Similarly, for other regions severely affected by the virus, such as Liguria, Emilia-Romagna and Piedmont, the mobility reduction started around and days, respectively, after the first day in which R_t > 1. The central-western regions in north Italy lie above the dashed regression line, in the top right part of the plot. The regions below this line, in the bottom right part of the plot, were more effective than the regions above the line in containing the contagion despite a delay in lockdown of days or more, such as Veneto, Lazio, and Tuscany. This fact may be explained by several factors, including the effectiveness of the epidemic surveillance, the intensity of the testing and tracing strategy adopted, the capacity of outbreak containment, and also the absolute number of cases when R_t jumps above 1. On the other hand, for southern regions the mobility reduction started with around days of delay (Molise, Basilicata) or around days (Campania, Puglia, Sicily, Calabria), which presumably was effective in containing the spread of the virus: all southern regions (with the only exception of Molise) are below the regression line in the bottom left part of the plot (low number of infections per k inhabitants). Central-southern regions are the ones that benefited the most from the lockdown, presumably because it started more timely. This brings further evidence of the effectiveness of the lockdown. In the scatter plot, the horizontal axis shows the number of days between the first time R_t > 1 and the beginning of the national lockdown. The vertical axis shows the cumulative incidence of confirmed SARS-CoV-2 infections per k inhabitants (as of May th). The size of the circles is proportional to the total number of positive cases in the period (Pearson coefficient = . , p < . , r = . ).
Our combined analysis of mobility and epidemics highlighted a striking relation between the negative variation of movement fluxes and the negative variation of the net reproduction number, in all Italian regions, in a time interval of approximately one week, from March th till March th, during the transition between the two mobility modalities. During this week, the two curves gracefully overlap; at the end of this week, the country had reached a new "stable" mobility regime, at approximately % of the pre-lockdown level. The two curves exhibit the same pattern everywhere, both at regional and provincial level, with minimal temporal lags. We call this phenomenon, represented schematically in the figure, the "epi-mob" pattern. Mobility, the blue curve, is a "switch" between two very different levels, before and during the lockdown, with an exponential fall from the first to the second; the epidemic curve is a peaked distribution with an exponential growth and fall, overlapping with the "switch" during the fall. The presumable effectiveness of the lockdown for the containment of the epidemics is further substantiated by the pattern in the figure relating the timeliness of lockdown in every region to the total number of infected individuals per k inhabitants. We can also quantify the time needed to "switch off" the country's mobility (approx. one week to reach the new lower regime) and the time needed to bring the net reproduction number below 1 (again, approx. one week). Notice that R_t continues to slowly decrease during lockdown, and also that at the beginning of phase (lockdown exit), when mobility begins to rise again, R_t does not jump into a new uncontrolled growth, at least until May th, the last day of observation in this report.
This is probably due to the non-pharmaceutical interventions in place, including the increased compliance of people to use personal protective equipment and respect social distancing (compared to the pre-lockdown phase). Another factor limiting the increase of R_t in this phase could also be related to an increased ability to trace, test and isolate infected individuals. We have also shown that a simple regression model provides reasonable estimates of the R_t in a given region (or province) as a function of the in- and out-flows. Clearly, the accuracy of the estimation is influenced by the changes in containment interventions and in citizens' behaviour towards prevention measures. Consequently, learning a regression model for R_t during pre-lockdown and early lockdown and applying the model to predict the R_t during phase would probably lead to overestimating it; however, such a worst-case scenario might be useful for comparison with the actual measured R_t, as a means to evaluate the effectiveness of the containment policy in place and citizens' social behavior. An interesting point that calls for further study is how to continuously learn an estimation model for the R_t, whose accuracy is continuously monitored, for the purpose of nowcasting the R_t, shortening as much as possible the lag of time to wait until the R_t becomes known. As a conclusion, we believe that this study demonstrated the value of "big" mobility data, a detailed proxy of human behavior available every day in real time, for the purpose of refining our understanding of the dynamics of the epidemics, reasoning on the effectiveness of policy choices for non-pharmaceutical interventions and on citizens' compliance to social distancing measures, and helping monitor key epidemic indicators to inform choices as the epidemics unfolds in the coming months.

During the first week of lockdown, the two curves describing mobility flows and the net reproduction number gracefully overlap.
At the end of this week, the country has reached a new "stable" mobility regime, at approximately % of the pre-lockdown level. The two curves exhibit the same pattern everywhere, both at regional and provincial level, with minimal temporal lags.

Appendix (mobility in-flow)

For each day, we plot the average of the R_t values of the three days before and after that day. The orange-shaded area indicates R_t > 1. The value of the grey curve for a given day is computed as the average of the number of confirmed SARS-CoV-2 infections over the four days before and the four days after that day. The vertical dashed lines indicate the beginning of the national lockdown (LD, March th), the closing of non-essential economic activities (CNA, March th) and the partial restarting of economic activities and within-region movements ("PH", May th). The area in white indicates the period before mobility reduction (MR) in that region. Note that the beginning of MR does not necessarily coincide with the national lockdown.

Evolution between January th and May th of out-flow (blue curve), net reproduction number R_t (orange curve) and moving average of the number of confirmed SARS-CoV-2 infections (grey curve) in the northern regions of Italy. For each day, we plot the average of the R_t values of the three days before and after that day. The orange-shaded area indicates R_t > 1. The value of the grey curve for a given day is computed as the average of the number of confirmed SARS-CoV-2 infections over the four days before and the four days after that day. The vertical dashed lines indicate the beginning of the national lockdown (LD, March th), the closing of non-essential economic activities (CNA, March th) and the partial restarting of economic activities and within-region movements ("PH", May th).
The area in white indicates the period before mobility reduction (MR) in that region. Note that the beginning of MR does not necessarily coincide with the national lockdown (e.g., see Lombardia).

Evolution between January th and May th of out-flows (blue curve), net reproduction numbers (orange curve) and moving average of the number of confirmed SARS-CoV-2 infections (grey curve) in the central-southern regions of Italy. For each day, we plot the average of the R_t values of the three days before and after that day. The orange-shaded area indicates R_t > 1. The value of the grey curve for a given day is computed as the average of the number of confirmed SARS-CoV-2 infections over the four days before and the four days after that day. The vertical dashed lines indicate the beginning of the national lockdown (LD, March th), the closing of non-essential economic activities (CNA, March th) and the partial restarting of economic activities and within-region movements ("PH", May th). The area in white indicates the period before mobility reduction (MR) in that region. Note that the beginning of MR does not necessarily coincide with the national lockdown.
Mobile phone data and COVID-19: missing an opportunity
Aggregated mobility data could help fight COVID-19
Measuring levels of activity in a changing city: a study using cellphone data streams
On the privacy-conscientious use of mobile phone data
A survey of results on mobile phone datasets analysis
Returners and explorers dichotomy in human mobility
PRIMULE: privacy risk mitigation for user profiles
A data mining approach to assess privacy risk in human mobility data
An analytical framework to nowcast well-being using mobile phone data
COVID-19 outbreak response: first assessment of mobility changes in Italy following lockdown
(So) Big Data and the transformation of the city
Epidemiological characteristics of COVID-19 cases in Italy and estimates of the reproductive numbers one month into the epidemic
WHO Ebola Response Team. Ebola virus disease in West Africa - the first months of the epidemic and forward projections
A new framework and software to estimate time-varying reproduction numbers during epidemics
Measurability of the epidemic reproduction number in data-driven contact networks
Serial interval of novel coronavirus (COVID-19) infections
Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China
Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia
Mobile phone data analytics against the COVID-19 epidemics in Italy: flow diversity and local job markets during the national lockdown
Human mobility in response to COVID-19 in France
Ulteriori disposizioni attuative del decreto-legge febbraio, n.
, recante misure urgenti in materia di contenimento e gestione dell'emergenza epidemiologica da COVID-19, applicabili sull'intero territorio nazionale
Misure di potenziamento del servizio sanitario nazionale e di sostegno economico per famiglie, lavoratori e imprese connesse all'emergenza epidemiologica da COVID-19

key: cord- - o xniz
authors: Ren, Zongyuan; Liao, Huchang; Liu, Yuxi
title: Generalized Z-numbers with hesitant fuzzy linguistic information and its application to medicine selection for the patients with mild symptoms of the COVID-19
date: - -
journal: Comput Ind Eng
doi: . /j.cie. .
sha:
doc_id: cord_uid: o xniz

Fuzzy set theory and a series of theories derived from it have been widely used to deal with uncertain phenomena in multi-criterion decision-making problems. However, few methods except the Z-number consider the reliability of information. In this paper, we propose a multi-criterion decision-making method based on the Dempster-Shafer (DS) theory and generalized Z-numbers. To do so, inspired by the concept of the hesitant fuzzy linguistic term set, we extend the Z-number to a generalized form which is more in line with human expression habits. Afterwards, we build a bridge between the knowledge of Z-numbers and the DS evidence theory to integrate Z-valuations. The identification framework in the DS theory is used to describe the generalized Z-numbers to avoid ambiguity. Then, the knowledge of Z-numbers is used to derive the basic probability assignment of evidence, and the synthesis rules in the DS theory are used to integrate evaluations. An illustrative example of medicine selection for the patients with mild symptoms of the COVID-19 is provided to show the effectiveness of the proposed method.

For each decision-making problem, there is more or less uncertainty, including hesitation, incompleteness or imprecision. These uncertainties are hard to depict with crisp numbers. Initially, Zadeh ( ) proposed the fuzzy set theory to deal with uncertain problems.
Later, scholars put forward the concepts of intuitionistic fuzzy sets, hesitant fuzzy sets, probabilistic linguistic term sets and their extended forms to deal with uncertain problems (Liao, Xu, Herrera-Viedma, & Herrera, ). However, none of these methods except the Z-number (Zadeh, ) can reflect the reliability of information. A Z-number is an ordered pair of fuzzy numbers (A, B), where A represents the fuzzy restriction of information and B represents the probability measure of the reliability of A. Since Zadeh put forward the concept of Z-numbers, there has been an endless stream of research on Z-numbers. For example, Aliev and Alizadeh ( ) combined possibilistic and probabilistic distributions to propose a calculation model of discrete Z-numbers. After that, Aliev and Huseynov ( ) continued the research on discrete Z-numbers and proposed the arithmetic operations of continuous Z-numbers. Li and Deng ( ) proposed a new uncertainty measure of Z-numbers by estimating the potential probability distribution with a maximum entropy method. Yaakob and Gegov ( ) extended the TOPSIS approach to deal with group decision-making problems with Z-numbers. Wang and Cao ( ) calculated the score and accuracy functions of linguistic Z-numbers based on a distance measure and then ranked Z-numbers with an extended TODIM method. Shen and Wang ( ) proposed a distance measure considering the randomness and fuzziness of Z-numbers and then improved the classical VIKOR method. In traditional Z-numbers, the restriction A and the reliability B are expressed as single linguistic terms, for instance, "the air quality is good, very certain". To represent human cognition more flexibly, in this study, we use hesitant fuzzy linguistic term sets (HFLTSs) (Liao, Xu, Zeng, & Merigó, ; Rodríguez, Martínez, & Herrera, ) rather than single linguistic terms to describe the restriction A and the reliability B of a Z-number, and call this extended form a generalized Z-number.
for example, "the air quality is good or above, not very certain". in this case, if there are three degrees of certainty such as "generally certain", "certain", "very certain", then "not very certain" means "certain" or "generally certain". compared with the classical z-number, the description of the generalized z-number is not only more consistent with human cognition, but also more able to reflect the uncertainty in decision-making problems. for a z-number z = (a, b), implicitly, the z-valuation states that the probability that x is a, obtained by integrating u_a(x)·p_x(x), is b. in addition, after generalizing z-numbers with hfltss, both the restriction a and reliability b may be expressed as a collection of linguistic terms. in this regard, we substitute the membership function with the utility function in the ds theory and generalize the probability distribution to the basic probability assignment (bpa). in other words, in this paper, u_a(x) represents the utility function of a and p_x represents the bpa function of a. then we can integrate the evaluation according to the synthesis rules of the ds theory. to sum up, this study is dedicated to achieving the following innovative contributions: ( ) the elements of z-numbers are generalized from classical fuzzy numbers or linguistic terms to hfltss and the generalized z-number is introduced. in this way, we can enrich the representation form for uncertain information. ( ) the identification framework in the ds theory is used to describe the generalized z-numbers to avoid ambiguity. ( ) the knowledge of z-numbers is used to derive the bpa of evidence, and the synthetic rules of the ds theory are used to integrate z-numbers. since the beginning of , novel coronavirus outbreaks have occurred in various parts of the world. although the state and government have invested a lot of money and time in the treatment of novel coronavirus, there is still no recognized effective medicine available for public use.
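The representation described above can be sketched in code. This is a minimal illustration, not the authors' implementation: both linguistic scales, all names, and the choice of storing an HFLTS as an index range over consecutive terms are our assumptions.

```python
from dataclasses import dataclass

# Hypothetical linguistic scale for the restriction part (indices 0..6).
RESTRICTION_SCALE = ["very bad", "bad", "slightly bad", "medium",
                     "slightly good", "good", "very good"]
# Hypothetical certainty scale for the reliability part (indices 0..2).
RELIABILITY_SCALE = ["generally certain", "certain", "very certain"]


@dataclass(frozen=True)
class GeneralizedZNumber:
    """A generalized z-number (A, B): both components are hesitant fuzzy
    linguistic term sets, i.e. sets of *consecutive* linguistic terms,
    stored here as inclusive (low, high) index ranges into the scales."""
    restriction: tuple  # (low, high) indices into RESTRICTION_SCALE
    reliability: tuple  # (low, high) indices into RELIABILITY_SCALE

    def restriction_terms(self):
        lo, hi = self.restriction
        return RESTRICTION_SCALE[lo:hi + 1]

    def reliability_terms(self):
        lo, hi = self.reliability
        return RELIABILITY_SCALE[lo:hi + 1]


# "The air quality is good or above, not very certain":
# restriction = {good, very good}; with three certainty grades,
# "not very certain" = {generally certain, certain}.
z = GeneralizedZNumber(restriction=(5, 6), reliability=(0, 1))
print(z.restriction_terms())  # ['good', 'very good']
print(z.reliability_terms())  # ['generally certain', 'certain']
```

The index-range encoding enforces the consecutiveness requirement of an HFLTS by construction.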
in this paper, a case study concerning the medicine selection for coronavirus patients is given to validate the information fusion method with generalized z-numbers. the paper is organized as follows: the basic concepts are reviewed in section . then, how to generate the bpa of evidence is discussed in section . a multi-criterion decision-making method based on the ds theory and generalized z-numbers is proposed in section . in section , an illustrative example of the medicine selection for coronavirus patients is provided to show the effectiveness of the proposed method. the paper ends with concluding remarks in section . in this section, some concepts involved are introduced. to express uncertain evaluation information, zadeh ( ) proposed the fuzzy set theory. since there is not only fuzzy uncertainty but also hesitation uncertainty in human cognition, rodríguez et al. ( ) proposed the hflts to represent the hesitancy of experts with a set of possible linguistic terms. later, liao et al. ( ) mathematically redefined the hflts and introduced the hesitant fuzzy linguistic elements. the hflts is effective to describe both simple and complex linguistic evaluations. therefore, it has aroused great interest of many scholars (liao et al., ) and many studies on hfltss have been developed, such as the distance and similarity measures (liao, xu, & zeng, ) , correlation measures (liao, gou, & xu, ) , score function (liao et al., ) and aggregation in group decision making (rodríguez, martínez, & herrera, ) . definition ((liao et al., ; rodríguez et al., )). let s be a linguistic term set.
an hflts is an ordered finite subset of the consecutive linguistic terms of s. for human expressions to be highly consistent with hesitant fuzzy linguistic elements, rodríguez et al. ( ) proposed a translation function e_gh, which can handle most types of linguistic expressions. a z-number (zadeh, ) is an ordered pair composed of two fuzzy numbers, expressed as z = (a, b). the first component, a, represents a constraint on the value of an uncertain variable x_a. the second component, b, represents the probability measure of the reliability of the first component, a. in particular, if the variable x_a is a random variable for a, a further definition of a z-number can be given (zadeh, ). a simple z-number is shown in fig. . due to the richness of information contained in z-numbers, a large number of studies on z-numbers have been carried out. shen and wang (shen & wang, ) proposed the z-vikor method based on the distance measure of z-numbers for selecting regional circular economy development plans. kang ( ) applied z-numbers to the environmental assessment problem based on the ds evidence theory under uncertainty. aboutorab et al. ( ) proposed an integration of z-numbers with the best-worst method and applied it to supplier development. peng and wang ( ) proposed an outranking decision-making method with z-numbers combining electre iii and qualiflex for job-satisfaction evaluation. yang and wang ( ) combined the stochastic multi-criteria acceptability analysis (smaa) with z-numbers to deal with decision-aiding problems. to generate z-numbers, kang, deng, and hewage ( ) used the maximum entropy based on the ordered weight averaging (owa) operator inspired by yager ( ). the ds theory, also called the ds evidence theory, was first proposed by dempster ( ) and then developed by shafer ( ) to handle uncertain information.
it generalizes the traditional bayesian reasoning approach by allowing weaker conditions. the ds evidence theory has been widely used in various fields. yang and xu ( ) proposed the evidential reasoning rule to deal with mcdm problems. chen and deng ( ) combined the ahp and ds evidence theory to evaluate sustainable transport solutions. jiroušek and shenoy ( ) introduced a new entropy of belief functions in the ds theory to measure the total uncertainty of bpa. zhang and deng ( ) and dong, zhang, and li ( ) used the ds evidence theory to analyze fault diagnosis problems in uncertain environments. yuan and luo ( ) proposed a novel intuitionistic fuzzy entropy to determine the weights of evidence. fang and liao ( ) proposed a generalized probabilistic linguistic evidential reasoning approach that reassigns the remaining belief degree to an envelopment of focal elements. fu and chang ( ) proposed a group satisfaction concept to analyze multiple criteria group decision making problems with the evidential reasoning approach. ng and law ( ) used evidential reasoning to analyze sentiment words in social networks to investigate consumer preferences. let Θ be a set of mutually exclusive and collectively exhaustive events. we call the set Θ a frame of discernment and denote p(Θ) as the power set composed of the subsets of Θ. the mass function m(·) is a mapping from p(Θ) to the interval [0, 1] satisfying two conditions: m(∅) = 0, and the masses of all subsets of Θ sum to 1. the elements of p(Θ), i.e., the subsets of Θ, are called propositions. in other words, if a ∈ p(Θ), then a is a proposition. the mass function is also called the bpa or belief function, which represents how strongly the evidence supports a. if m(a) > 0, then a is called a focal element.
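The two conditions on the mass function can be checked mechanically. A small sketch (the function names are ours, not the paper's):

```python
from itertools import chain, combinations


def powerset(frame):
    """All subsets of the frame of discernment, as frozensets."""
    s = list(frame)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]


def is_valid_bpa(m, frame, tol=1e-9):
    """Check the two mass-function conditions: m(empty set) = 0 and the
    masses over all subsets of the frame sum to 1."""
    if abs(m.get(frozenset(), 0.0)) > tol:
        return False
    total = sum(m.get(a, 0.0) for a in powerset(frame))
    return abs(total - 1.0) < tol


frame = {"good", "very good"}
# Two focal elements: the singleton {good} and the whole frame.
m = {frozenset({"good"}): 0.6, frozenset({"good", "very good"}): 0.4}
print(is_valid_bpa(m, frame))  # True
```

A dictionary keyed by frozensets is a natural encoding: only focal elements need to be stored, and every other subset implicitly gets mass 0.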
in the ds theory, for two evidences m_1 and m_2, the synthesis rule is denoted as m = m_1 ⊕ m_2, with dempster's rule of combination accumulating the products of masses over non-empty intersections and renormalizing by the non-conflicting mass. to avoid the conflict between evidences, a discount coefficient could be introduced. a discounting coefficient in [0, 1] represents the weight (reliability) of the evidence. then, the updated evidence m is represented as follows: in eq. ( ), the mass assigned to the whole frame can be divided into two parts. then, the belief degree of a focal element can be obtained, and according to the utility theory (yang & xu, ) , we can get the upper and lower bounds of the utility of alternative x_i, and then rank the alternatives. in a classical z-number z = (a, b), the constraint a and reliability b are represented by single linguistic terms. when the constraint a and reliability b become complicated and are expressed in more than one linguistic term, we can use hfltss to represent the hesitancy information of experts. to make the meaning of the expression unambiguous, we use the identification framework of the ds theory to represent the constraint a and reliability b in z-numbers. there are two problems to be solved. the evaluation grades in the identification framework are indeed not independent. in this regard, we introduce the shapley value to assign a reasonable utility value to each evaluation grade. additionally, it is also an important problem to translate the evaluations of reliability given by experts into crisp values. to this point, the score function of hfltss can be used to handle this problem. a fuzzy measure µ is a set function on the set h (shapley & shubik, ); µ({h_j}) represents the weight or utility of the element h_j. if the elements in h were independent, µ would be additive; in fact, the elements in the set h are usually not independent. to determine the expected marginal contribution of a particular element to the set h, the shapley value can be introduced.
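Dempster's combination rule and the discounting step can be sketched as follows. This is a generic textbook implementation, not the paper's code, and the two input evidences and the discount value are made up for illustration.

```python
def dempster_combine(m1, m2):
    """Dempster's rule m = m1 ⊕ m2: products of masses are accumulated on
    non-empty intersections and renormalized by 1 - K, where K is the mass
    assigned to conflicting (empty-intersection) pairs."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


def discount(m, frame, alpha):
    """Shafer discounting with coefficient alpha in [0, 1]: scale each focal
    mass by alpha and move the remaining 1 - alpha onto the whole frame."""
    out = {a: alpha * v for a, v in m.items()}
    theta = frozenset(frame)
    out[theta] = out.get(theta, 0.0) + (1.0 - alpha)
    return out


A, B = frozenset({"a"}), frozenset({"b"})
AB = frozenset({"a", "b"})
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.7, AB: 0.3}
m = dempster_combine(m1, m2)
# conflict K = 0.6 * 0.7 = 0.42, so the normalizer is 0.58
print(round(m[A], 4), round(m[B], 4), round(m[AB], 4))  # 0.3103 0.4828 0.2069
```

Discounting before combining is exactly how the reliability-derived coefficients of the paper would enter the fusion step.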
the shapley index for each h_j ∈ h is defined as (shapley & shubik, ): φ_j = Σ_{s ⊆ h∖{h_j}} [|s|! (n − |s| − 1)! / n!] · [µ(s ∪ {h_j}) − µ(s)]. the shapley value φ_j can be interpreted as a kind of average value of the marginal contribution of a single element in all possible coalitions. in the ds theory, the elements of the identification framework are mutually exclusive and exhaustive but not necessarily independent. in yang and xu ( )'s evidential reasoning approach, all evaluation grades are considered to be independent. this is not realistic. for example, "good" and "very good" are actually not independent. suppose that the utility of "good" is . and the utility of "very good" is . ; according to yang and xu ( )'s approach, the utility of "at least good" is equal to . . however, if the decision-maker has a certain preference, the result will be greater than or less than . . the shapley value can be used to handle this problem well when evaluation grades are considered to be non-independent. in fact, the choquet integral is a more widely used tool for non-additive fuzzy measures. however, the choquet integral is usually used to model the interactions of weighted criteria, whereas this study is intended to analyze non-mutually-exclusive evaluation grades. so, while the choquet integral is the more general tool, the shapley value is the more appropriate one here. although experts give the evaluation of reliability b, it is not easy to quantify, especially if it is expressed in more than one linguistic term. gou and xu ( ) proposed two transformation functions to explain the meanings of hesitant fuzzy linguistic term sets. further, liao et al. ( ) proposed a score function of hfltss based on the hesitancy degree and linguistic scale function. this score function mainly considers the positions of linguistic terms and the number of linguistic terms. it can be used to measure the reliability b expressed by an hflts reasonably.
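The Shapley index described here can be computed directly from its standard definition. A sketch with a hypothetical two-grade measure in which "good" and "very good" overlap (all numeric values are invented for illustration):

```python
from itertools import combinations
from math import factorial


def shapley_values(elements, mu):
    """Shapley value of each element for a set function mu (a fuzzy measure),
    given as a dict mapping frozensets to numbers, with mu(empty set) = 0."""
    n = len(elements)
    phi = {}
    for j in elements:
        rest = [e for e in elements if e != j]
        total = 0.0
        for r in range(n):
            for s in combinations(rest, r):
                s = frozenset(s)
                # Weight |S|! (n - |S| - 1)! / n! times the marginal
                # contribution of j to coalition S.
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += w * (mu[s | {j}] - mu[s])
        phi[j] = total
    return phi


# Sub-additive example: the grades overlap, so
# mu({good, very good}) < mu({good}) + mu({very good}).
mu = {
    frozenset(): 0.0,
    frozenset({"good"}): 0.6,
    frozenset({"very good"}): 0.8,
    frozenset({"good", "very good"}): 1.0,
}
phi = shapley_values(["good", "very good"], mu)
print(round(phi["good"], 2), round(phi["very good"], 2))  # 0.4 0.6
```

Note the efficiency property: the Shapley values sum to µ of the whole set, which is what makes them usable as redistributed utilities for non-independent grades.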
it can be defined accordingly, where hd(h_s) is the hesitancy degree of h_s and g(s_l) is the semantic value of the linguistic term s_l. in addition, how to deal with the conflict between evidences is a long-standing research topic in the ds theory. one of the common methods is to assign different discount coefficients to different pieces of evidence. in recent studies (dong et al., ; murofushi & sugeno, ; zhang & deng, ) , the discount coefficient was obtained by measuring the distance between evidences. the smaller the total distance between one evidence and the others is, the more reliable the evidence is, and the larger the discount coefficient should be. when we use z-numbers to give evaluations, the reliability is measurable. given that a piece of evidence may be made up of several z-numbers, the discount coefficient can be given by the reference score function as follows: where hd(e_ij) is the hesitancy degree of a piece of evidence e_ij, e(h_sl) is the score of the reliability b in a z-number, and n is the number of z-numbers in a piece of evidence. according to the definition of a z-number, in this paper we substitute the membership function with the utility function u(x). it is easy to understand why we derive a basic probability assignment rather than a probability: after we extend z-numbers with hfltss, the focal element of the z-evaluation information is likely to be made up of an envelope of evaluation grades rather than a single linguistic term. considering that there are multiple z-evaluations due to the hesitancy of decision makers, it is necessary to normalize m(h_j). next, we can integrate the z-evaluation with m(h_j) according to the ds evidence synthesis rules. suppose that there is an mcdm problem with i alternatives. according to the analysis in the previous section, we propose an mcdm method based on the ds theory and generalized z-numbers.
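The score and discount-coefficient formulas themselves are not legible in the text above, so the following sketch substitutes simple stand-ins (clearly labeled in the comments) just to show the shape of the computation: a reliability score per z-number, averaged into a per-evidence discount coefficient. It should not be read as the authors' exact functions.

```python
# Assumed stand-ins, not the paper's formulas: the semantic of a term is its
# normalized index on the scale, the hesitancy degree is the fraction of the
# scale the HFLTS spans, and the score is the mean semantic damped by hesitancy.


def hesitancy_degree(term_indices, scale_size):
    """Assumed hesitancy degree: (number of terms - 1) / (scale size - 1)."""
    return (len(term_indices) - 1) / (scale_size - 1)


def semantic(index, scale_size):
    """Assumed linguistic scale function: normalized position on the scale."""
    return index / (scale_size - 1)


def reliability_score(term_indices, scale_size):
    """Assumed score of a reliability HFLTS: mean semantic value, damped by
    the hesitancy degree (more hesitation -> lower score)."""
    mean_sem = sum(semantic(i, scale_size) for i in term_indices) / len(term_indices)
    return mean_sem * (1.0 - hesitancy_degree(term_indices, scale_size))


def discount_coefficient(evidence, scale_size):
    """Discount coefficient of a piece of evidence: the average reliability
    score of the z-numbers it contains (to be normalized across criteria)."""
    scores = [reliability_score(b, scale_size) for b in evidence]
    return sum(scores) / len(scores)


# Three certainty grades; one z-number rated {very certain} (index 2),
# one rated {generally certain, certain} (indices 0 and 1).
evidence = [[2], [0, 1]]
coeff = discount_coefficient(evidence, scale_size=3)
print(coeff)  # 0.5625
```

Whatever the exact functions, the qualitative behavior shown here matches the text: a single high-certainty term scores high, while a hesitant multi-term reliability is penalized.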
to facilitate the application, we summarize the algorithm of this method below. the flow chart of this algorithm is illustrated in fig. . step . (generate decision matrices in the form of z-numbers) e_ij is the evaluation of the j-th criterion for the i-th alternative; all those evidences form an i × j decision matrix d. step . (revise the utility values of the evaluation grades by the shapley value) according to the preference of experts, the utility value of the evaluation grade h_j and of all the subsets of h are given. by eq. ( ), the utility value of the evaluation grade h_j is revised by the corresponding shapley value. step . (generate the bpa) by eqs. ( )-( ), the score of the reliability of evidence can be calculated. then, the bpa m(h_j) can be calculated by eq. ( ). because there may be more than one evaluation in one piece of evidence, sometimes we need to normalize the bpa. step . (generate the discount coefficient) in the previous step, we generated the score of the reliability. in this step, we still use this method to calculate the comprehensive reliability of a piece of evidence as a discount coefficient by eq. ( ) . again, the discount coefficient needs to be normalized by eq. ( ). step . (integrate all evaluations by the ds theory) after obtaining the bpa for each piece of evidence, the decision matrix can be integrated by the evidential reasoning method (yang & xu, ) . step . (rank the alternatives) we finally rank the alternatives according to their utilities. as shown in the steps above, our approach is simple and easy to understand. the highlight is that we consider the reliability of information when constructing the z-evaluation decision matrices. in addition, when calculating the utility values of evaluation levels, we introduce the shapley value to deal with the non-exclusive relationships between evaluation levels. all of these are shortcomings of the traditional evidential reasoning approach.
these illustrate that our approach is more suitable in dealing with uncertain decision problems. in this section, an illustrative example concerning the medicine selection for the patients with mild symptoms of the covid- is given to show the effectiveness of the proposed method. in december , a new type of coronavirus pneumonia broke out in wuhan, china, which wreaked havoc across the country and seriously threatened human health, drawing worldwide attention. the covid- is a b type infectious disease similar to the sars coronavirus in , causing severe symptoms of pneumonia. coronaviruses belong to the coronaviridae family in the nidovirales order. corona represents crown-like spikes on the outer surface of the virus; thus, it was named as a coronavirus (shereen & khan, ) . the virus is transmitted through droplets, close contact and other means, and patients in the incubation period could potentially transmit the virus to other persons. according to current observations, the new coronavirus is weaker than sars in pathogenesis, but has stronger transmission competence (tian, ) . according to the data from the official website of national health commission of the people's republic of china (http://www.nhc. gov.cn/), there was a clear trend of the epidemic spreading after january , . by april , , china had a total of deaths, confirmed cases and , cured cases. since mid-february, the number of confirmed cases has been on a downward trend. in addition, the number of people cured has been increased and the number of deaths was small and stable. the spread is illustrated in fig. . unfortunately, no drugs or vaccines have been approved for the treatment of human coronaviruses. several approaches for the control or prevention of the covid- infections can be envisaged, including vaccines, monoclonal antibodies, oligonucleotide-based therapies, peptides, interferon therapies, and small-molecule drugs (li & clercq, ) . 
however, in terms of antiviral medicine, there is still no specific medicine. ribavirin combined with interferon is still recommended for the diagnosis and treatment of the covid- in china due to its effectiveness in treating middle east respiratory syndrome (mers). lopinavir is one kind of protease inhibitor used to treat hiv infection, with ritonavir as a booster. lopinavir/ritonavir have anti-coronavirus activity in vitro. besides, remdesivir may be the best potential medicine for the treatment of the covid- . as a drug undergoing a clinical trial, remdesivir has been shown to have a high proofreading ability, and mutated drug resistance can effectively reduce virulence (liu & morse, ) . in addition, chinese medicine, such as lianhuaqingwen capsules, has also played an important role in the prevention and treatment of new respiratory infectious diseases (lu, ) . however, it is worth noting that there are no specific antiviral medicines or vaccines for the covid- . all medicines need to be further confirmed in clinical trials. the most detailed breakdown of symptoms of the disease comes from a recent world health organization analysis of more than , confirmed cases in china (world health organization, ). according to the statistics, here are the most common symptoms and the percentage of people who had them: fever: %, dry cough: %, fatigue: %, coughing up sputum, or thick phlegm, from the lungs: %, shortness of breath: %, bone or joint pain: %, and the rest are small enough to ignore (world health organization, ). in this study, we select four medicines as candidates for patients of the covid- . they are ribavirin, lopinavir/ritonavir, remdesivir and lianhuaqingwen capsules. medicines should be selected not only for their effect on symptoms, but also for their antiviral activity and their possible side effects. thus, we select four indicators, antiviral activity, coolify, ease breathing and side effect, as criteria for expert evaluation.
to find the best medicine for the patients with mild symptoms of the covid- , we suppose that the evaluation grades of constraint a is predetermined as = and that of reliability b is predetermined as very certain vc ( ), = r extremely certain ec ( )}. according to the experience and opinions of experts, the performances of the four medicine on four criteria are given in table . below we use our proposed decision-making method to help the patients with mild symptoms of the covid- to select the most suitable medicine. step . (generate decision matrices in the form of z-numbers) we can translate these initial linguistic expressions into generalized znumbers. the decision matrix is obtained as table . step the score of the reliability for the rest evidences can be calculated as: then, by eq. ( ) and ( ), the bpa can be obtained by dividing the score of the reliability by the utilities of evaluation grades. thus, the new decision matrix is translated to table . step . (generate the discount coefficients) by eq. ( ), the discount coefficient of each criterion can be calculated as: . step . (integrate all evaluations by the ds theory) by eqs. ( )-( ), the decision matrix can be integrated with the synthesis rules of the ds theory. the results are listed in table . step . (rank the alternatives) by eqs. ( )-( ), based on the utility values and the belief degrees obtained in previous steps, we can calculate u x ( ) i for each alternative: , that is, the most effective medicine for patients with mild symptoms of the covid- is remdesivir. the purpose is to use the existing evaluation information to generate z-numbers. by contrast, in this paper, we use the reliability of information to integrate information indirectly instead of directly relying on the probability distribution, so as to sort and select the best alternative. obviously, our method provides a solution to uncertain mcdm problem with reliability. 
on the other hand, the membership function in the form of a triangular fuzzy number used by liu et al. ( ) is too simple to represent complex evaluation information. in addition, transforming the reliability part of z-numbers into triangular fuzzy numbers will lead to the distortion and loss of the original z-evaluation information. in this respect, we generalize the elements of z-numbers from classical fuzzy numbers or linguistic terms to hfltss, which enriches the representation form of uncertain information. in addition, the utility function is used to replace the membership function, which not only conforms well to the knowledge of the ds evidence theory, but also considers the mutual influence between evaluation levels. in general, the method we proposed makes full use of the original definition of z-numbers, and it is easier to integrate z-evaluation information by using the evidence aggregation rules of the ds theory than to convert z-numbers into classical fuzzy numbers. due to the destructive power of the novel coronavirus, there are no effective drugs that can be used in clinical trials to protect against and treat symptoms caused by the coronavirus. in this case, unlike conventional drug selection issues, the expert's opinion is especially important for medicine options for the patients with mild symptoms of the covid- . the generalized z-number evaluation model proposed in this paper fully considers the reliability of information when evaluating information under multiple criteria, so it fits well with this drug selection problem. [table: the new decision matrix regarding four medicines] although many theories are devoted to describing uncertain phenomena including hesitation, incompleteness or imprecision for mcdm problems, none of these except the z-number can well describe the reliability of information. the traditional z-number is represented by a single linguistic term or uncertain linguistic terms.
in this paper, we generalized the form of z-numbers from fuzzy sets to the hesitant fuzzy linguistic environment. because the expression form of the generalized z-numbers is more complex, to avoid ambiguity, the generalized z-numbers were described by the identification framework of the ds theory. then, the knowledge of z-numbers was used to derive the bpa of evidence, and the synthetic rules in the ds theory were used to integrate evaluations. in this process, considering that the evaluation grades may not be independent of each other, the shapley value was introduced to determine the utility values of evaluation grades. finally, the effectiveness of this proposed method was illustrated by the selection of medicine for patients with mild symptoms of the covid- . in this study, we used the shapley value to deal with non-independent evaluation grades. in fact, the d-number, as an extension of the ds theory, is based on the premise that evaluation levels in the identification framework are not mutually exclusive. so, we are going to investigate generalized z-numbers based on the knowledge of d-numbers in the future. on the other hand, although we considered that evaluation levels are not independent of each other, the dependency between criteria was not considered. in future research, we will try to introduce the interactions of criteria into this process and obtain a more complete decision-making method. zongyuan ren: conceptualization, formal analysis, writing - original draft. huchang liao: supervision, writing - original draft. yuxi liu: data curation.
zbwm: the z-numbers extension of best worst method and its application for supplier development
the arithmetic of discrete z-numbers
the arithmetic of continuous z-numbers
a modified method for evaluating sustainable transport solutions based on ahp and dempster-shafer evidence theory
upper and lower probabilities induced by a multivalued mapping
combination of evidential sensor reports with distance function and belief entropy in fault diagnosis
generalized probabilistic linguistic evidential reasoning approach for multi-criteria decision-making under uncertainty
multiple criteria group decision making based on group satisfaction
novel basic operational laws for linguistic terms, hesitant fuzzy linguistic term sets and probabilistic linguistic term sets
a new definition of entropy of belief functions in the dempster-shafer theory
decision making using z-numbers under uncertain environment
stable strategies analysis based on the utility of z-numbers in the evolutionary games
environmental assessment under uncertainty using dempster-shafer theory and z-numbers
generating z-numbers based on owa weights using maximum entropy
therapeutic options for the novel coronavirus ( -ncov)
a new uncertainty measure of discrete z-numbers
hesitancy degree-based correlation measures for hesitant fuzzy linguistic term sets and their applications in multiple criteria decision making
score-hedlisf: a score function of hesitant fuzzy linguistic term set based on hesitant degrees and linguistic scale functions: an application to unbalanced hesitant fuzzy linguistic multimoora. information fusion
hesitant fuzzy linguistic term set and its application in decision making: a state-of-the-art survey
distance and similarity measures for hesitant fuzzy linguistic term sets and their application in multi-criteria decision making
qualitative decision making with correlation coefficients of hesitant fuzzy linguistic term sets. knowledge-based systems
learning from the past: possible urgent prevention and treatment options for severe acute respiratory infections caused by -ncov
derive knowledge of z-numbers from the perspective of dempster-shafer evidence theory
drug treatment options for the -new coronavirus ( -ncov)
an interpretation of fuzzy measures and the choquet integral as an integral with respect to a fuzzy measure
investigating consumer preferences on product designs by analyzing opinions from social networks using evidential reasoning
outranking decision-making method with z-numbers cognitive information
hesitant fuzzy linguistic term sets for decision making
a group decision making model dealing with comparative linguistic expressions based on hesitant fuzzy linguistic term sets
a mathematical theory of evidence
solutions of n-person games with ordinal utilities
z-vikor method based on a new comprehensive weighted distance measure of z-numbers and its application
covid- infection: origin, transmission, and characteristics of human coronaviruses
-ncov new challenges from coronavirus
multi-criteria decision-making method based on distance measure and choquet integral for linguistic z-numbers
report of the who-china joint mission on coronavirus disease (covid- )
interactive topsis based group decision making methodology using z-numbers
on ordered weighted averaging aggregation operators in multi-criteria decision making
smaa-based model for decision aiding using regret theory in discrete z-numbers context
evidential reasoning rule for evidence combination
approach for multi-attribute decision making based on novel intuitionistic fuzzy entropy and evidential reasoning
fuzzy sets. information and control
a note on z-numbers
engine fault diagnosis based on sensor data fusion considering information quality and evidence theory
the work was supported by the national natural science foundation of china ( , ). by eq.
( ), the shapley values for criteria can be calculated as follows: ( ) when k is , we have, thus, the shapley value of the evaluation grade can be obtained; then, we can get the shapley values for the rest of the evaluation grades. step . (generate the bpa) first, we calculate the score of the reliability of evidence. by eqs. ( )-( ), the score of the reliability can be calculated. [fig.: the spread of covid- in china (from january to april , )] [table: the evaluation matrix regarding four medicines] key: cord- -bjx td authors: vanhems, philippe; barrat, alain; cattuto, ciro; pinton, jean-françois; khanafer, nagham; régis, corinne; kim, byeul-a; comte, brigitte; voirin, nicolas title: estimating potential infection transmission routes in hospital wards using wearable proximity sensors date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: bjx td
% of the contacts occurred between pairs of hcws and hcws accounted for % of all the contacts including at least one patient, suggesting a population of individuals who could potentially act as super-spreaders. conclusions: wearable sensors represent a novel tool for the measurement of contact patterns in hospitals. the collected data can provide information on important aspects that impact the spreading patterns of infectious diseases, such as the strong heterogeneity of contact numbers and durations across individuals, the variability in the number of contacts during a day, and the fraction of repeated contacts across days. this variability is however associated with a marked statistical stability of contact and mixing patterns across days. our results highlight the need for such measurement efforts in order to correctly inform mathematical models of hais and use them to inform the design and evaluation of prevention strategies. the control of hospital-acquired infections (hai) is largely based on preventive procedures derived from the best available knowledge of potential transmission routes. the accurate description of contact patterns between individuals is crucial to this end, as it can help to understand the possible transmission dynamics and the design principles for appropriate control measures. in particular, the mutual exposures between patients and health-care workers (hcws) have been documented for bacterial and viral transmission since decades [ , , ] . transmission might be the result of effective contact, as in the cases of s. aureus [ , ] , k. pneumoniae [ ] or rotavirus [ ] , of exposure to contaminated aerosols, as for m. tuberculosis [ ] , or the result of exposure to droplets, as for influenza [ ] . some pathogens such as influenza can also be transmitted by different routes. 
although close-range proximity and contacts between individuals are strong determinants for potential transmissions, obtaining reliable data on these behaviors remains a challenge [ ] . data on contacts between individuals in specific settings or in the general population are most often obtained from diaries and surveys [ , , , ] and from time-use records [ ] . these approaches provide essential information to describe contacts patterns and inform models of infectious disease spread. the gathered data, however, often lack the longitudinal dimension [ , , ] and the high spatial and temporal resolution needed to accurately characterize the interactions among individuals in specific environments such as hospitals. moreover, they are subject to potential biases due to behavioral modifications due to the presence of observers, to short periods of observation, and especially to missing information and recall biases. evaluating biases and understanding the accuracy of the collected data is therefore a difficult task [ ] . in this context, the use of electronic devices has recently emerged as an interesting complement to more traditional methods [ ] . in particular, wearable sensors based on active radio-frequency identification (rfid) technology have been used to measure face-to-face proximity relations between individuals with a high spatio-temporal resolution in various contexts [ ] that include social gatherings [ , ] , schools [ , ] and hospitals [ , ] . the amount of available data, however, is still very limited, high-resolution contact data relevant for the epidemiology of infectious diseases are scarce, and the longitudinal aspects of contact patterns have not been investigated in detail, prompting further investigation. in this paper we report on the use of wearable proximity sensors [ ] to measure the numbers and durations of contacts between individuals in an acute care geriatric unit of a university hospital. 
we investigate the variability of contact patterns as a function of time, as well as the differences in contact patterns between individuals with different roles in the ward. we document the presence of individuals with a high number of contacts, who could be considered as potential super-spreaders of infections. some implications of our results regarding prevention and control of hospital-acquired infections are discussed. the measurement system, developed by the sociopatterns collaboration [ ] , is based on small active rfid devices (''tags'') that are embedded in unobtrusive wearable badges and exchange ultra-low-power radio packets [ , , , ] . the power level is tuned so that devices can exchange packets only when located within - . meters of one another, i.e., packet exchange is used as a proxy for distance (the tags do not directly measure distances). individuals were asked to wear the devices on their chests using lanyards, ensuring that the rfid devices of two individuals can only exchange radio packets when the persons are facing each other, as the human body acts as an rf shield at the frequency used for communication. in summary, the system is tuned so that it detects and records close-range encounters during which a communicable disease infection could be transmitted, for example, by cough, sneeze or hand contact. the information on face-to-face proximity events detected by the wearable sensors is relayed to radio receivers installed throughout the hospital ward (bedrooms, offices and hall). the system was tuned so that whenever two individuals wearing the rfid tags were in face-to-face proximity, the probability of detecting such a proximity event over a time interval of seconds was larger than %. we therefore define two individuals to be in ''contact'' during a -second interval if and only if their sensors exchanged at least one packet during that interval.
a contact is therefore symmetric by definition, and in case of contacts involving three or more individuals in the same -second interval, all the contact pairs were considered. after the contact is established, it is considered ongoing as long as the devices continue to exchange at least one packet in every subsequent -second interval. conversely, a contact is considered broken if a -second interval elapses with no exchange of packets. we emphasize that this is an operational definition of the human proximity behavior that we choose to quantify, and that all the results we present correspond to this precise and specific definition of ''contact''. we make the raw data we collected available to the public as datasets s -s in file s and on the website of the sociopatterns collaboration (www.sociopatterns.org). data were collected in a short-stay geriatric unit ( beds) of a university hospital of almost beds [ ] in lyon, france, from monday, december , at : pm to friday, december , at : pm. during that time, professional staff worked in the unit and patients were admitted. we collected data on the contacts between staff members ( % participation rate) and patients ( % participation rate). the participating staff members were nurses or nurses' aides, medical doctors and administrative staff. in the ward, all rooms but were single-bed rooms. each day teams of nurses and nurses' aides worked in the ward: one of the teams was present from : am to : pm and the other from : pm to : pm. an additional nurse and nurse's aide were moreover present from : am to : pm. two nurses were present during the nights from : pm to : am. in addition, a physiotherapist and a nutritionist were present each day at various points in time, with no fixed schedule, and a social counselor and a physical therapist visited on demand (in our analysis they are considered as nurses). two physicians and interns were present from : am to : pm each day.
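the operational definition of ''contact'' given above — detection within a fixed-length interval, ongoing while packets keep arriving in consecutive intervals, broken once an interval elapses with no packets — amounts to merging consecutive detection slots into contact events. a minimal sketch of that merging step (the slot-index input format and the function name are ours, not from the paper):

```python
def contact_events(slots):
    """Merge sorted detection-slot indices into contact events.

    `slots` lists the (integer) intervals in which a pair of tags
    exchanged at least one packet.  A contact stays ongoing across
    consecutive slots and is broken once a slot passes with no packet.
    Returns inclusive (start_slot, end_slot) pairs.
    Illustrative sketch; the interval length itself is set in the paper.
    """
    events = []
    start = prev = None
    for s in slots:
        if prev is None:
            start = prev = s
        elif s == prev + 1:          # packets kept arriving: contact ongoing
            prev = s
        else:                        # an empty slot elapsed: contact broken
            events.append((start, prev))
            start = prev = s
    if prev is not None:
        events.append((start, prev))
    return events
```

for example, detections in slots 3, 4, 5 and then 9, 10 yield two contact events, since the gap after slot 5 breaks the first contact.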
visits were allowed from : am to : pm but visitors were not included in the study. in advance of the study, staff members and patients were informed of the details and aims of the study. signed informed consent was obtained for each participating patient and staff member. all participants were given an rfid tag and asked to wear it properly at all times. no personal information was associated with the tag: only the professional category of each hcw and the age of the patients were collected. the study was approved by the french national bodies responsible for ethics and privacy, the ''commission nationale de l'informatique et des libertés'' (cnil, http://www.cnil.fr) and the ''comité de protection des personnes'' (http://www.cppsudest .com/) of the hospital. individuals were categorized into four classes according to their activity in the ward: patients (pat), medical doctors (physicians and interns, med), paramedical staff (nurses and nurses' aides, nur) and administrative staff (adm). med and nur professionals form a group named hcw. the contact patterns were analyzed using both the numbers and the durations of contacts between individuals. for each individual we measured the number of other distinct individuals with whom she/he had been in contact, as well as the total number of contact events she/he was involved in, and the total time spent in contact with other individuals. these quantities were aggregated for each class and for each pair of role classes in order to define contact matrices that describe the mixing patterns between classes of individuals. the longitudinal evolution of the contact patterns was studied by considering, in addition to the entire study duration, several shorter time intervals: we divided the study duration into daytime periods ( : am to : pm) and nights ( : pm to : am); daytime periods were divided into morning ( : am to : pm) and afternoon ( : pm to : pm) shifts, and we also considered data aggregated on a -hour timescale.
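the aggregation into class-pair contact matrices described above can be sketched as follows (the tuple-based input format and function name are assumptions for illustration; the role labels are those of the paper):

```python
from collections import defaultdict

def contact_matrices(contacts):
    """Aggregate contact events into class-pair matrices.

    `contacts` is an iterable of (role_i, role_j, duration_s) tuples,
    one per contact event, with roles in {"PAT", "MED", "NUR", "ADM"}.
    Returns two dicts keyed by the unordered role pair: the number of
    contacts and the cumulative duration in seconds.
    Illustrative sketch; input format is an assumption.
    """
    counts = defaultdict(int)
    durations = defaultdict(float)
    for role_i, role_j, duration_s in contacts:
        pair = tuple(sorted((role_i, role_j)))  # contacts are symmetric
        counts[pair] += 1
        durations[pair] += duration_s
    return counts, durations
```

normalizing each entry by the number of individuals (or of pairs) in the two classes then gives the per-capita matrices also reported in the paper.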
we finally considered the similarity of contact patterns between successive days, by measuring the fraction of contacts that were repeated from one day to the next, as such information is particularly relevant when modeling spreading phenomena [ , , ] . overall, , contacts occurred during the study, with a cumulative duration of , s (approx. , minutes or hours). , contacts ( . %) included at least one nur, , ( . %) included at least one med, and , ( . %) at least one patient. table reports the average number and duration of contacts of individuals in each class over the whole study duration. most contacts involve at least one nur and/or one med, and nurs and meds have on average the largest number of contacts, as well as the largest cumulative duration in contact. large standard deviations are however observed: the distributions of the contact durations and of the numbers and cumulative durations of contacts are broad, as also observed in many other contexts [ , , , ] . important variations are observed even within each role class. in particular, contacts of much larger duration than the average are observed with a non-negligible frequency. the total number of contacts between individuals belonging to specific classes is reported in table and the corresponding contact matrices are shown in figure . we report contact matrices giving the total numbers and cumulative durations of contacts between individuals of given classes, as well as contact matrices taking into account the different numbers of individuals in each class. contacts were most frequent between two nurs ( , contacts, %), followed by nur-pat contacts ( , contacts, %), and by contacts between two meds ( , contacts, %). very few contacts between pats or between members of the adm group were observed. as reported in table , among the , contacts detected, , ( . %) occurred during daytime, for a total duration of , s (approx. , min or h). contacts ( . %) occurred during nights (lasting , s, approx. min or h).
on average we recorded , contacts per morning, , per afternoon, and per night. the evolution of the number of contacts at the more detailed resolution of one-hour time windows is reported in figure . the number of contacts varied strongly over the course of a day, but the evolution was similar from one day to another (for day and day , contacts were recorded after : pm and before : pm respectively, see methods), with very few contacts at night and a maximum around - am. the number of contacts between individuals of specific classes also depended on the period of the day. contacts between nurs, and between nurs and pats, were predominant in the morning while contacts between meds remained similar between mornings and afternoons. overall, . % of contacts between nurs and pats occurred in the morning, . % in the afternoon and . % during the night. figure reports the contact matrices giving the numbers of contacts between individuals of specific classes for each morning, afternoon and night. the absolute numbers of contacts varied from one morning (resp., afternoon or night) to the next, but the mixing patterns remained very similar. differences were observed between morning, afternoon and night patterns. the main difference between morning and afternoon periods came from larger numbers of contacts involving nurs in the morning. at night, almost all contacts involved nurs and pats. although the aggregated observables reported above are very similar from day to day, the precise structure of the daily contact network varied strongly: the fraction of common neighbors of an individual between two different days is on average just %. this value is smaller than the one observed in a school [ ] , but much larger than the one measured for the attendees of a conference [ ] . the cumulative number and duration of contacts of each individual, as identified by his/her badge identification number, are reported in figure and table .
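the day-to-day similarity measure quoted above — the average fraction of an individual's contacts that are repeated on another day — can be sketched as follows (the dict-of-sets input format and function name are assumptions for illustration):

```python
def mean_neighbor_overlap(day_a, day_b):
    """Average fraction of an individual's day-a contacts repeated on day b.

    `day_a` and `day_b` map each individual to the set of distinct
    individuals they contacted on that day.
    Illustrative sketch; input format is an assumption.
    """
    fractions = []
    for person, neighbors in day_a.items():
        if neighbors:
            repeated = neighbors & day_b.get(person, set())
            fractions.append(len(repeated) / len(neighbors))
    return sum(fractions) / len(fractions) if fractions else 0.0
```

applied to the daily contact networks, a low value of this overlap indicates strong turnover in contact partners even when the aggregate mixing matrices stay stable.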
a small number of hcws accounted for most of the contacts observed between hcws and pats, both in terms of number and cumulative duration. for instance, nurs (representing % of all hcws) accounted for . % of the number of contacts and . % of the cumulative duration of the contacts with pats (number of contacts and cumulative duration of contacts of a given individual are strongly correlated, r = . ). the number of distinct individuals contacted by a given individual was also correlated with the total number of contacts of the same individual (r = . ). these hcws had a much larger number and duration of contacts than average, as shown in table . the objective of the present study was to describe in detail the contacts between individuals in a healthcare setting. such data can help to accurately inform computational models of the propagation of infectious diseases and, as a consequence, to improve the design and implementation of prevention or control measures based on the frequency and duration of contacts. numbers and duration of contacts were characterized for each class of individuals and for individuals belonging to given class pairs, yielding contact matrices that represent important inputs for realistic computational models of nosocomial infections. as also measured in other contexts [ , , , , ] , the numbers and durations of contacts display large variations even across individuals of the same class: the resulting distributions were broad, with no characteristic time scale. as a consequence, even though the average durations of contacts were rather short, contacts of much longer durations than average occurred with non-negligible frequency. contacts either between two nurs or between nurs and pats accounted for the majority of contacts, both in terms of numbers and of global durations.
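the correlations quoted above (between an individual's number of contacts, cumulative contact duration and number of distinct contacts) are plain pearson coefficients; a minimal sketch of the computation (the function name is ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Illustrative sketch of the r values quoted in the text.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```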
very few contacts occurred between pats: this might be a specificity of wards with mostly single rooms, and other wards in which patients are not alone in a room or in which they move around more might yield more numerous contacts between pats. these results are consistent with previous studies [ , ] carried out in pediatrics, surgery and intensive care units, and provide additional evidence that nurses and assistants may be the most essential target group for prevention measures [ , ] . the detailed information about the number and duration of contacts also allowed us to highlight the presence of a limited number of ''super-contactors'' among hcws who account for a large part of all contacts. a large number of contacts could correspond to different situations, namely to contacts with many different patients, or to many contacts with few patients. our results show that the cumulative number of contacts and the number of distinct persons contacted are correlated; this indicates that in the hospital context under study the super-contactors have contacts with many different patients. they could therefore potentially play the role of super-spreaders, whose importance in the spread of infectious agents has been highlighted both theoretically [ , ] and empirically [ ] . this suggests that their role class should be targeted for prevention measures. these results are consistent with the central role of hcws in hospital wards, as repeated contacts with patients are often necessary for the quality of healthcare. however, since outbreaks of measles and influenza involving this population have been observed [ , ] , the possibility for hcws to be super-contactors emphasizes the need to reduce their exposure to infection and to limit the risk of transmission to patients. this should stimulate the strict implementation of preventive measures including hand washing, vaccination, or wearing of masks [ ] .
in addition, hcws could be warned against the risk posed by unnecessarily large numbers or long durations of contacts, especially with patients. limiting the contacts of hcws (either with pats or with other hcws) might however not be feasible without altering the quality of care. in this respect, the investigation of the temporal evolution of the numbers of contacts may help envision and discuss changes in the organization of care during epidemic or pandemic periods. the numbers of contacts indeed varied greatly over the course of each day, clearly highlighting the periods of the day (here, the mornings) during which transmission could occur with higher probability. the high numbers of contacts during mornings may indicate a potential overexposure to infection for pats and nurs, and one may imagine a different organization toward a smoothing of the number of contacts throughout mornings and afternoons. this would decrease the density of contacts, in particular between nurs, at each specific moment, while maintaining the daily number and duration of contacts between nurs and pats, and overall tend to limit their overexposure [ ] . the potential efficacy of these or other changes in the healthcare organization should of course be tested through numerical simulations of spreading phenomena, and their feasibility would moreover need to be assessed through discussions with the staff. the measurement of contact patterns by means of wearable sensors presents strengths and limitations that are worth discussing. strong advantages are the versatility of the sensing strategy (i.e., the unobtrusiveness of wearable sensors and the prompt deployability of receivers) and the fact that it neither requires the constant presence of external observers nor interferes with the delivery of care in the ward. another strength lies in the high spatial and temporal resolution: behavioral differences across role classes can be detected, and longitudinal studies are possible.
high participation rates are also crucial: similarly to a previous study in another hospital [ ] , the rate of acceptance among hcws and patients turned out to be very high ( %). the information meetings held before the study, providing a clear exposition of the scientific objectives and of the privacy aspects, most probably played an important role in achieving such a high participation rate. the versatility of a system based on wearable sensors and easily deployable data receivers makes it possible to repeat similar studies in different environments and to compare results across contexts [ ] . in particular, several of the reported findings are very similar to those described in [ ] in a different hospital, situated in a different country, and in a different type of ward (paediatric): large variability in the cumulative duration of contacts, small number of contacts between patients, and large numbers and durations of contacts between nurs. repeating measurements in the same ward and in other wards represents an important step towards understanding the similarities and differences of contact patterns in hospital settings, and makes it possible to generalize the observations to more correctly inform models. the measurement approach we used here also has several limitations. contacts were defined as face-to-face proximity, without any information on physical contact between individuals. therefore, the assumption that the number of contacts reflects disease exposure can be appropriate for respiratory infections such as influenza, or for similar diseases that can be transmitted by various routes at a distance of meter around an index case [ ] . the use of close-range proximity as a proxy for the transmission of bacterial infection acquired by cross-transmission, such as s. aureus or enterobacteriaceae, is more questionable.
other factors related to specific attributes of individuals (e.g., vaccination or immunosuppression), of the microbial agent (e.g., resistance or virulence) or of the environment (e.g., specialty of ward) may also alter the relationship between contact frequency/duration and transmission. in this respect, a validation with simultaneous direct observation and human annotation of the contacts would be of particular interest. finally, it is difficult to assess whether individuals modified their behavior in response to wearing rfid badges, but direct observation indicated that hcws were focusing on their daily activities and most probably were not influenced by the presence of the badges. badges were not offered to visitors and this potential external source of infection was not studied. this study complements previous work [ , , , , , ] and provides data that can be used to explore the spread of infection in confined settings through mathematical and computational modeling. models of transmission within hospitals might be based on contact matrices such as those presented here, and used to better understand the epidemiology of different types of microbial agents, to assess the impact of control measures, and to help improve the delivery of care during emergency epidemic situations. in our study, specific mixing patterns were observed between different classes of individuals, showing a clear departure from homogeneous mixing, as expected in a hospital setting, and highlighting the relevance of correctly informed contact matrices. moreover, although an important turnover in the persons in contact with a given individual was observed across different days, and although the average contact durations between different classes of individuals varied between mornings, afternoons and nights, the contact patterns remained statistically very similar across successive days.
these results suggest that, in order to correctly inform computational models, data collected over just a few hours might be insufficient, but that measurements lasting hours would be sufficient to evaluate the statistical properties of contact patterns as well as the mixing patterns between individual classes, and to estimate the similarity between the contacts of an individual across days. the statistical features of the gathered data could then be used to model contact patterns over longer time scales. the scarcity of contact data [ , ] calls for further measurement campaigns to validate and consolidate the results across other hospital units, other contexts, and over longer periods of time. additional data sets would also be useful to build and test proxies that could replace systematic detailed measurement of contact patterns, such as the ones put forward in [ , , ] . in order to explore the relationship between complex contact networks and the spreading of infections, it would be particularly interesting to collect simultaneously high-resolution contact data and microbiological data describing the infection status of participating individuals. combining these heterogeneous sources of information within appropriate statistical models would make it possible to elucidate the relation between the risk of disease transmission and contact patterns, and to disentangle transmission likelihood from contact frequency. finally, feedback of the results to hcws could be an innovative pedagogical tool in health care settings.

file s (zip). datasets s -s : each dataset gives the time-resolved contact network for one day of the study, in gexf format. in each network, a node corresponds to one rfid tag and has an attribute ''role'' that indicates the role of the individual wearing the tag: patient (pat), medical doctor (med), paramedical staff (nur) or administrative staff (adm). each edge has three attributes: ''ncontacts'', the number of contact events between the corresponding rfid tags; ''cumulativeduration'', the total duration of these contacts; and ''list_contacts'', the explicit list of time intervals during which the individuals were in contact.

references:
health-care workers: source, vector, or victim of mrsa?
tuberculosis exposure of patients and staff in an outpatient hemodialysis unit
risk of influenza-like illness in an acute health care setting during community influenza epidemics in
modeling the spread of methicillin-resistant staphylococcus aureus in nursing homes for elderly
community and nosocomial transmission of panton-valentine leucocidin-positive community-associated meticillin-resistant staphylococcus aureus: implications for healthcare
klebsiella pneumoniae bloodstream infections among neonates in a high-risk nursery in cali, colombia
outbreak of rotavirus gastroenteritis in a nursing home
transmission of drug-susceptible and drug-resistant tuberculosis and the critical importance of airborne infection control in the era of hiv infection and highly active antiretroviral therapy rollouts
transmission of pandemic a/h n influenza on passenger aircraft: retrospective cohort study
close encounters of the infectious kind: methods to measure social mixing behaviour
social mixing patterns for transmission models of close contact infections: exploring self-evaluation and diary-based data collection through a web-based interface
comparison of three methods for ascertainment of contact information relevant to respiratory pathogen transmission in encounter networks
social contacts of school children and the transmission of respiratory-spread pathogens
social contacts and mixing patterns relevant to the spread of infectious diseases
using time-use data to parameterize models for the spread of close-contact infectious diseases
collecting close-contact social mixing data with contact diaries: reporting errors and biases
dynamics of person-to-person interactions from distributed rfid sensor networks
simulation of an seir infectious disease model on the dynamic contact network of conference attendees
what's in a crowd? analysis of face-to-face behavioral networks
a high-resolution human contact network for infectious disease transmission
high-resolution measurements of face-to-face contact patterns in a primary school
using sensor networks to study the effect of peripatetic healthcare workers on the spread of hospital-associated infections
close encounters in a pediatric ward: measuring face-to-face proximity and mixing patterns with wearable sensors
modelling disease spread through random and regular contacts in clustered populations
models of epidemics: when contact repetition and clustering should be included
prioritizing healthcare worker vaccinations on the basis of social network analysis
nurses' contacts and potential for infectious disease transmission
superspreading and the effect of individual variation on disease emergence
peripatetic health-care workers as potential superspreaders
severe acute respiratory syndrome - singapore
nosocomial transmission of measles: an updated review
hospital-acquired influenza: a synthesis using the outbreak reports and intervention studies of nosocomial infection (orion) statement
monitoring hand hygiene via human observers: how should we be sampling?
global perspectives for prevention of infectious diseases associated with mass gatherings
cough-generated aerosols of mycobacterium tuberculosis: a new method to study infectiousness
invited commentary: challenges of using contact data to understand acute respiratory disease transmission
modeling and estimating the spatial distribution of healthcare workers
a low-cost method to assess the epidemiological importance of individuals in controlling infectious disease outbreaks

acknowledgments: we are particularly grateful to all patients and the hospital staff who volunteered to participate in the data collection.

key: cord- -yfn sy m
authors: fraser, christophe
title: estimating individual and household reproduction numbers in an emerging epidemic
date: - -
journal: plos one
doi: . /journal.pone.
sha:
doc_id:
cord_uid: yfn sy m

reproduction numbers, defined as averages of the number of people infected by a typical case, play a central role in tracking infectious disease outbreaks. the aim of this paper is to develop methods for estimating reproduction numbers which are simple enough that they could be applied with limited data or in real time during an outbreak. i present a new estimator for the individual reproduction number, which describes the state of the epidemic at a point in time rather than tracking individuals over time, and discuss some potential benefits. then, to capture more of the detail that micro-simulations have shown is important in outbreak dynamics, i analyse a model of transmission within and between households, and develop a method to estimate the household reproduction number, defined as the number of households infected by each infected household. this method is validated by numerical simulations of the spread of influenza and measles using historical data, and estimates are obtained for would-be emerging epidemics of these viruses.
i argue that the household reproduction number is useful in assessing the impact of measures that target the household for isolation, quarantine, vaccination or prophylactic treatment, and measures such as social distancing and school or workplace closures which limit between-household transmission, all of which play a key role in current thinking on future infectious disease mitigation. the household is a fundamental unit of transmission for many directly transmitted infections. in addition, the household provides a ''laboratory'' within which key measures of transmission such as infectiousness, generation time and the effect of immunity or vaccination can be studied [ ] . in recent years considerable effort has gone into understanding the dynamics of transmission within populations organised into households using mathematical models [ , , , , ] . most effort has gone into analysing the asymptotic behaviour of these models, elucidating the threshold levels of transmission required for infection to be self-sustaining, calculating final epidemic sizes, or predicting the impact of generalised or targeted interventions designed to reduce or eliminate transmission. in parallel, methods have been derived to estimate the parameters which govern transmission within the household from detailed case reports [ , , , ] . however, scant attention appears to have been paid to how to apply household-structured models to the analysis of epidemics, either retrospectively or in real time. concurrently, mathematical models have played an ever greater role in interpreting and responding to emerging pathogens. these models have typically been either of the ''simple but tractable'' variety which ignore or average over demographic structure and social mixing patterns [ , ] or the ''complex computer simulation'' variety that capture many details of demographic structure and dynamics, but whose behaviour can only be determined by intensive numerical analysis [ , , ] .
the aim of this study is to develop methods of a perhaps ''slightly less simple but still tractable'' variety that capture some of the detail that micro-simulations have shown is important, but which can be rapidly applied (say on a daily basis) in an emerging outbreak situation, to inform policy. more specifically, the aim is to arrive at a method to estimate the key transmission and control parameters for a model of transmission within and between households from as few detailed observations as are likely to be gathered in the heat of a major outbreak. the resulting analysis will still be based on major simplifications with respect to all the spatial and other social constructs that govern disease transmission, but less so than those based on the very simplest assumption of free, homogeneous mixing. in this context, it should be stated that even in the best, most robustly parameterised microsimulations, gross approximations are made in describing the fabulously complex web of human behaviour, and even they are only attempts to characterise the statistical properties of the system as a whole. extensive effort is, and should continue to be, spent on identifying the conditions where different types of simplification (household models, static network models, spatial metapopulation models…) can and can't be justified, and in developing analytical approximations to describe disease transmission within such simplified structures. individual-based simulations of influenza and smallpox pandemic spread and control, incorporating detailed information on population density, age structure, commuting patterns, workplace sizes and long-distance travel, have highlighted the particular importance of the household as a fundamental unit of transmission [ , , , , ] (and reviewed in [ ] ). pure household models have been used fruitfully to explore detailed policy options in a city-wide response to an influenza pandemic [ ] .
it thus seems a priori that household models are a natural starting point in terms of extending theory previously developed for the simplest assumption of homogeneous mixing. the analysis presented here will focus on deriving new estimators for individual and household reproduction numbers, denoted R(t) and R*(t) respectively. the individual reproduction number R(t) is defined roughly as the average number of people someone infected at time t can infect over their entire infectious lifespan; as i will show below, there are several ways of defining this more precisely. the household reproduction number R*(t) is defined here as the average number of households a household infected at time t can infect [ , ]. the individual reproduction number R(t) rightly plays a privileged role in epidemiology, as it is a meaningful measure within any contact network. however, of the possible summary measures of epidemic progress, it is not necessarily the most useful. for example, for an emerging directly transmitted pathogen, such as pandemic influenza virus, public health interventions may target the household rather than the individual, enforcing household quarantine as well as offering antivirals to the household to limit transmission within the household. in such a situation, the household reproduction number R*(t) is more directly related to the parameters which characterize the intervention, and is thus a better measure of the effect of these interventions. these quantities (R(t) and R*(t)) share the two essential properties of reproduction numbers, namely that they increase when infectiousness increases and decrease when infectiousness decreases (monotonicity), and that they mark a threshold that separates exponentially growing epidemics (when R(t) > 1, or equivalently R*(t) > 1) from exponentially declining epidemics (when R(t) < 1, or equivalently R*(t) < 1) [ , ].
the structure of the paper focuses first on deriving estimators for individual reproduction numbers, then on household reproduction numbers and finally on examples of pandemic influenza dynamics and measles. though less well known than their compartmental counterparts (sir, sis, etc.), time-since-infection models offer a more intuitive starting point for modelling infectious disease transmission, and importantly for this application, they provide two other major advantages. first, it is typically easier to identify their key parameters, and second, they more readily adapt to describe multi-level transmission (by multi-level, i mean here within-household and between-household). a disadvantage is that it can be harder to include heterogeneities. nomenclature is confusing, since both types of model have their origin in the same classic paper of kermack and mckendrick [ ], and both the sir model and the simplest time-since-infection model are known as ''the kermack-mckendrick model''. the model, in the formalism chosen here, predicts the changing incidence rate I(t) as a function of calendar time t in terms of the transmissibility, denoted β(t, τ), an arbitrary function of calendar time t and time since infection τ. β(t, τ) typically reflects pathogen load, or perhaps more precisely pathogen shedding. it is commonly a single-peaked function reflecting pathogen growth followed by immune suppression, or host death, but can be more exotic such as the double-peaked profile associated with early and late transmission of hiv [ ], or the repeated peaks of malaria [ ]. β(t, τ) also reflects the effective contact rate between infectious and susceptible individuals, which can change during the course of a single infection, increasing for example if a person coughs or sneezes due to respiratory disease, or decreasing if a person takes to bed with illness, and during the course of the epidemic as public health measures are implemented.
more discussion of the components (infectiousness and contact) of β(t, τ) can be found in [ ]. because i am interested in outbreaks of emerging infections, i will not describe explicitly reductions in the susceptible population caused by the epidemic. formally this corresponds to working in the infinite population limit. this assumption is not essential for this section however, since β(t, τ) could also be thought of as incorporating the proportion of cases that are susceptible; the assumption becomes more important in the later sections on household models. mathematically, transmission is defined by a poisson infection process such that the probability that, between time t and t+Δ, someone infected a time τ ago successfully infects someone else is β(t, τ)Δ, where Δ is a very small time interval. this assumption then results in a prediction that the mean incidence I(t) at time t follows the so-called renewal equation I(t) = ∫_0^∞ β(t, τ) I(t − τ) dτ. this equation states that the number of newly infected individuals is proportional to the number of prevalent cases multiplied by their infectiousness. it may often be convenient (and realistic) to truncate the function β(t, τ) at a time τ_m such that β(t, τ) = 0 for all τ > τ_m. the asymptotic behaviour of incidence I(t) is determined by reproduction numbers [ , ]. two intuitively defined reproduction numbers are the case reproduction number, which i denote R_c(t), and the instantaneous reproduction number, which i denote R(t). the case reproduction number R_c(t) is a property of individuals infected at time t, and is the average number of people someone infected at time t can expect to infect. for a person infected at time t it is the total infection hazard from time t onwards, i.e. R_c(t) = ∫_0^∞ β(t + τ, τ) dτ. while the case reproduction number has been widely used, it may also be worth considering a quantity which i call the instantaneous reproduction number R(t), a property of the epidemic at time t.
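as a concrete illustration, the renewal equation can be iterated numerically on a daily grid. this is a minimal sketch, not code from the paper: the generation time weights and the values of R used below are hypothetical, and β(t, τ) is taken as separable and constant in calendar time.

```python
import numpy as np

# numerical sketch of the renewal equation I(t) = sum_tau beta(t, tau) I(t - tau)
# on a daily grid, with a separable, time-constant beta(t, tau) = R * w(tau).
# the generation time weights w and the values of R are hypothetical.

def simulate_renewal(R, w, n_days, seed=1.0):
    """iterate daily incidence forward from a single seeded case."""
    incidence = np.zeros(n_days)
    incidence[0] = seed
    for t in range(1, n_days):
        # w[j] is the probability that the generation time equals j + 1 days
        past = incidence[max(0, t - len(w)):t][::-1]  # most recent cases first
        incidence[t] = R * np.dot(w[:len(past)], past)
    return incidence

w = np.array([0.1, 0.3, 0.3, 0.2, 0.1])  # hypothetical, sums to 1
growing = simulate_renewal(R=2.0, w=w, n_days=30)
fading = simulate_renewal(R=0.8, w=w, n_days=30)
```

with R > 1 the series settles onto exponential growth, and with R < 1 it dies out, matching the threshold behaviour described above.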
it is the average number of people someone infected at time t could expect to infect should conditions remain unchanged. it is given by R(t) = ∫_0^∞ β(t, τ) dτ. to illustrate the distinction between R_c(t) and R(t), consider a situation where the transmission rate is abruptly reduced at a time t = t_i. the instantaneous reproduction number R(t), which estimates how many people one case would infect if circumstances were to remain fixed, would abruptly switch from a high to a low value at time t_i. the case reproduction number R_c(t), on the other hand, estimates how many people each case actually infects. it will thus account for the fact that someone infected at a time t < t_i may spend part of their infectious period before and part after the reduction in transmission which occurs at time t_i, and thus R_c(t) will smoothly transition from higher to lower values. to derive simple estimating equations for R(t), i consider the case where this function is separable, which corresponds to saying that the relative progression of infectiousness as a function of time since infection is independent of calendar time. in this case β(t, τ) can be written as the product of two functions ϕ(t) and w(τ), i.e. β(t, τ) = ϕ(t) w(τ). a counter-example might be when reactive patient isolation is introduced and acts to reduce infectiousness in late-stage infection, in which case β(t, τ) can't be decomposed in this way. for this type of situation, it may be reasonable to assume that β(t, τ) can be decomposed separately in different stages of the epidemic, pre- and post-implementation of isolation measures, for example. since β(t, τ) is a product, i can arbitrarily normalise one or the other of the functions ϕ(t) and w(τ), so without loss of generality, i choose w(τ) to have total integral 1, i.e. ∫_0^∞ w(τ) dτ = 1. the function ϕ(t) is then equal to the instantaneous reproduction number R(t).
the function w(τ) then describes how these infection events are distributed as a function of time since infection τ. this is an idealised definition of the generation time distribution, which i denote w(τ). thus, infectiousness can be decomposed as the product of the instantaneous reproduction number and the generation time distribution, i.e. β(t, τ) = R(t) w(τ). the relationship between the idealised generation time distribution w(τ) and the distribution of observed generation times can be rather complex for a number of reasons. first, infections are rarely observed, and thus either infections must be backcalculated or the generation times must be based on a surrogate such as the appearance of symptoms [ , ]. second, right censoring can cause the observed generation times to be shorter or longer than expected for a growing or declining epidemic, respectively [ ]. third, as apparent here, if the reproduction number R(t) changes due to depletion of susceptibles, changes in contact rates or public health measures, then this will also change the observed generation times for infectious individuals during that period of change. thus the distribution w(τ) is really intended as a measure of infectiousness which will correspond to generation times for an index case in an ideal large closed setting where contact rates are constant. it can be inferred from data on the timing of cases, as in [ , ]. inserting ( ) into ( ) yields a novel estimator for the instantaneous reproduction number, R(t) = I(t) / ∫_0^∞ I(t − τ) w(τ) dτ. by substituting the decomposition ( ) into equation ( ), a relation between the instantaneous and case reproduction numbers is obtained, R_c(t) = ∫_0^∞ R(t + τ) w(τ) dτ, i.e. the case reproduction number is a smoothed function of the instantaneous reproduction number. usually, incidence is reported as a discrete time series of the form I_i incident cases reported between time t_i and time t_i+1, in which case the generation time distribution should be appropriately discretised into a form w_i such that Σ_i w_i = 1.
the estimators for the reproduction numbers become R(t_i) = I_i / Σ_j w_j I_(i−j) and R_c(t_i) = Σ_j w_j R(t_(i+j)). equation ( ) was proposed by [ , ] as a real-time estimator of the reproduction number, while equation ( ) was first used for analysing polio transmission in india [ ] (based on the work presented in this manuscript). while the case reproduction number is an intuitively appealing quantity, the instantaneous reproduction number estimated by equation ( ) should also be considered for practical applications as it may suffer fewer problems of right censoring in an incompletely observed epidemic. right censoring is a real problem in using the case reproduction number to track an epidemic in real time, since the estimator for R_c(t) at time t is seen in equation ( ) to rely on knowing the incidence at future time-points. an algorithm to deal with this issue was proposed by [ ], but switching instead to the instantaneous reproduction number estimated by equation ( ) may be a simpler solution. right censoring is not however the only complication associated with estimating reproduction numbers in practice, and is not completely absent from ( ) due to the delay in detecting infections. left censoring may also arise due to not knowing the baseline number infected if an epidemic has been unfolding for some time before observations are recorded. finally, estimating the generation time distribution may not be straightforward. several strategies are possible to deal with the fact that one never observes infections, but rather a time series of cases of the form C_i, where case definitions could be based on symptoms, hospitalisation or seroconversion. one strategy, used in [ ], is simply to ignore this and use cases as surrogates of infection for estimation of both the generation time and the reproduction numbers. often though, it may be possible to characterise a distribution of the time from infection to becoming a case, say j_i, where Σ_i j_i = 1.
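these discrete-time estimators are straightforward to implement. the sketch below is illustrative only: the incidence series and generation time weights are hypothetical, with w_j the probability that the generation time equals j steps.

```python
import numpy as np

# discrete-time estimators sketched from the text (hypothetical data):
# instantaneous R at step i:  R_i  = I_i / sum_j w_j * I_{i-j}
# case R at step i:           Rc_i = sum_j w_j * R_{i+j}  (right-censored at the end)

def instantaneous_R(incidence, w):
    """w[j - 1] is the probability that the generation time equals j steps."""
    R = np.full(len(incidence), np.nan)
    for i in range(1, len(incidence)):
        denom = sum(w[j - 1] * incidence[i - j]
                    for j in range(1, min(len(w), i) + 1))
        if denom > 0:
            R[i] = incidence[i] / denom
    return R

def case_R(R_inst, w):
    """forward smoothing of the instantaneous R; needs future time-points."""
    Rc = np.full(len(R_inst), np.nan)
    for i in range(len(R_inst) - len(w)):
        Rc[i] = sum(w[j - 1] * R_inst[i + j] for j in range(1, len(w) + 1))
    return Rc

w = [0.5, 0.5]                      # hypothetical two-step generation time
incidence = 2.0 ** np.arange(12)    # exactly exponential incidence, I_i = 2^i
R_hat = instantaneous_R(incidence, w)
Rc_hat = case_R(R_hat, w)
```

on exactly exponential incidence both estimators return the same constant value, while the final entries of the case reproduction number remain undefined, illustrating its right-censoring problem.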
if a case is defined by symptoms then this would be the incubation period distribution. one can then backcalculate incidence as Î_i = Σ_j j_j C_(i+j). a drawback of this approach is that the estimated incidence time series Î_i will tend to be over-smoothed relative to the original time series I_i. it also makes clear that there is still a problem of right censoring in an incompletely observed epidemic in the estimator of equation ( ), though less than in equation ( ). statistical properties of these estimators are straightforward [ , ]. one previously noted point [ , ] is that because these estimators are essentially ratios of incidences, they can be used in cases where only a fraction of cases are observed, such as for polio where only a tiny fraction of infections lead to disease (of the order of in ), though the confidence intervals will change. a special case applicable to many settings where surveillance is poor is when only the epidemic growth/decline rate r is known. in this case the incidence takes the form I(t) = I(0) exp(rt) and both estimators ( ) and ( ) for the reproduction number become R(r) = R_c(r) = 1 / ∫_0^∞ exp(−rτ) w(τ) dτ, where the reproduction numbers are now expressed as a function of the exponential rate of change r. this useful formula was presented and studied in detail in [ ], where the links to earlier ecological and demographic modelling were also highlighted. much of the subsequent analysis will concern itself with deriving an equation equivalent to ( ) for the household reproduction number R*(r). the model defined above assumes that the function β(t, τ) describes the ''natural history'' of infection in each infected individual. before specialising to the model of household transmission, it is first worth considering the case where different individuals experience different ''natural histories'', defined here by the susceptibility to infection, and infectiousness after infection.
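the growth-rate formula can be evaluated directly once the generation time distribution has been discretised. a minimal sketch, with hypothetical weights:

```python
import math

# sketch of the growth-rate formula R(r) = 1 / sum_j w_j * exp(-r * j),
# the discretised form of R(r) = 1 / integral exp(-r*tau) w(tau) dtau.
# w[j - 1] is a hypothetical probability mass on a generation time of j days.

def R_from_growth_rate(r, w):
    moment = sum(wj * math.exp(-r * j) for j, wj in enumerate(w, start=1))
    return 1.0 / moment

w = [0.1, 0.3, 0.3, 0.2, 0.1]          # hypothetical, sums to 1
R_flat = R_from_growth_rate(0.0, w)    # a flat epidemic (r = 0) gives R = 1
R_grow = R_from_growth_rate(0.2, w)    # growing epidemic: R > 1
R_decl = R_from_growth_rate(-0.2, w)   # declining epidemic: R < 1
```

the function is monotone in r, so a faster-growing epidemic always implies a larger reproduction number for a given generation time distribution.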
i denote by x = {x_1, x_2, …} a vector of random variables describing factors which influence susceptibility or infectiousness. for example, for the standard seir model of infection the random variables would be the durations of the latent period (l) and the infectious period (d), i.e. x = {l, d}. let f(x) denote the probability distribution of these random variables amongst new infections (taking into account differences in susceptibility), defined such that ∫ f(x) dx = 1, where the integral is taken over the domain of the random variables. in other words, f(x) is the proportion of new infections that have state x. let β(x, t, τ) denote the infectiousness profile of an individual with state x. assuming that all individuals mix homogeneously, the transmission model defined earlier by equation ( ) is generalised to I(x, t) = f(x) ∫_0^∞ ∫ β(x′, t, τ) I(x′, t − τ) dx′ dτ, where I(x, t) is the incidence of infections with state x. i define the function K(t) to denote the integral K(t) = ∫_0^∞ ∫ β(x′, t, τ) I(x′, t − τ) dx′ dτ, which clearly depends only on time t and not on the state x. the total incidence at time t is defined by the integral I_tot(t) = ∫ I(x, t) dx. by substituting equation ( ), which can be rewritten as I(x, t) = f(x) K(t), into equation ( ), i obtain that K(t) = I_tot(t) and thus that I(x, t) = f(x) I_tot(t). i can now substitute ( ) into ( ) to obtain f(x) I_tot(t) = f(x) ∫_0^∞ ∫ β(x′, t, τ) f(x′) I_tot(t − τ) dx′ dτ. dividing both sides of this equation by f(x) yields an equation for the total incidence. if i define the average infectiousness as β̄(t, τ) = ∫ β(x, t, τ) f(x) dx, then equation ( ) can now be seen to be the standard kermack-mckendrick model of equation ( ), i.e. I_tot(t) = ∫_0^∞ β̄(t, τ) I_tot(t − τ) dτ. in other words, in this model of an emerging infectious disease epidemic with heterogeneities in susceptibility and infectiousness, the dynamics of the mean total incidence of infection are exactly equivalent to the basic model where the infectiousness is appropriately averaged using equation ( ). once an expression is derived for the average infectiousness β̄(t, τ), results such as equations ( ) or ( ) can be used without further consideration of the heterogeneities in infectiousness or susceptibility.
heterogeneities which are transmitted or preserved from one infection to the next, for example due to non-random mixing between different risk groups (a situation not considered here), lead to a more complex result. some public health interventions such as isolation and contact tracing can induce such heritability even if it is not a basic property of the transmission process [ , ]. a useful exercise in applying this formalism (not elaborated here) is the derivation of standard formulae for the basic reproduction number as a function of the exponential growth rate r for the seir model [ ]. one approach to estimating household reproduction numbers is simply to switch perspective from individual to household, directly estimate the generation time distribution (the times taken for one household to infect another) and the incidence of infection of households, and apply the results of equations ( ) or ( ) to estimate the reproduction number as a function of time, R*(t), or of the exponential growth rate, R*(r). because, as i have shown, the linearised kermack-mckendrick model is applicable even when susceptibility and infectiousness are heterogeneous, this method is acceptable despite the fact that households may be quite heterogeneous in size and in the number of people infected. one analogous situation where this approach has been used is in estimating farm-to-farm reproduction numbers in the uk foot-and-mouth virus epidemic [ ]. however, unless specifically tailored to this task, it is unlikely the data will be collected in the requisite form for this approach to be used in the human household situation. thus, in this section i explore the alternative approach of explicitly modelling transmission within and between households. homogeneous transmission models can be interpreted as two-level hierarchical models, where the processes which guide the natural history of infection within the host are considered separate from those which drive transmission between hosts.
the link between the two can be thought of as the function β(t, τ), which translates the impact of changing processes within the host into changing infectiousness as a function of time since infection. the approach taken here to modelling household transmission is to study a three-level hierarchical model of transmission. the three levels are within-host, within-household, and between households. the natural history of infection is described by the individual infectiousness function β(t, τ). i assume in this section that individuals are homogeneous in infectiousness and susceptibility. i then use this to predict the course of epidemics within households, and derive a function β*(t, τ*) which describes the average infectiousness of a household towards other households as a function of the time since the household was infected, τ* (from here on, i use starred symbols to denote properties of households, and un-starred symbols to denote properties of individuals). the basic idea behind this analysis is illustrated in fig . to simplify the notation, and because the main aim of this section is to study the case of an epidemic growing exponentially, i consider the situation where infectiousness is independent of calendar time t. this could be relaxed, though only if variation in time is somewhat slower than the typical duration of infection within a household.
more specifically, the model assumptions are that: (i) individuals are distributed into households, and mix randomly and homogeneously outside of their household; (ii) within a small time interval Δ, an individual who was infected a time τ ago infects a person at random in the population with probability β_g(τ)Δ; (iii) within this same time interval he or she infects each susceptible individual in his or her household with probability β_L(τ, n)Δ (this is allowed to depend on the household size n, since empirical evidence suggests such variation may occur [ ]); (iv) the population is large, and the disease has low prevalence, so that the probability of a household being repeatedly infected is negligible; and (v) the functions β_g(τ) and β_L(τ, n) are proportional to each other as functions of the time since infection τ. as a result of the last assumption and of the discussion around equation ( ), the infectiousness functions can be decomposed as β_g(τ) = R_g w(τ) and β_L(τ, n) = R_n w(τ), where R_g is the average number of people each infected individual infects through random (non-household) contacts, w(τ) is the generation time distribution for between-household transmission, and R_n is a parameter describing infection within the household whose interpretation will be clarified below. i start by analysing the process of transmission within a single infected household of size n in terms of the functions R_n and w(τ). consider first a household of size 2, where one individual is infected at time τ* = 0. given the poisson process described by the assumptions listed above, the probability that the second individual has not yet been infected at time τ* is Q(τ*) = exp(−R_2 ∫_0^τ* w(s) ds), and the probability that the second person is never infected is Q(∞) = exp(−R_2). the distribution of times of infection of the second individual, conditional on infection, is then w̃(τ*) = −(∂Q(τ*)/∂τ*) / (1 − Q(∞)), where −∂Q(τ*)/∂τ* is the rate of change of the cumulative probability of not being infected, i.e. the probability density of being infected at time τ*, and the normalising factor 1 − Q(∞) is the total probability of being infected. the difference between w̃(τ*) and the standard generation time distribution w(τ) is a saturation effect, so that the second case tends to get infected earlier as the infectiousness of the index case (R_2) is increased. the infectiousness of the second individual towards other non-household members of the population, conditional on his or her infection, and described as a function of the time τ* since the infection of the household, is thus the convolution of w̃(τ*) and β_g(τ), so that the total infectiousness of the household is β*(τ*) = β_g(τ*) + (1 − Q(∞)) ∫_0^τ* w̃(s) β_g(τ* − s) ds. generalising this exact result to larger households involves some complications. consider for example a household of size 3, where one individual is infected at time τ* = 0. the probability that neither of the other two individuals is infected by the first individual at time τ* is Q(τ*)^2 with Q(τ*) = exp(−R_3 ∫_0^τ* w(s) ds), directly analogous to the situation for households of size 2. however this is somewhat greater than the actual probability that they are not infected at all, since once one of these two is infected, they can also infect the other, and thus the probability that they each escape infection is somewhat less than Q(∞) = exp(−R_3). to progress further with analysing this system, i propose to approximate the process by assuming that infections within a household can be approximately described by a discrete-generation reed-frost model, i.e. where the probability of not being infected in each generation is (q_n)^m when m individuals were infected in the previous generation, and q_n ≡ exp(−R_n) is the escape probability of each infectious-susceptible pair of individuals considered in isolation. in the formalism proposed by ludwig, this corresponds to using infectious rank as a surrogate for infectious generation [ ].
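the chain probabilities of such a reed-frost process can be computed recursively. the sketch below is a generic chain-binomial calculation, not code from the paper, and the escape probability value is hypothetical.

```python
from math import comb, exp

# chain-binomial (reed-frost) probabilities for within-household spread:
# each susceptible escapes a generation containing m infectives with
# probability q**m, where q = exp(-R_n) is the pairwise escape probability.

def chain_probability(chain, n, q):
    """probability of the chain (m1, m2, ..., mk) of generation sizes in a
    household of size n, starting from m1 index cases."""
    p = 1.0
    susceptibles = n - chain[0]
    for m_prev, m_next in zip(chain, chain[1:]):
        escape = q ** m_prev
        p *= (comb(susceptibles, m_next)
              * (1 - escape) ** m_next
              * escape ** (susceptibles - m_next))
        susceptibles -= m_next
    # the chain ends: all remaining susceptibles escape the final generation
    return p * (q ** chain[-1]) ** susceptibles

q3 = exp(-0.9)  # hypothetical pairwise escape probability for households of size 3
chains_size3 = [(1,), (1, 1), (1, 2), (1, 1, 1)]
probs = [chain_probability(c, 3, q3) for c in chains_size3]
mean_cases = sum(p * sum(c) for p, c in zip(probs, chains_size3))
```

summing chain probabilities weighted by the total chain size gives the expected household outbreak size, the quantity that enters the household reproduction number later in the text.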
dynamics are recovered by assuming the times between generations are described by the standard generation time distribution w(τ). the ordering of infection events has no influence on the final number of individuals infected [ ], and therefore this approximation will produce exact results for the final number of people infected in each household. because of the possibility of ''later'' generations preceding ''earlier'' ones, as noted in the case of households of size 3 above, and because of ignoring the saturation effect present in equation ( ) in terms of the actual generation times within households, this approximation will overestimate the time taken for individuals to become infected in the household. because of the general form of the relation between generation time and reproduction number seen in equation ( ), this will result in over-estimates of the household reproduction number R*(r). to provide a counter-balancing under-estimate of R*(r), i also consider an alternative approximation obtained by assuming the same total number of cases as predicted by this reed-frost model, but where all cases are assumed to be infected by the first index case. this is not a formal lower bound, since in the limit of infinite infectiousness within the household, all members of the household will be infected simultaneously upon introduction of the infection into the household. i find however that even for the example of the highly infectious measles virus (below), the under-approximation is sufficient to provide a practical lower bound. the probability of different chains of infection within households can easily be computed from the assumed reed-frost model [ ]. i denote by pr({m_1, m_2, …, m_n} | n) the probability of a chain of infection occurring in a household of size n where m_1 index cases infect m_2 secondary cases, who in turn infect m_3 tertiary cases and so on, up to a maximum of n generations of infection. it is an assumption of the reed-frost model that pr(m_(i+1) | {m_1, …, m_i}, n) = binomial(n − m_1 − … − m_i, 1 − q_n^(m_i)). the second approximation is that the time taken for one infected individual to infect the next is distributed according to the standard generation time distribution w(τ). the time at which someone in the (i+1)-th generation of infection is infected is as a result drawn from the i-th auto-convolution of this distribution, denoted here w^[i](τ*) and defined by the recursive convolution w^[i](τ*) = ∫_0^τ* w^[i−1](s) w(τ* − s) ds, which satisfies w^[1](τ*) = w(τ*). consider now an individual in the i-th generation of infection in the household, and consider this household at a time τ* after the first index case was infected. this individual must have been infected at some earlier time s ≤ τ*, distributed according to the distribution w^[i−1](s). his or her infectiousness to others outside of the household will be given by β_g(τ* − s). thus, by averaging over all possible values of s, the average infectiousness of such an individual in the i-th generation is ∫_0^τ* w^[i−1](s) β_g(τ* − s) ds. thus having averaged over all possible times of infection in the chain of transmission events in the household, infectious households are stratified by their size and by the number of cases in each generation. using the notation defined earlier, i define the state vector x = {n, m_1, …, m_n} of variables which define the infectiousness and susceptibility of the infected household, where n is the household size and m_i is the number of infected individuals in the i-th generation of infection in the household. the infectiousness of a household with this state x towards other households, mediated by random mixing of individuals between households, is the sum of the infectiousness of all the individuals, each given by equation ( ). given that this infection process involves random mixing of individuals outside their household, the distribution of sizes of households which get infected is the so-called size-biased household distribution.
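the auto-convolutions w^[i] are easy to compute on a discrete grid; a sketch with a hypothetical two-point generation time distribution:

```python
import numpy as np

# i-th auto-convolutions of a discretised generation time distribution:
# w^[1] = w and w^[i] = w^[i-1] convolved with w, used to place the i-th
# within-household generation in time. the weights below are hypothetical.

def auto_convolutions(w, n_gen):
    """return [w^[1], w^[2], ..., w^[n_gen]] on progressively longer grids."""
    convs = [np.asarray(w, dtype=float)]
    for _ in range(n_gen - 1):
        convs.append(np.convolve(convs[-1], w))
    return convs

w = [0.5, 0.5]  # hypothetical: generation time of 1 or 2 days, equally likely
convs = auto_convolutions(w, 3)
```

each w^[i] sums to one and has mean i times the individual mean generation time, which is why later generations shift the household's infectiousness to longer times since household infection.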
this is the distribution of sizes one obtains by sampling individuals at random in the population and recording the size of their household, as opposed to the more commonly recorded household size distribution which is obtained by sampling households at random. if k_n denotes the household size distribution, then k̃_n = n k_n / Σ_m m k_m is the size-biased household size distribution. given that a household of size n gets infected, the probability of a chain of infections is given by the reed-frost probabilities pr({m_1, …, m_n} | n). the distribution of the random variables x = {n, m_1, …, m_n} at infection is thus f(x) = k̃_n · pr({m_1, …, m_n} | n). the mean infectiousness of a household is then obtained by averaging the household infectiousness over this distribution f(x). let μ = Σ_i μ_i, where μ_i is the mean of m_i, be the average total number of cases in an infected household. the household reproduction number takes an intuitive and well-known form derived in [ , ], expressed in terms of the parameter R_g as R* = μ R_g, i.e. the household reproduction number is the product of the expected number of infections in a household and the number of people each individual infects outside of their household. the mean household generation time distribution (the time for one household to infect the next) follows from the mean household infectiousness, and the mean generation time for households, T_g*, can be expressed in terms of the individual generation time T_g. the generation time distribution w*(τ*) can be used with the previously defined estimators of reproduction numbers ( )-( ) using household incidence data or just exponential growth rates. the exponential growth rate r for an exponentially growing epidemic is the same whether measured for individual or household incidence. for an exponentially growing or declining epidemic, one obtains the estimator R*(r) = 1 / ∫_0^∞ exp(−r τ*) w*(τ*) dτ*. now consider the integration ∫_0^∞ exp(−r τ*) w^[i](τ*) dτ* = (∫_0^∞ exp(−r τ) w(τ) dτ)^i, where the first equality uses the definition of the auto-convolution, the second is a re-ordering of integrals, the third involves changing variables to u = τ* − s, the fourth is a factorisation and the fifth arises by induction.
the sixth equality uses the definition, from equation ( ), of the individual reproduction number R(r) one obtains when ignoring household structure, i.e. ∫_0^∞ exp(−rτ) w(τ) dτ = 1/R(r). the household reproduction number can then be expressed in terms of the individual reproduction number R(r) as R*(r) = μ R(r) / Σ_i μ_i R(r)^(−(i−1)). examination of equation ( ) immediately reveals that the estimate for the number of people each person infects out of the household is R_g(r) = R*(r) / μ. i have thus derived a simple analytic relation between the individual and household reproduction numbers. both are approximations, ignoring the effects of local saturation on the generation time, which will tend to produce overestimates of the reproduction number. an alternative approximation to the household reproduction number, which provides an underestimate, is found when all secondary household cases are assumed to arise in the second generation, i.e. by setting μ_1 = 1 and μ_2 = μ − 1 in equation ( ). there are two reasons for considering household structure in analysing the pandemic influenza situation. first, influenza transmission is known to be concentrated within the household, and thus parameter estimates which ignore this heterogeneity are likely to be fragile. second, many public health policies for future pandemics are likely to be organised around the household. the net effect of social distancing measures such as school and workplace closures and the cancellation of social gatherings is effectively to reduce transmission out of households (and perhaps inadvertently to increase transmission within them). furthermore, antiviral treatment and prophylaxis and quarantine measures are likely to be targeted at whole households rather than individuals (though restricting families with one suspect case to stay together without any other support is possibly undesirable) [ , , ]. a number of studies have identified the parameters needed to estimate the household reproduction number for influenza [ , , , ].
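the household-level relations of this section can be checked numerically. in this sketch all inputs are hypothetical: μ_i denotes the mean number of household cases in generation i (with generation 1 the index case, so μ_1 = 1), and R is the individual reproduction number implied by the observed growth rate.

```python
# numerical sketch of the household-level relations discussed above:
# the growth-rate estimator R*(r) = mu * R / sum_i mu_i * R**(-(i - 1)),
# and the control threshold 1 - 1/R* for blocking between-household
# transmission. all numeric inputs below are hypothetical.

def household_R_star(mu_by_generation, R):
    """mu_by_generation[i - 1] is the mean number of household cases in
    generation i (generation 1 being the index case, so mu_1 = 1)."""
    mu = sum(mu_by_generation)
    denom = sum(mu_i * R ** (-(i - 1))
                for i, mu_i in enumerate(mu_by_generation, start=1))
    return mu * R / denom

def control_threshold(R_star):
    """fraction of between-household transmission that must be blocked."""
    return 1.0 - 1.0 / R_star if R_star > 1.0 else 0.0

mu_gen = [1.0, 0.6, 0.2]  # hypothetical mean generation sizes
R_star = household_R_star(mu_gen, R=1.5)
```

a degenerate case is a population of single-person households (mu_gen = [1.0]), for which the household and individual reproduction numbers coincide.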
it is important to bear in mind that these parameters could be quite different in future pandemics, and thus that robust methodology may be more useful in responding to new outbreaks than numerical estimates obtained for past outbreaks. while it would be straightforward to use demographic data and exponential growth rates from earlier pandemics combined with interpandemic data on the transmissibility of influenza within households to obtain estimates of R* for historical pandemics, it has not been shown that the within-household transmission parameters for inter-pandemic influenza adequately describe the pandemic situation, so i focus instead on providing illustrative examples using current demographic data (on the household size distribution from the uk) [ ], and recent data on the transmissibility of influenza in modern households [ ]. the household size data from is truncated to size , and i assume that all households of size or greater have size exactly . the data are k_1 = % (i.e. % of households are single-person households), k_2 = %, k_3 = %, k_4 = %, k_5 = % and k_6 = %. the size of the mean household is thus . (the average size of households where households are sampled at random), while the household of the mean individual has size . (the average size of the household to which individuals belong, where individuals are sampled at random). from the french influenza study [ ], i obtain maximum likelihood estimates of the within-household transmission parameter of R_n = . /n^. (which is consistent with the best fit to the tecumseh data [ ] of R_n = . /n^. ). the former study followed seronegative households for a two-week winter outbreak of seasonal influenza. the corresponding escape probabilities are q_2 = . % (i.e. the probability of not being infected by the other household member in a household of size two is . %), q_3 = . %, q_4 = . %, q_5 = . % and q_6 = . %.
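the size-biased household size distribution and the two household-size means discussed above can be computed as follows; the household size fractions used here are hypothetical, not the uk data referred to in the text.

```python
import numpy as np

# size-biased household size distribution: sampling a random *individual* and
# recording their household size gives k_tilde_n = n * k_n / sum_m m * k_m.
# the fractions below are hypothetical, not the uk census data cited above.

def size_biased(k):
    """k[n] is the fraction of households of size n (k[0] = 0, unused)."""
    k = np.asarray(k, dtype=float)
    weights = np.arange(len(k)) * k
    return weights / weights.sum()

k = [0.0, 0.30, 0.35, 0.15, 0.15, 0.05]   # hypothetical k_1 ... k_5
k_tilde = size_biased(k)
mean_household = float(np.dot(np.arange(len(k)), k))                   # households sampled at random
mean_individuals_household = float(np.dot(np.arange(len(k)), k_tilde)) # individuals sampled at random
```

the size-biased mean always exceeds the ordinary mean for any non-degenerate size distribution, which is why the household of the mean individual is larger than the mean household.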
on the scale of other infections, this places influenza as being approximately as infectious as mumps, but a lot less infectious than either varicella-zoster or measles [ ] . by applying the reed-frost model to these data with this distribution of households, i obtain estimates of the average number of infections in each generation of infection of m u , m = . (i.e. the first index case directly infects an average of . people in his or her household), m = . , m = . , m = . and m = . , and thus the estimate for the total expected number of cases in an infected household is m = Σ i m i = . , to be compared to the mean size of . . these calculations are performed in microsoft excel using equation ( ) . there is not yet a consensus on the generation time of influenza [ , , , , ] , with estimates ranging from . days in [ ] to . days in [ ] . i use a gamma distribution with mean t g = . days and standard deviation . days, as reported in [ ] . based on these data, i compare the predicted and simulated infectiousness of households in fig. , which shows the average over all household sizes and compares this to the final analytical approximation given by equation ( ) for b * (t * ), and also the alternate approximation which considers all secondary infections to arise in the second generation of infection; the simulations and the first approximation are clearly in good agreement. individual-based stochastic simulations were programmed using berkeley madonna, and are described in appendix s . for the case of an exponentially growing epidemic, the estimates of the individual and household reproduction numbers, r and r * respectively, are shown in figure , along with the estimate of the number of people one person infects outside their household, r g . for r * , both the under- and over-estimating approximations are shown, along with estimates obtained from the simulated generation time distribution.
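The household infectiousness profile b * (t * ) compared above can be approximated by weighting each within-household generation by its expected size and convolving gamma generation-time distributions (a g-fold convolution of a gamma keeps the scale and multiplies the shape by g). A sketch under that assumption; the generation sizes and generation-time moments below are placeholders, not the fitted values:

```python
import math

def gamma_pdf(t, shape, scale):
    # density of a Gamma(shape, scale) distribution at time t
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)
            / (math.gamma(shape) * scale ** shape))

def household_profile(m, mean_tg, sd_tg, t):
    """Rate of new within-household infections at time t after the index
    case: generation g is approximated by the g-fold convolution of a
    gamma generation-time distribution (Gamma(k, theta) convolved g times
    is Gamma(g*k, theta)), weighted by the expected generation size m[g-1]."""
    k = (mean_tg / sd_tg) ** 2    # gamma shape
    theta = sd_tg ** 2 / mean_tg  # gamma scale
    return sum(mg * gamma_pdf(t, (g + 1) * k, theta)
               for g, mg in enumerate(m))

# placeholder generation sizes and generation-time moments (not fitted values)
m = [0.5, 0.2, 0.05]
profile = [household_profile(m, 2.6, 1.3, 0.1 * i) for i in range(400)]
```

Integrating the profile over time recovers the total expected number of secondary household cases, sum(m), which is a quick sanity check on the construction.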
as expected for this low-infectiousness scenario, the simulated values are closer to the overestimating approximation. the range between these approximations, which bracket the true value, is rather narrow, indicating that the method is predictive. for the ''spanish flu'' h n pandemic, the median growth rate in large us cities was r = . per day [ , ] , with comparable estimates in the uk [ ] . this value also serves as an upper estimate for the spread of the h n pandemic virus in [ ] . based on this growth rate, the estimated individual reproduction number is r = . , while the estimated household reproduction number is r * = . , and thus the out-of-household reproduction number is r g = . . of course, households were bigger in than now, so that the actual value of r * was likely higher than this. these estimates would imply that a proportion /r * = % of between-household transmission would need to be blocked to prevent epidemic spread. figure could provide a rough guide to the likely values of r * and r g for a new influenza pandemic where the rate of exponential growth can reliably be determined. consider someone who is the index case in their household; they would be expected to infect r g = . people out of their household and m = . within their household. this validates assumed proportions of transmission within and between households from earlier simulation studies [ , ] . the sum of these is greater than r since the reproduction number r is an average over different generations of infection within the household. for this value, the estimate of r which takes into account local saturation effects was determined numerically to be r = . . fig. shows that for all values of r, numerically estimated values for r (r ) are close to the curve estimated from application of equation ( ) which ignores local saturation effects. as a final check of the method, epidemics within a community of , households were simulated using an individual-based stochastic model (see appendix s ).
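The step above from an observed exponential growth rate r to an implied reproduction number is the Euler-Lotka relation, which has a closed form for a gamma generation-time distribution. A sketch with illustrative parameter values (not the paper's):

```python
def r_from_growth_rate(r, mean_tg, sd_tg):
    """Reproduction number implied by an exponential growth rate r (per day)
    under a gamma generation-time distribution, via the Euler-Lotka
    equation R = 1/M(-r), where M is the gamma moment-generating function,
    giving the closed form R = (1 + r*theta)**k."""
    k = (mean_tg / sd_tg) ** 2    # gamma shape
    theta = sd_tg ** 2 / mean_tg  # gamma scale
    return (1.0 + r * theta) ** k

# e.g. an assumed growth rate of 0.2/day with an assumed generation time
# of mean 2.6 days and standard deviation 1.3 days
R = r_from_growth_rate(0.2, 2.6, 1.3)
```

Note that a growth rate of zero maps to a reproduction number of exactly 1, and the implied R rises with both the growth rate and the mean generation time.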
i choose r g = . as inferred from an epidemic growth rate of r = . per day, and the other parameters as described above. the exponential rate of growth was then re-estimated directly from the simulated incidence timeseries to be r = . (figure ), close to the predicted value of r = . . this provides further support for the validity of this method, especially since no restrictions were placed on re-infection of households within this small simulated community. as noted above, influenza is relatively uninfectious compared to other common viruses. for a contrasting application of the method, i now focus on measles, which was the most infectious of the pathogens studied in [ ] . measles also has a more peaked generation time distribution, so that generations of infection are more distinct, and to make the contrast with the influenza estimates yet greater, i also use demographic data on household size chosen from the national census in , when household sizes were greater than they are now. this analysis is perhaps a little artificial when applied to measles, since a large proportion of the population will have immunity either due to past infection or vaccination with the live mmr vaccine. the principal motivation is to further test and illustrate the methods in a case where good data on the transmission dynamics within households are available. stratification by household of the recent outbreaks of measles caused by decreasing uptake of the mmr vaccine could reveal whether household heterogeneities should have been accounted for in estimating the changing reproduction number of measles [ ] . the household size data from is truncated to size , and i assume that all households of size or greater have size exactly . the data are k = % (i.e. % of households are single person households), k = %, k = %, k = %, k = % and k = %. the size of the mean household is thus .
(average size of households where households are sampled at random), while the household of the mean individual has size . (average size of household to which individuals belong, where individuals are sampled at random). hope-simpson reported susceptible-infectious escape probabilities of q = . % for mumps, q = % for varicella, and q = . % for measles in under- s [ ] . the results were reported independent of household size, and were regarded as unreliable in over- s.

[figure caption (fragment): ...(t)) is shown, as is the infectiousness of the typical infected household (denoted b * (t * )). this latter curve is obtained by simulating over , epidemics of transmission within households starting from one infected case. the two analytical approximations described in the text are also shown. ''approx '' is the main approximation described, while ''approx '' is the one obtained by assuming that all infections occur in the second generation of infection within the household. parameters are as described in the main text, and the curves are arbitrarily scaled such that each individual infects on average one person outside of the household (i.e. r g = ). doi: . /journal.pone. .g ]

based on applying the reed-frost model to the measles estimate with this distribution of households, i obtain estimates of the average number of infections in each generation of infection of m u , m = . (i.e. the first index case directly infects an average of . people in his or her household), m = . , m = . , m = . and m = . , and thus the estimate for the total expected number of cases in an infected household is m = Σ i m i = . , to be compared to the mean size of . . hope-simpson also reported the intervals between linked cases in households using different case definitions [ ] ; the intervals used here are those for what he regarded as the most reliable case definition, ''maximum rash''. these data are well described by a gamma distribution (not shown). the maximum likelihood estimate of the generation time is t g = .
days with standard deviation . days. based on these data, i repeat the simulations of the previous section on influenza but with parameters for measles in figs , and . figure b shows that, as expected, the average infectiousness of a household is less well approximated by either approximation than for the much less infectious case of influenza. in this case, multiple peaks of infectiousness corresponding to generations of infection within the household can be clearly distinguished, and there are more cases in the second generation of infection than in the first. in terms of predicting the household reproduction number r * , the method is still found to be strongly predictive (as evidenced by the small gap between the upper and lower estimates) and reliable (compared to numerical estimates). while for influenza the simulations were close to the upper approximation, here they are closer to the lower approximation, as expected for the more infectious situation of measles transmission. simulations of transmission within a community of households again validate the approach (figure b). the difference in the shape of the epidemic curve relative to influenza reflects the different shape of the generation time distribution, though the exponential growth rate is the same. new methods were presented to estimate both the individual and household reproduction numbers during an epidemic. the new method presented for estimating the individual reproduction number relates closely to earlier work [ , , ] , but provides an alternative and possibly simpler solution to the problem of incomplete observations during an unfolding epidemic [ ] .
it also provides an alternative and perhaps more satisfying solution than the incidence-to-prevalence ratio method [ , ] to the problem of infections with long generation-time distributions, such as hiv, where epidemiological circumstances can change substantially within the course of a single infection, and thus the case reproduction number represents too much of an average to convey secular changes in behaviour and transmission. nothing in this study challenges the central role of the individual reproduction number as an epidemiological measure; because the empirical measures of reproduction number proposed here and in [ , , ] use incident observed cases as the base, all of the complications in defining the 'typical' or 'eigen' case for structured models, discussed most clearly in [ ] , are neatly sidestepped. what this study does highlight is that much complexity is hidden in effectively defining and estimating the generation time distribution for a structured population. in the case studied here, generation times between individuals are shorter for within-household transmission than for between-household transmission, particularly for more infectious pathogens, and this resulted in systematic biases associated with estimating the reproduction number while ignoring this effect, which were quite substantial in the case of the highly infectious measles virus. the methods presented for the estimation of household reproduction numbers were not affected by this problem in the same way. analytical approximations were derived that bracket estimates between a lower and an upper bound, and numerical simulations showed the range within these brackets to be narrow. these approximations were shown to be robust, but it is worth noting that assumptions are made about the population mixing randomly out of their households, and that results are only valid in the scenario of an emerging pathogen where overall prevalence is low.
the usefulness of these methods is likely to be found in predicting and understanding the impact of household-targeted infection control measures in an emerging epidemic. this actually covers a wide class of interventions since the household is a central living and administrative unit in most populations. decisions regarding isolation, quarantine, vaccination and prophylaxis may often be made for entire households. similarly, school and workplace closures as well as restrictions on leisure activity can be thought of as trying to reduce between-household transmission. analytical approaches are also invaluable in calibrating and providing independent checks on more detailed individual-based micro-simulations, such as [ , , ] . some control interventions require more subtle analyses; for example, it has been shown that vaccinating whole households is not the most effective strategy for a given vaccine coverage rate, and that alternative strategies such as preferentially vaccinating larger households could be considered [ ] . further avenues of research include studying the statistical properties of these estimators for different situations. the assumption made here, that individuals mix nearly homogeneously out of their household, may be an appropriate approximation for describing transmission within a neighbourhood or even a city [ ] , but ultimately one should also consider developing the estimators for more complex demographic situations such as a hierarchy of organisations (household, to village, to region, to country, etc.) or a more complex overlap of households, workplaces and regular social spaces. also of interest is the study of intervention measures, particularly those that respond to the presence of a symptomatic case; the measures of pre-symptomatic transmission presented in [ ] clearly generalise to a household, but analytical results on the efficacy of isolation and quarantine are not evidently obtainable.
the estimators of the household reproduction number have been shown here to be robust on their own terms, but i have not addressed the issue of model misspecification, for example due to inaccurate determination of the generation time distribution or to individual heterogeneity in infectiousness or susceptibility within households. further scenarios could be explored both to test the method with different infections and to address the issue of model misspecification. there are many cases where it may be desirable to quantify household transmission, but where a degree of natural or vaccine-induced immunity may be present in the population, a problem not addressed here. in considering these more complex situations, while it may not be possible to obtain analytic forms for the infectiousness of a household, numerical forms can usually be obtained quickly and still offer benefits over full individual-based micro-simulations in easily exploring a wide range of parameters. finally, the likely practical benefits of estimating household transmission parameters in an emerging epidemic need to be clearly established and communicated, and the most effective ways to enhance data collection protocols to allow their rapid estimation need to be identified.

appendix s : description of the simulations. found at: doi: . /journal.pone. .s ( . mb pdf)

references:
- infectiousness of communicable diseases in the household (measles, chickenpox and mumps)
- the mathematical theory of epidemics
- the effect of household distribution on transmission and control of highly infectious diseases
- a general model for stochastic sir epidemics with two levels of mixing
- preventing epidemics in a community of households
- epidemics with two levels of mixing
- household and community transmission parameters from final distributions of infections in households
- estimating household and community transmission parameters for influenza
- analyses of infectious disease data from household outbreaks by markov chain monte carlo methods
- a bayesian mcmc approach to study transmission of influenza: application to household longitudinal data
- transmission dynamics and control of severe acute respiratory syndrome
- different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
- strategies for containing an emerging influenza pandemic in southeast asia
- containing pandemic influenza at the source
- transmission dynamics of the etiological agent of sars in hong kong: impact of public health interventions
- mitigation strategies for pandemic influenza in the united states
- strategies for mitigating an influenza pandemic
- smallpox transmission and control: spatial dynamics in great britain
- large-scale spatial-transmission models of infectious disease
- reducing the impact of the next influenza pandemic using household-based public health interventions
- a contribution to the mathematical theory of epidemics
- rates of hiv- transmission per coital act, by stage of hiv- infection
- population dynamics of untreated plasmodium falciparum malaria within the adult human host during the expansion phase of the infection
- mathematical epidemiology of infectious diseases: model building, analysis and interpretation
- factors that make an infectious disease outbreak controllable
- a note on generation times in epidemic models
- transmission intensity and impact of control policies on the foot and mouth epidemic in great britain
- new strategies for the elimination of polio from india
- estimating in real time the efficacy of measures to control emerging communicable diseases
- how generation intervals shape the relationship between growth rates and reproductive numbers
- superspreading and the effect of individual variation on disease emergence
- the effectiveness of contact tracing in emerging epidemics
- final size distributions for epidemics
- transmissibility of pandemic influenza
- measles outbreaks in a population with declining vaccine uptake
- definition and estimation of an actual reproduction number describing past infectious disease transmission: application to hiv epidemics among homosexual men in denmark
- is hiv out of control in the uk? an example of analysing patterns of hiv spreading using incidence-to-prevalence ratios
- optimal vaccination schemes for epidemics among a population of households, with application to variola minor in brazil

[figure caption: to check the method for consistency, i simulate ten epidemics of influenza (a) and measles (b) within a fully susceptible community of , households. i use parameters estimated for an epidemic growth rate r = . per day, and condition on nonextinction of the epidemic. in c and d, the natural logarithm of the incidence is compared to the fixed-slope curve r = . predicted by the model (thick line). linear regression through these data yields the estimate r = . in both cases. doi: . /journal.pone. .g ]

key: cord- -gxtvlji
authors: bobrowski, tesia; melo-filho, cleber c.; korn, daniel; alves, vinicius m.; popov, konstantin i.; auerbach, scott; schmitt, charles; moorman, nathaniel j.; muratov, eugene n.; tropsha, alexander
title: learning from history: do not flatten the curve of antiviral research!
date: - -
journal: drug discov today
doi: . /j.drudis. . .
sha: doc_id: cord_uid: gxtvlji

here, we explore the dynamics of the response of the scientific community to several epidemics, including coronavirus (covid- ), as assessed by the numbers of clinical trials, publications, and level of research funding over time. all six prior epidemics studied [bird flu, severe acute respiratory syndrome (sars), swine flu, middle east respiratory syndrome (mers), ebola, and zika] were characterized by an initial spike of research response that flattened shortly thereafter. unfortunately, no antiviral medications have been discovered to date as treatments for any of these diseases. by contrast, the hiv/aids pandemic has garnered consistent research investment since it began and resulted in drugs being developed within years of its start date, with many more to follow. we argue that, to develop effective treatments for covid- and be prepared for future epidemics, long-term, consistent investment in antiviral research is needed.

[author biography: alexander tropsha is a k.h. lee distinguished professor and associate dean for data science at the unc eshelman school of pharmacy, unc-chapel hill. professor tropsha was awarded a phd in chemical enzymology in from moscow state university. his research interests are in the areas of computer-assisted drug design, computational toxicology, cheminformatics, (nano)materials informatics, and structural bioinformatics. he has authored peer-reviewed scientific papers, book chapters, and co-edited two monographs. his research has been supported by multiple grants from the nih, nsf, epa, dod, foundations, and private companies.]

from time immemorial, infectious diseases have ravaged mankind. only years ago, tb was still one of the top three leading causes of death in the usa [ ] . fortunately, science advanced dramatically during the th century, changing the ways in which our society treats infectious diseases. current preventative vaccines and drugs are catered to treat long-lasting and/or chronic infections or infectious diseases that recur annually or on a regular basis, such as hiv, tb, hepatitis c, influenza, and so on. however, the major viral disease outbreaks that have plagued society over the past two decades do not follow this pattern. in fact, they have shown that the scientific community is not adequately prepared to offer or rapidly develop effective treatments when an outbreak happens [ ] . as a consequence, all countries have to adopt nontherapeutic measures to slow the progression of the epidemic and 'flatten the curve' to limit the burden of the disease on the healthcare system and allow better support to severely ill patients [ ] . a recent study in the new england journal of medicine [ ] estimated that the yearly cost of a pandemic could amount to us$ billion being spent worldwide on treatment, control, and prevention efforts. as seen in the current outbreak of the sars coronavirus (sars-cov- ), this cost might be even higher because of restrictions on international trade and travel, closing of businesses, prohibition of large gatherings of people, and other social-distancing strategies [ ] . although these measures do help curb the spread of the disease, they have potentially devastating consequences: it was initially predicted that a large volume of cases over a short period of time would result in the usa having barely enough masks to last even weeks into the pandemic [ ] . there might also be other social and political ramifications: from a quick glance at the google trends data [ ] , one can see that, on super tuesday in the usa in , the popularity of google searches for 'coronavirus' was nearly three times that for super tuesday. also, because of prohibitions on gatherings of more than - people in many cities and states, people might have stayed away from polls out of fear of contracting the virus and might continue to do so in the future, possibly influencing the outcome of the us election [ , ] .

[figure caption: the total number of clinical trials launched per outbreak as a function of time during the first weeks after the outbreak start date. the start date of each epidemic is defined as the date when authorities, such as the who, started listing data on the number of cases. time is normalized for each outbreak according to this start date (week ). we used the deposition dates and numbers of clinical trials as recorded in clinicaltrials.gov. abbreviations: covid- , coronavirus ; mers, middle east respiratory syndrome; sars, severe acute respiratory syndrome.]

the democratization of access to the internet has also facilitated the access of the general public to information through mediums such as major media outlets and even formal and informal data analytics [ , ] . for instance, johns hopkins university hosts a popular, regularly updated map of the reported cases of covid- around the world [ ] using data from the chinese centers for disease control and who situation reports. the new mantra, 'flatten the curve,' is also representative of the newfound exposure of the public to data science and analysis in response to the covid- pandemic [ ] . likewise, the rapid growth of global communications systems has allowed media, government, and scientists alike to quickly access and share a large amount of data. this real-time sharing of information has been unprecedented. although it permits governments to respond rapidly to epidemics through the dissemination of prevention and control methods, it can also facilitate public panic by stoking existing fears about an epidemic [ , ] . this rapid exchange of information applies to scientific data and publications as well: with increased access to the internet, the response of the scientific community has been enriched. we have observed an increasing number of articles being published for successive outbreaks in both peer-reviewed journals and various arxiv (see glossary) preprint servers over the past years.
over the past few months, both peer-reviewed journals [ , ] and arxiv preprint [ , ] servers have been overpopulated with reports on known drugs or clinical candidates with possible anti-sars-cov- activity identified by computational approaches. however, despite many experimental and clinical studies, no effective drugs or treatments have emerged to treat the previous six epidemics of bird flu, sars, swine flu, mers, ebola, and zika as well as, thus far, covid- . this observation begs the question of whether the rapidity and bulk of immediate responses to epidemics are sufficient to enable the development of effective treatments. in this study, we investigated historical data for seven major disease outbreaks of the past two decades: bird flu (h n ), sars, swine flu (h n ), mers, ebola fever, zika fever, and covid- . we assessed the response of the scientific community to these outbreaks over time, in addition to how effective that response was in producing vaccines and small-molecule antiviral drugs. to this end, we analyzed the number of publications, clinical trials, funding levels, and google trends data from the start of these epidemics until the present day. we observed that there has been little success in combatting outbreaks effectively while they were occurring, let alone after they have passed. by contrast, we also observed that these trends were different for hiv/aids, which has received continuous and uninterrupted attention from researchers around the world and for which multiple targeted therapies have indeed emerged. we expect this analysis to provide insights as to how to better mobilize both federal agencies and scientists to find treatments for covid- as well as other future outbreaks.

[figure caption: the evolution of the number of publications during the first weeks after their respective outbreak start dates. the start date of each epidemic is defined as the date when authorities, such as the who, started listing data on the number of cases. time is normalized for each outbreak according to this start date (week ). the data on publications include both peer-reviewed papers and preprints. the data were obtained from pubmed, biorxiv, medrxiv, arxiv, and chemrxiv. abbreviations: covid- , coronavirus ; mers, middle east respiratory syndrome; sars, severe acute respiratory syndrome.]

we evaluated the number of publications (in both peer-reviewed journals and arxiv preprint servers) and the number of clinical trials performed over the course of the epidemic to estimate the engagement and success of the scientific community in response to the seven major outbreaks of the past two decades: bird flu, sars, swine flu, mers, ebola, zika, and covid- . these metrics indicate the velocity with which the scientific community mobilizes to seek solutions to remedy an outbreak, as well as how this velocity correlates with other metrics, such as the number of confirmed cases and the number of people who have died from the disease. in addition, we evaluated the number of unique molecules and/or treatments being tested in clinical trials, because many were replicates of each other. first, we looked at the response of the scientific community on a weekly timescale (figs and ). we examined the number of clinical trials (fig. ) and publications (fig. ) over the first weeks of each outbreak, with week corresponding to the time point when the federal authorities or the who first started reporting the data on the epidemic (for covid- , december , is considered the start date). on this standardized timescale, the number of clinical trials launched for covid- greatly outnumbered that of any of the previous epidemics; the growth rate of publications on covid- was also the highest.
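The standardized week-0 timescale described above amounts to bucketing each outbreak's events by whole weeks elapsed since its start date. A minimal sketch (the dates below are hypothetical, not the actual trial-registration dates):

```python
from datetime import date

def weeks_since_start(events, start):
    """Bucket event dates into whole weeks elapsed since an outbreak's
    start date (week 0 = the week authorities began reporting cases),
    dropping anything dated before the start."""
    counts = {}
    for d in events:
        wk = (d - start).days // 7
        if wk >= 0:
            counts[wk] = counts.get(wk, 0) + 1
    return counts

# hypothetical registration dates against an assumed start date of week 0
start = date(2019, 12, 31)
trials = [date(2020, 1, 2), date(2020, 1, 20),
          date(2020, 1, 22), date(2019, 12, 25)]
by_week = weeks_since_start(trials, start)
```

Applying the same function to each outbreak's event dates, with that outbreak's own start date, yields directly comparable week-indexed counts.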
one can also see that, for more recent epidemics, such as ebola and zika, more clinical trials were launched during the first weeks of the epidemic than had been the case for previous epidemics, such as bird flu and mers (fig. ) . the only exception to this general observation was swine flu, which is an anomaly because h n flu strains had been researched extensively before the start of the outbreak in as a result of the spanish flu pandemic of and other past h n epidemics [ , ] . this is in stark contrast to ebola virus and zika virus, which had caused smaller-scale outbreaks previously, yet had little to no information available on how to treat them [ , ] . it is equally valuable to compare the rate of response of the scientific community to the rate of epidemic growth. as a case study, we chose to compare data on sars to that of sars-cov- (covid- ). interestingly, the number of total publications for sars over time roughly followed the same trend as the number of cases, offset by a short time period of less than a month (fig. ) . likewise, rapid spikes in the number of cases correspond to spikes in the number of publications on sars offset by around the same period ( days). in comparison to sars, covid- does not show clear spikes in the number of publications or clinical trials corresponding to peaks and dips in the number of new cases or deaths (fig. ) . instead, the trends in the number of publications and clinical trials appear to follow the trends in the number of cases and/or deaths. indeed, there is a correlation (r = . ) between the rate at which the virus spreads throughout the population and the intensity of research on covid- , as measured by the number of publications.
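A correlation between case counts and publication counts at a fixed offset, as discussed above, can be computed as a lagged Pearson correlation. A sketch on synthetic weekly series (the data are illustrative, not the study's):

```python
def pearson(x, y):
    # Pearson correlation coefficient of two equal-length sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    dx = sum((a - mx) ** 2 for a in x) ** 0.5
    dy = sum((b - my) ** 2 for b in y) ** 0.5
    return num / (dx * dy)

def lagged_correlation(cases, pubs, lag):
    """Correlate weekly case counts with publication counts shifted lag
    weeks later, quantifying how closely research output tracks the
    epidemic with a fixed delay."""
    if lag:
        cases, pubs = cases[:-lag], pubs[lag:]
    return pearson(cases, pubs)

# synthetic weekly series (illustrative only): publications echo the
# case counts exactly one week later
cases = [1, 2, 4, 8, 16, 32]
pubs = [0, 1, 2, 4, 8, 16]
r_lag1 = lagged_correlation(cases, pubs, 1)
```

Scanning lag over a small range and taking the maximizing value is one simple way to estimate the publication delay behind an epidemic curve.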
although looking for causality in these relationships is nonsensical, the number of new peer-reviewed publications appears to follow roughly the same logarithmic curve as the number of new cases recorded for covid- , whereas the rate of new publications in nonpeer-reviewed arxiv preprints appears to exceed even that of new covid- cases (fig. ) . the number of covid- preprints in arxiv servers surpassed the number of covid- papers in peer-reviewed journals in early march, highlighting the rise of online journals and preprint servers, as well as reflecting the shrinking period of time between the original observations and respective publications. this is also in direct contrast to sars: more papers were published at a faster rate during this pandemic than in the sars epidemic beginning in . for example, eight peer-reviewed papers on covid- had been published by the time there were cases of the disease (january , ), whereas sars only had one paper published by the time there were total cases (april , ) . however, it is necessary to contextualize these observations in terms of the scientific output, not just by the rate of response. in this regard, we observed that the large number of studies conducted on sars notwithstanding, no us food and drug administration (fda)-approved drug or vaccine to treat the disease has been developed in the -year period since the outbreak began (in ). it is of particular interest to look at the evolution of both the number of publications and the amount of research funding from the beginning of each outbreak to-date (fig. ). this analysis shows, perhaps not unexpectedly, that, following the spike of the research interest in the initial phases of each epidemic, progressively fewer research papers were published for previous outbreaks after they ended. this trend probably reflects the lack of special funding for such research and, consequently, the lack of successful therapeutic development against the respective diseases. 
indeed, fluctuations in research funding for each epidemic appear to follow the same spike-like trend as research publications (fig. ) . this lack of research interest and/or funding outside of periods when these outbreaks are occurring is probably partially responsible for the current situation: no approved anticoronaviral medications exist and, as such, the world is frantically looking to repurpose existing drugs, such as chloroquine and hydroxychloroquine, which is premature and, according to at least some reports, could be potentially harmful to patients [ , ] . in addition to quantifying the response of the scientific community to these epidemics, we also collected google trends data on the principal search terms representing each of these diseases. google trends data do not represent the number of google searches during a given time period but rather anonymized, aggregated, and normalized information about the relative proportion of searches on google for a given search term, region, and time period [ ] . this normalization protects against places with larger populations being weighted more heavily. a value of corresponds to the maximum search interest in a topic for the given time and location, whereas a value of corresponds to the search term being half as popular as it was at its peak. google trends as a whole serves as a fairly representative data source for the public perception of different news topics, indicating how interested people appear to be in a given topic over a specific period of time, with rapid peaks in search interest indicating a sudden increase in interest in a topic. first, we gathered and standardized google trends data for each of the epidemics based on their start dates and relative time periods (no google trends data are available before january ; thus, some timepoints for bird flu and sars were not available).
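The Google Trends normalization described above (the peak of a series maps to 100 and every other point is expressed relative to that peak) can be reproduced for any raw interest series:

```python
def trends_scale(series):
    """Rescale a raw interest series to Google Trends' convention:
    100 at the peak, every other point rounded relative to that peak."""
    peak = max(series)
    return [round(100 * v / peak) for v in series]

# a toy raw-interest series: the peak maps to 100, half the peak to 50
scaled = trends_scale([3, 6, 12, 6, 3])
```

Because each series is scaled to its own peak, two scaled series can only be compared in shape, not in absolute search volume, which is exactly the caveat that applies to the Trends comparisons in the text.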
when observing the first months of the outbreaks, the response of the public to each epidemic appeared to peak during the first months (fig. ). for diseases that had been studied previously, such as swine flu and ebola hemorrhagic fever, there was already some search interest before the start date of the epidemic and, thus, the relative search interest in these diseases was higher during the earlier months of the epidemic. these outbreaks were also shorter lived than some of the other diseases, such as mers (infections continue today, although cases peaked in early ) [ ], for which the relative search interest peaked an anomalous number of months into the outbreak. covid- is interesting in that, although it represents a novel virus, its relative search interest peaked at a point close to that of all the previously studied illnesses mentioned earlier and is following an increasing exponential trend (fig. ). the number of cases of covid- worldwide is increasing in a nearly exponential fashion, with more people infected worldwide than in any of these previous epidemics besides h n (swine flu). additionally, for all the diseases mentioned, the respective search interest peaked during the years when the epidemic was most severe (fig. ), so we should expect the search interest for covid- to decrease steadily once the pandemic begins to die down. however, as noted earlier, although we should expect such an evolution of general public interest, research into understanding this disease and developing powerful therapeutics should continue unabated.

before the outbreak of sars-cov- , previous studies had forecasted the re-emergence of a sars-like betacoronavirus [ ]. the potential for sustained transmission of sars-cov- , or for the emergence of another novel betacoronavirus, is alarming but has not previously been sufficient to garner any substantial drug discovery or vaccine efforts.
these transmission dynamics and this potential for wide-scale, future pandemics make covid- distinct from the six previous epidemics examined earlier. thus, the response to the hiv pandemic, which began in , is also a worthwhile comparison to the current covid- response. the change in the number of cases, clinical trials, and publications for hiv is shown in fig. , with marked time points delineating when novel anti-hiv drugs were approved by the fda. it is generally agreed that hiv can now be managed thanks to the powerful pharmacotherapies developed since the pandemic began during the early s. this success is predicated on the significant and constant increase in federal funding for hiv over the course of the epidemic, rising from just a few hundred thousand dollars in financial year (fy) to more than us$ . billion in fy [ ]. this support has allowed the research community to continue to study the disease, accounting for an average of > papers published per year over the past years, as annotated in pubmed (box ).

azidothymidine, better known as azt, was the first drug approved in the usa to treat hiv; it was tested in clinical trials and approved for use in patients in , years after the pandemic began [ ]. azt was originally synthesized for use as an anticancer agent [ ] and was then repurposed against hiv. given the toxicity of azt and the rapid evolution of drug resistance [ ], new compounds targeting various aspects of viral replication were eventually designed, encompassing different classes of antiretroviral drug, such as non-nucleoside/nucleotide reverse-transcriptase inhibitors, protease inhibitors, integrase inhibitors, and fusion inhibitors [ ]. currently, combinations of these different classes of antiretroviral drug are used in what is known as antiretroviral therapy (art) to prevent the development of drug resistance. to date, distinct medications and combination therapies have been approved by the fda, most during the early s [ ].
the first nonrepurposed drugs approved by the fda were lamivudine (a nucleoside reverse-transcriptase inhibitor) and saquinavir (a protease inhibitor), both approved in , nearly years after the start of the pandemic [ ]. later during the pandemic, new drugs were developed in a streamlined process in which lead compounds active against hiv in vitro were structurally modified to improve their efficacy and lower their cytotoxicity [ ]. high-throughput screening (hts) campaigns have also proven useful in identifying existing compounds with promise against hiv [ ] [ ] [ ]. hiv serves as a prime example of how scientists in academia, government, and industry can combine efforts to combat a common threat. even though it took years to name the virus that caused the disease and years to approve the first medication, cumulative scientific efforts ultimately paid off in the form of a diversity of available treatment regimens and a major improvement in life expectancy in most parts of the world [ ].

although sars-cov- might continue to circulate in humans for some time, it differs from hiv in that it has an airborne transmission route and a lower mutation rate because of the proofreading ability of its polymerase [ , ]. the latter is especially important to consider when developing therapies, vaccines, and antiviral medications: a lower mutation rate means a higher probability of new treatments remaining effective for longer periods of time. it is also important to consider whether an infection is chronic or whether recovered individuals can become sick again, examples being hiv and influenza, respectively. epidemics last longer, or can recur, if the pathogen at hand meets these criteria, thus allowing more time for drug discovery to have a tangible output, such as art for hiv and annual vaccines for influenza.
infection with sars-cov- is not known to be chronic, and neither is it known definitively whether secondary infection can occur [ ], but this or a similar virus is predicted to eventually re-emerge, meaning that drugs developed and tested now will be useful for future epidemics [ ]. the sustained transmission of hiv within the human population, coupled with its extraordinary evolutionary rate, has provided a lasting incentive for pharmaceutical companies to continue identifying new antiviral medications to treat hiv [ ]. similarly, the expectation of another covid- pandemic motivates rapid drug discovery efforts now for sars-cov- . there are multiple efforts to identify existing drugs that could be repurposed to combat sars-cov- [ ]. additionally, modern computational techniques, such as quantitative structure-activity relationship (qsar) modeling, molecular docking, and machine-learning approaches, are now being used in covid- drug discovery efforts [ , , ]. it should be expected that achieving success in developing covid- therapies and preparing for future epidemics will require a substantial, lasting, and well-funded research effort. the sheer amount of resources at our disposal and the velocity of the response of the scientific community so far, compared with the successful response to hiv, suggest that the development of drugs to treat covid- in the coming years is attainable.

in recent years, there has been a noticeable uptick in the response of the scientific community to outbreaks. we have observed that the ebola ( - ) and zika ( - ) outbreaks had more publications and clinical trials performed in a shorter time period than previous outbreaks (figs and ). however, the response of the scientific community to the covid- outbreak represents the most rapid response yet, with an unprecedented number of clinical trials and publications, both in peer-reviewed journals and on preprint servers.
however, clinical and research efforts in response to past epidemics have failed to yield therapies during or after the epidemic period; traditionally, it has taken years for effective drugs and vaccines to be developed and approved, well past the point at which they could be clinically useful and/or tested in clinical trials. the speed of the response of the scientific community to major outbreaks has increased over the past two decades, matching the increasing accessibility of information via the internet, but the outcome of this response has not changed. despite the presence of various intergovernmental agencies, such as the who and the united nations, individual nations have traditionally responded to large-scale epidemics in a disjointed fashion, and they desperately lack a pre-established, adequate, continuous, and centralized effort even within their own countries, let alone internationally [ , ]. in the usa in particular, initial measures to predict or surveil disease outbreaks have, alarmingly, been either defunded or downsized over this same period in which the output of publications and clinical trials has increased [ , ].

as seen in the data presented in this study, the viruses that had already been studied before their respective epidemics garnered more publications and clinical trials in a shorter period of time than those that were previously understudied. zika virus and ebola virus had both been discovered long before they caused significant morbidity and mortality around the world; yet they were not studied extensively, and thus the correlated response of the scientific community was inefficient [ ]. had the scientific community had a consistent stream of funding to conduct research on the coronaviruses that have hit humankind twice before the current pandemic, there might have been the prospect of pharmacotherapy on the near horizon [ ].
when preparing for possible future epidemics, it is better to be safe than sorry; for example, the us government keeps a stockpile of tecovirimat, a drug used to treat smallpox, in the event of an act of bioterrorism [ ]. this drug was developed solely to prepare for a future outbreak, given that smallpox was eradicated decades ago thanks to global vaccination efforts. similar prescient efforts are underway to prevent the next pandemic, such as the rapidly emerging antiviral drug discovery initiative (readdi), which aims to create and stockpile a novel broad-spectrum antiviral drug in preparation for the next pandemic [ ].

the history of hiv and aids shows clearly that steady interest and robust investment in the study of a disease yield the desired fruits. although the speed of the response to covid- by both the research community and the public is unprecedented, the world needs to pay more attention to infectious diseases before they have the opportunity to cause lasting economic, social, and physical damage to people around the globe [ ]. we hope that the data and analyses presented in this article will stimulate both the funding agencies (governmental, private, etc.) and the scientific community to maintain their interest in searching for an efficient treatment for covid- , given the comparison with previous large-scale epidemics. given the unprecedented increase in clinical trials and publications, the growing public interest in the disease, and the looming threat of a future outbreak, it is unlikely that these trends will die down in the coming months and years. vaccine development for covid- is occurring at an unprecedented pace: on average, vaccine development takes years, but there is hope of having a vaccine available for emergency use by early [ ]. however, global research efforts need to become more focused if we are to combat this pandemic effectively.
we believe that the ongoing research and clinical trials should be a product of international and intergovernmental collaboration, driven by knowledge discovery and artificial intelligence approaches applied to data from both past and present epidemics. finally, substantial funding on the part of federal and state agencies, along with private foundations, is necessary to support massive research efforts to find a cure or a vaccine more quickly than in the past, and to stay prepared for future outbreaks of viral diseases. it is not enough to say that there are more resources at our disposal now than ever; it is a matter of using these resources effectively. the historical response to hiv sets a precedent for success in the fight against emerging infectious diseases. we shall use this historical precedent and past failings as a guide for the current battle against covid- and all future battles against other, imminent outbreaks.

references:
achievements in public health
the cost and challenge of vaccine development for emerging and emergent infectious diseases
the benefits and costs of using social distancing to flatten the curve for covid-
the neglected dimension of global security - a framework for countering infectious-disease crises
coronavirus social-distancing forces painful choices on small businesses
forecasting covid- impact on hospital bed-days, icu-days, ventilator-days and deaths by us state in the next months
what is google trends data - and what does it mean
covid- likely to weigh on u.s. election turnout, outcomes
coronavirus will change the world permanently. here's how.
politico mag
jhu ( ) covid- dashboard by the center for systems science and engineering
worldometer ( ) coronavirus cases statistics and charts
an interactive web-based dashboard to track covid- in real time
coronavirus: why you must act now
coronavirus: how media coverage of epidemics often stokes fear and panic
misinformation making a disease outbreak worse: outcomes compared for influenza, monkeypox, and norovirus
in silico screening of chinese herbal medicines with the potential to directly inhibit novel coronavirus
rapid identification of potential inhibitors of sars-cov- main protease by deep docking of . billion compounds. mol
a data-driven drug repositioning framework discovered a potential therapeutic agent targeting covid-
computational models identify several fda approved or experimental drugs as putative agents against sars-cov-
multiple reassortment events in the evolutionary history of h n influenza a virus since
the domestic and international impacts of the -h n influenza a pandemic: global challenges
zika virus: history, emergence, biology, and prospects for control
integrating clinical research into epidemic response: the ebola experience
chloroquine and hydroxychloroquine in covid-
advance of promising targets and agents against covid- in china
comparative epidemiology of middle east respiratory syndrome coronavirus (mers-cov) in saudi arabia and south korea
a sars-like cluster of circulating bat coronaviruses shows potential for human emergence
federal funding for hiv/aids: trends over time
a timeline of hiv and aids.
hivgov
aidsinfo ( ) fda-approved hiv medicines
the application of structural optimization strategies in drug design of hiv nnrtis
easy-hit: hiv full-replication technology for broad discovery of multiple classes of hiv inhibitors
evaluating the efficacy and safety of bromhexine hydrochloride tablets combined with standard treatment/standard treatment in patients with suspected and mild novel coronavirus pneumonia (covid- )
covid- ring-based prevention trial with lopinavir/ritonavir
pre-exposure prophylaxis for sars-coronavirus-
hiv epidemiology and the effects of antiviral therapy on long-term consequences
viral evolution and the emergence of sars coronavirus
'immunity passports' in the context of covid-
structure of mpro from covid- virus and discovery of its inhibitors
artificial intelligence and machine learning to fight covid-
mapping the landscape of artificial intelligence applications against
improving health and reducing poverty
trump disbanded nsc pandemic unit that experts had praised
cdc to cut by percent efforts to prevent global disease outbreak
tecovirimat for the treatment of smallpox disease
antimicrobial division advisory committee meeting. fda
readdi: rapidly emerging antiviral drug discovery initiative. readdi
responding to covid- - a once-in-a-century pandemic?
the covid- vaccine development landscape

the authors wish to thank d. adalsteinsson and p. schultz for multiple discussions of the capabilities of the datagraph software used to create the figures, and kennie merz for the suggestion to add the analysis of the hiv/aids pandemic. the authors also acknowledge support from the national institutes of health (grant u ca ) and the biomedical data translator initiative of the national center for advancing translational sciences, national institutes of health (grants ot tr , ot r ).

key: cord- -z pafxok authors: bonasera, aldo; bonasera, giacomo; zhang, suylatu title: chaos, percolation and the coronavirus spread: the italian case date: - - journal: nan doi: .
/ . . . sha: doc_id: cord_uid: z pafxok

a model based on chaotic maps and turbulent flows is applied to the spread of coronavirus in each italian region in order to obtain useful information and help to counter it. we divide the regions into different risk categories and discuss anomalies. the worst cases are confined between the apennine and the alps mountain ranges, but the situation seems to improve closer to the sea. the veneto region gave the most efficient response so far, and some of its resources could be diverted to other regions, in particular more tests to the lombardia, liguria, piemonte, marche and v. aosta regions, which seem to be worst affected. we noticed worrying anomalies in the lazio, campania and sicilia regions, to be monitored. we stress that the number of fatalities we predicted on march has been confirmed daily by the bulletins. this suggests a change of strategy in order to reduce that number, maybe by moving the weaker population (still negative to the virus test) to beach resorts, which should be empty at present. the ratio deceased/positives on april , is . % worldwide, . % in italy, . % in germany, . % in the usa, . % in the uk and . % in china. these large fluctuations should be investigated, starting from the italian regions, which show similarly large fluctuations.

the spread of the coronavirus, or covid- virus, could be compared to the spread of the red weevil (rhynchophorus ferrugineus) in the mediterranean or fires in california. they start in one or more localized places and quickly spread over larger and larger regions until it becomes difficult to stop them. after that, the spread continues to 'affect' more and more regions as long as there is some 'fuel', i.e. palm trees for the red weevil or woods for the fire. this mechanism is similar to physical systems, for instance turbulent flow or chaotic maps [ ] [ ] [ ] [ ] [ ] [ ], where a small perturbation grows exponentially and then saturates to a finite value.
these at first sight different systems share some common features: a small perturbation, which we indicate as d0, grows exponentially with a coefficient γ, the lyapunov exponent, and finally saturates to a value d∞ >> d0. the fact that every chaotic system saturates to a finite value, even though it might be very large, indicates that the 'phase space' is nevertheless limited and reflects some conservation laws, such as energy conservation for a physical system or the number of palm trees for the red weevil. we can write the number of people, for instance positives to the virus (or deceased for the same reason), as:

n(d) = d0 exp(γd) / [1 + (d0/d∞)(exp(γd) - 1)]   (1)

in the equation, d gives the time, in days, from the start of the epidemic, or the time from the beginning of the tests to isolate the virus. at time d = 0, n(0) = d0, the very small value (or group of people) from which the infection started. in the opposite limit, d → ∞, n(∞) = d∞, the final number of people affected by the virus. in a recent paper we analyzed the sars and covid- viruses using the equation written above and fitting the three parameters to the data. the model reproduces the data very well on a daily basis starting from march for the italian case. this might be coincidental, but it is further supported by the analysis of the virus spread in other countries. the first important result that we pointed out is that information on the number of positives to the virus (or fatalities) is not statistically relevant if we do not know the total number of tests performed each day and, possibly, the method chosen to perform the tests. the method of choosing the people to be tested might be biased because of the large number of people affected and the limited number of tests and facilities. we have been able to obtain quickly the total number of daily tests (and other relevant quantities) from https://github.com/pcm-dpc/covid- . in figure we plot the total number of tests performed in all italian regions starting from february , .
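as a quick numerical check of the two limits just stated (n(0) = d0 and saturation at d∞ for large d), here is a minimal sketch of a saturating growth law of this kind, assuming the standard logistic form consistent with those limits; the parameter values are hypothetical, not the authors' fitted numbers:

```python
from math import exp

def n_of_d(d, d0, gamma, d_inf):
    """Saturating growth: ~ d0*exp(gamma*d) at early times, plateau d_inf
    at late times. The logistic functional form is an assumption consistent
    with the limits stated in the text."""
    return d0 * exp(gamma * d) / (1.0 + (d0 / d_inf) * (exp(gamma * d) - 1.0))

# hypothetical parameters: 10 initial cases, growth rate 0.25/day, plateau 1e5
d0, gamma, d_inf = 10.0, 0.25, 1.0e5
assert abs(n_of_d(0, d0, gamma, d_inf) - d0) < 1e-9     # n(0) = d0
assert abs(n_of_d(200, d0, gamma, d_inf) - d_inf) < 1.0 # n(d -> inf) -> d_inf
```

for small d the denominator is close to 1, so the curve reduces to the pure exponential d0*exp(γd) with γ playing the role of the lyapunov exponent.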
the different regions are indicated by different colors and/or symbols. the reason for each color will be discussed in more detail later; we have divided them into dark red, red, blue, cyan and green depending on the probability of finding a positive to the virus in that region, so that dark red gives the highest probability while green is the lowest. notice that the lombardy, veneto and e. romagna regions performed the most tests. the other important feature to notice is the change of slope after day (starting from february , ), which means that after that day the number of tests performed daily more than doubled. this change in the number of tests, and the different numbers for different countries or regions, make it difficult to make predictions on absolute values such as the total number of positives to the virus or any other quantity. thus it is statistically more relevant to define probabilities, for instance the ratio of positives divided by the total number of tests, or the number of fatalities divided by the total number of tests, etc. the smooth curves are the results of the fits using eq. ( ). the fits were performed before march , while the actual data have been updated daily to april , .

equation ( ) is well suited to predict these probabilities once a fit has been performed on some preliminary results; we apply it first to italy as a whole. in figure we show the results of the fit with the parameters fixed on march , . a previous fit after the start of the epidemic was performed on the positives; it is given by the upper (black) points in the figure. as we saw in figure , after about days the number of tests more than doubled and quarantine measures were also taken by the italian government. this led to the new fit of march , given by the cyan points in the figure. since then we have not modified the fit but just added the new daily points, which seem to follow the new fit for the positives up to april , .
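the fitting procedure described above (adjusting the three parameters of the saturating-growth curve to the accumulated counts, then checking whether later daily points follow the frozen curve) can be sketched as a nonlinear least-squares fit. the logistic functional form, the parameter values and the synthetic 'data' below are all hypothetical stand-ins for the real time series:

```python
import numpy as np
from scipy.optimize import curve_fit

def growth(d, d0, gamma, d_inf):
    """Saturating growth: ~ d0*exp(gamma*d) early on, plateau d_inf late."""
    e = np.exp(gamma * d)
    return d0 * e / (1.0 + (d0 / d_inf) * (e - 1.0))

days = np.arange(60, dtype=float)
true_params = (5.0, 0.2, 8.0e4)      # hypothetical "true" epidemic parameters
rng = np.random.default_rng(0)
data = growth(days, *true_params) * (1.0 + 0.02 * rng.standard_normal(days.size))

# three-parameter least-squares fit, rerun daily in the text as new points arrive
popt, _ = curve_fit(growth, days, data, p0=(1.0, 0.3, 5.0e4),
                    bounds=([1e-3, 1e-3, 1e2], [1e3, 1.0, 1e7]))
d0_fit, gamma_fit, d_inf_fit = popt
print(f"estimated plateau: {d_inf_fit:.0f}")
```

comparing d_inf_fit between refits on successive days is the convergence check used later in the text: when successive plateau estimates stop drifting, the fit can be considered convergent.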
as we can see, there seems to be a decrease for the last points. a decrease from the prediction means that the probability of being positive to the virus is decreasing and that social distancing plus other measures are giving some results. if we now analyze the fatalities, we first of all notice that the original prediction is followed by the data and we did not need to perform a new fit on march , at variance with the number of positives. furthermore, the probability for fatalities seems to follow the prediction and little or no decrease is observed. another important quantity that we plot is the ratio deceased/positives, which should be somewhat independent of the total number of tests if the method of choosing the people to be tested does not change. as we can see, the ratio keeps increasing and becomes larger than %. this is much higher than germany for instance, which has a large number of positives as well and can be statistically compared to italy. other countries like spain and the uk show values of the ratio similar to italy's. these large fluctuations might be attributed to different health facilities, ventilators, hospital overcrowding, etc. it is important to coordinate the action in different countries to try to understand the reason for the discrepancies and save many lives.

one possibility for the large number of fatalities in italy with respect to other countries could be a time delay between being tested positive and passing away. in fact, in figure we notice that the ratio is less than % before day , similar to other countries. however, china, germany, spain, the uk and other countries have been countering the virus for more than days already, thus any transient effect should be over. another reason we could explore is hospital overcrowding, which leads to a lack of resources to deal with the emergency. in figure we plot the ratio = fatalities/positives as a function of the number of people hospitalized each day.
a clear correlation between the two variables is visible and we have parameterized it as:

ratio = m0 (x - xcr)^m for x > xcr   (2)

where m0, xcr and m are fitting parameters and are displayed in the figure. in particular, xcr gives a 'critical' value above which the ratio grows quickly. this parameterization is inspired by critical phenomena such as the liquid-gas (second-order) phase transition in normal fluids and also in chaotic maps. if taken literally, this result would imply that hospitals should not admit more than ± people daily; however, as discussed above for the total number of tests, this plot would be more meaningful if we knew how many patients the hospitals can accommodate in normal and safe conditions. in other words, the number of hospitalized people might increase together with the number of hospitals involved, thus to have an unequivocal correlation the total hospital capacity should be known. the latter is not given in https://github.com/pcm-dpc/covid- and, we hope, this will be addressed soon, together with the total number of ventilators available. in any case, we expect that some kind of correlation like that in figure will remain. other, delocalized facilities should be organized to do a preliminary screening and admit to the hospital only the most severe cases, below the capacity limits of each facility. figure shows that at most person out of tests positive to the virus, but being admitted to an overcrowded hospital increases the probability of serious complications, i.e. more than person out of might die (in germany it is roughly out of presently). an overcrowded hospital implies that the number of ventilators (which seem to be the last resource to fight the virus) is not sufficient. equation ( ) gives a very good description of the probability of being infected by the virus but does not give us any hint as to when the virus will stop its deadly action.
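the threshold behavior described above (a ratio that stays flat below a critical daily hospital occupancy xcr and grows quickly above it) can be sketched as a power law above threshold, the canonical form near a critical point. the functional form and the values of m0, xcr and m below are illustrative assumptions, not the authors' fitted numbers:

```python
def ratio_model(x, m0, x_cr, m):
    """Fatalities/positives ratio vs. number hospitalized per day x:
    no excess mortality below the critical value x_cr, power-law growth above."""
    return m0 * max(x - x_cr, 0.0) ** m

# hypothetical parameters: critical occupancy of 1000 patients/day
m0, x_cr, m = 2.0e-5, 1000.0, 1.2

print(ratio_model(800.0, m0, x_cr, m))   # below threshold: 0.0
print(ratio_model(2000.0, m0, x_cr, m))  # above threshold the ratio rises...
print(ratio_model(3000.0, m0, x_cr, m))  # ...faster than linearly in the excess
```

with m > 1 the growth above xcr is superlinear, which is what makes the fitted xcr read as a practical admission limit in the text.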
we could in principle apply the same equation to the number of positives, for instance, without normalizing by the total number of tests. we have shown that this procedure might be meaningless if the number of tests performed daily changes or varies from region to region. however, from figure we notice that after day the total number of tests follows a straight line, which implies that the number of tests per day is constant. thus we may hope that equation ( ) absorbs the trivial increase of daily tests into the parameters, and try to make predictions. this rather empirical method of predicting the evolution of the spread can be justified only by its results: once we gain some confidence in some cases, we can apply it to other cases, keeping in mind that the total number of tests per day must be constant.

in figure we plot the total number of positives, dismissed healthy and fatalities as a function of the day for the italian case. as we can see, the fit reproduces the behavior at longer times rather well and the data seem to saturate. for shorter times (below day ) the model disagrees with the data due to the fact that the number of tests performed daily increased. thus the prediction of figure should be taken with caution, and we hope that at least the order of magnitude is correct. refitting on april , increased the predictions to . e ± ( . e ), ± ( ) and ± ( ) for positives, fatalities and dismissed respectively (in parentheses, the values of the fit of march , ). the values of the fits performed at different times are slightly different, which suggests that the fit is not convergent yet. however, the results are not much different, which is a good sign, together with the decrease observed in figure . we can specialize the previous results to each italian region to get important information on the spreading and also to unveil anomalies, which could indicate new
centers for the epidemic. (this preprint is made available under a cc-by-nd international license; the copyright holder is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity, https://doi.org/ . / . . .) we start with the lombardy region, the most affected by the pandemic.

figure . same as figure , for the lombardy region. the fits were performed on march but the data have been updated to april , .

in figure we plot the probabilities as a function of the day, similarly to the italian case reported in figure . the probability of being infected reaches almost % while the fatalities are up to about %, and the ratio fatalities/positives is almost %, confirming, even if not needed, that lombardy is the most affected region. unlike figure , where a small decrease is observed at later days, lombardy does not show any decrease but seems very close to saturation. these results must be clarified: if this probability reflected the actual population, then person out of would carry the virus. this is not the case, since the method of choosing the people to be tested is biased, i.e. those who show strong signs of infection are tested, until there are no more tests available. this is one reason why some resources should be diverted temporarily to the lombardy region, starting for instance from the veneto region, see figure . the large number of fatalities may be due to hospital overcrowding and lack of equipment, notably ventilators, which seem to be the best tool to fight the infection or at least to give the organism more time to produce antibodies. following the previous result for the italian case, we plot in figure the ratio = fatalities/positives as a function of the number of people hospitalized each day for the lombardy and veneto regions. the behavior and the fit indicated in the figure confirm that lombardy is the region contributing most to the epidemic.
the smaller ratio for the veneto region is consistent with its lower number of hospitalized persons, but it is still higher than in other nations such as germany. in figure we use equation ( ) to predict the total number of people affected by the virus. the fits on april , gave ± , ± and ± for positives, fatalities and dismissed cases respectively. these numbers can be compared to the national case given in the previous section: they contribute more than %. again, the fits are not so good at shorter times due to the changing number of tests performed daily.

following the methods outlined in the previous section, we can summarize the results for each region by looking in detail at all the fits. in figure we plot the probability of being tested positive as a function of time for all the italian regions. the fits were performed using equation ( ) and following the method explained in the previous sections; all were performed on march , . different regions are grouped with different colors and symbols in the figure in order to distinguish them according to the probability. the regions with the highest probability are lombardy, liguria, piemonte, marche and v. aosta. some might come as a surprise, but recall that we are plotting the positives divided by the total number of tests. the veneto region, which is one of the most affected, is represented by the green color, one of the lowest probabilities; this is due to the high number of tests performed in that region, as can be seen in figure . comparing figures and , one could find reasons to shift resources as needed. as a preliminary criterion, we should shift resources from regions with less than % probability of infection (below the cyan color in the figure) to higher-probability regions. this could be done on a temporary basis, say for a week, to see if the probabilities for the most affected regions decrease. of course, ideally, increasing the total number of tests everywhere would be the best solution.
a probability of, say, % means that almost person out of is affected; thus, even in apartments where more than persons live, social distancing and other precautions should be enforced. this virus might be asymptomatic, i.e., we might carry it and show no signs, hence the importance of obtaining better estimates of the probabilities. it is also important to notice that some regions reach the saturation value earlier than others. for instance, the marche (dark red symbols) and e. romagna (red symbols) regions saturate earlier than their respective color groups. a faster saturation might not be good, because it might not give the hospitals enough time to deal with a large number of patients arriving at the same time. the umbria region (cyan symbols) is the one that saturates first, fortunately with a small number of positives and a large number of tests. for reference, the same probability as in figure is about % for germany, https://www.worldometers.info/coronavirus/#countries , which would be the first goal: perform enough tests to be sure that all regions are effectively below % probability. in figure we plot the fatality probability as a function of time. notice that the color grouping is not maintained; in particular, we notice the large 'jump' of the lazio region (cyan symbols), which goes 'three colors up', while there is some 'improvement' for piedmont (dark red symbols). the veneto region, which is very close to the pandemic center (lombardy), has a fatality rate of less than . %. this is the minimum goal, which could be reached by many other regions by improving the support system and maybe moving resources around. it is not clear why the lazio region is performing so poorly.
the 'best performing' regions, like calabria, basilicata and umbria, are located in the central and southern parts of italy, far away from the center of the infection. sicily is also a surprise, being the southernmost region and yet having such a large fatality rate, . %. for comparison, the probability for germany is less than . %. in figure we plot the ratio fatalities/positives, a quantity we have discussed in the previous sections. we mentioned the fact that such a ratio is less than % for germany on april , a value similar to the basilicata region only; but while germany had positives and deaths, basilicata had and cases, respectively. figure . fatalities/positives versus time; see the previous figures for the color codes. for comparison, the same ratio for germany was . % on april , . we investigated in detail the spread of covid- in each italian region. the overall statistics show that the spread is slowing down, but not in some regions like lombardy. the most negative feature is the statistically large number of fatalities as compared to the number of people tested positive to the virus. this is most probably due to hospital overcrowding and a lack of tools such as ventilators and sufficient personal protection equipment, especially for the medical staff, which is on the front line. we cannot exclude that hospitals are a possible source of the infection; maybe the use of other public buildings transformed into temporary hospitals might help. moving higher-risk people, still negative to the virus, to lower-density places like
beach resorts might help. it is especially important, in our opinion, to put the different resources of the regions together to understand the spread more effectively. for instance, regions with a lower probability of infection (less than %) might send some of their testing equipment, unused ventilators and maybe some medical personnel to higher-risk regions on a temporary basis. ventilators and personal protection equipment are the most needed tools. the country of ferrari, lamborghini and ducati, among others, as well as of electronics and fashion, might divert some industrial capabilities to fulfill the emergency in short times. the more we wait, the more people die.

key: cord- -yx oyv authors: amar, patrick title: pandæsim: an epidemic spreading stochastic simulator date: - - journal: biology (basel) doi: . /biology sha: doc_id: cord_uid: yx oyv

simple summary: in order to study the efficiency of countermeasures used against the covid- pandemic at the scale of a country, we designed a model and developed an efficient simulation program based on a well-known discrete stochastic simulation framework, along with a standard, coarse-grain, spatial localisation extension. our particular approach also allows us to implement deterministic continuous resolutions of the same model. we applied it to the covid- epidemic in france, where lockdown countermeasures were used. with the stochastic discrete method, we found good correlations between the simulation results and the statistics gathered from hospitals. in contrast, the deterministic continuous approach led to very different results. we proposed an explanation based on the fact that the effects of discretisation are high for small values, but low for large values.
when we add stochasticity, this can explain the differences in behaviour between those two approaches. this system is one more tool to study different countermeasures against epidemics, from lockdowns to social distancing, and also the effects of mass vaccination. it could be improved by including the possibility of individual reinfection.

abstract: many methods have been used to model epidemic spreading. they include ordinary differential equation systems for globally homogeneous environments and partial differential equation systems to take into account spatial localisation and inhomogeneity. stochastic differential equation systems have been used to model the inherent stochasticity of epidemic spreading processes. in our case study, we wanted to model the numbers of individuals in the different states of the disease, and their locations in the country. among the many existing methods, we used our own variant of the well-known gillespie stochastic algorithm, along with the sub-volumes method to take into account the spatial localisation. our algorithm allows us to easily switch from stochastic discrete simulation to continuous deterministic resolution using mean values. we applied our approaches to the study of the covid- epidemic in france. the stochastic discrete version of pandæsim showed very good correlations between the simulation results and the statistics gathered from hospitals, both on day-by-day and on global numbers, including the effects of the lockdown. moreover, we have highlighted interesting differences in behaviour between the continuous and discrete methods that may arise in some particular conditions. france was hit by the sars-cov- epidemic probably at the beginning of january , the first case being reported on january [ ], and went into lockdown on march [ ] .
in response to the expected reduction of the number of cases, the french government eased the lockdown restrictions on may and eased them again on may (except in the ile-de-france region, where the density of population is very high). these measures were taken to stop the exponential growth of the number of cases, as observed earlier in china [ , ] . the basic reproduction number r tells us the average number of new infections caused by an infective individual, and it describes the exponential growth of the epidemic [ ] . if r is greater than 1, the epidemic will spread; otherwise, when r is less than 1, the disease will gradually fade out [ ] . compared to the r of h n ( . ) [ ] , the reproduction number of covid- indicates a high transmission potential. r was estimated as . [ ] , . [ ] and . [ , ] by different research sources around the world. the world health organization (who) published an estimated r of . to . [ ] . many approaches have already been used to model the covid- epidemic using compartment models and deterministic ordinary differential equations (ode) [ , ] and also to estimate the effects of control measures on the dynamics of the epidemic [ ] . these particular approaches give good results, but they do not take into account the stochastic nature or the spatial aspects of the propagation mechanism. however, stochastic differential equations (sde) have been successfully used to tackle the stochastic aspects of epidemic propagation [ ] [ ] [ ] [ ] . more recently, multi-region epidemic models using discrete and continuous models, taking into account the effectiveness of movement control, have been published [ , ] , as well as sde multi-region models [ ] . stochastic models based on economic epidemiology have been applied to the covid- epidemic, for example in south korea, to determine the optimal vaccine stockpile and the effectiveness of social distancing [ ] .
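the role of the reproduction number described above can be illustrated with a toy generation-based calculation: in a fully susceptible population, each generation multiplies the case count by r. the r values and generation counts below are placeholders, not estimates from the paper.

```python
# Toy illustration of why r > 1 means exponential spread: in a fully
# susceptible population, each generation multiplies the case count by r.
# The values of r and the number of generations are placeholders.
def cases_after(initial_cases, r, generations):
    return initial_cases * r ** generations

growing = [cases_after(1, 2.5, g) for g in range(5)]    # r > 1: grows
fading = [cases_after(100, 0.7, g) for g in range(5)]   # r < 1: fades out
```

with r = 2.5 a single case grows to tens of cases within a few generations, whereas with r = 0.7 even a hundred initial cases gradually die out, which is exactly the dichotomy the text describes.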
approaches using agent-based systems have also been used to model both the stochastic and spatial characteristics of epidemic propagation [ , ] . in agent-based methods, the number of machine instructions needed for each timestep, relative to the size of the data (algorithmic complexity), is at best proportional to the number of agents. those using one agent per individual may need high computing power when used on large populations. these approaches are often applied to smaller areas (mainly towns) than an entire country, and/or use one agent to model a set of individuals ( in [ ] ). population-centred methods have an algorithmic complexity that does not depend on the size of the population, but on the number of rules considered at each iteration (for example, the number of reactions for biochemistry systems). when used on large populations, these methods are much more efficient than entity-centred methods, but they do not take into account the spatial localisation. we adopted here a hybrid model derived from the sub-volumes method, which adds coarse-grained spatial localisation capabilities to the standard stochastic simulation algorithm (ssa) used, for example, in the domain of biochemistry. to increase the computing efficiency, we also used an original variant [ ] of the gillespie algorithm with tau-leaping [ , ] that automatically adapts the proportion of randomness vs. average calculation at each timestep. our implementation allows us to easily switch from this stochastic variant of the ssa to a deterministic continuous solver (dcs), and therefore to compare the two methods. to test our approach, we applied it to the sars-cov- epidemic in france, where relevant data [ , ] were made available throughout the duration of the epidemic.
most of the simulation parameters we used were obtained from statistics gathered in the literature, such as the proportion of cases that needed hospitalisation and the proportion of severe forms among them [ , ] that needed beds in an icu (intensive care unit). the number of infectious individuals and their localisations at the beginning of the epidemic were inferred from statistical data made available by the french government and from the literature [ ] [ ] [ ] . we used our simulation tool to ascertain the effects of control measures on the dynamics of the epidemic and compared the results to the real statistical data. we focused our study of the impacts of the epidemic only on the part of the population that moves on a daily basis: workers, pupils, students, retired people, etc. people in nursing homes were not taken into account, since their environment and way of life are very different. starting from a known initial state, we wanted to compute a stochastic sample of the evolution in time of the number of people in each state of the disease. a transition between such states is often described by a set of probabilistic rules, or by a stochastic automaton. the epidemic spreading can be modeled as a markovian process in the sense that the number of people in each state at time t + ∆t depends only on the numbers at time t (and on other variables that do not depend on t). in most cases, it is not possible to find an analytic solution that gives those numbers as a function of time. fortunately, iterative numerical methods exist. one of them is the gillespie algorithm, frequently used to find the evolution of the quantities of chemical species s(t) = {s_1(t), ..., s_n(t)} that can react according to chemical rules r = {r_1, ..., r_m} and their kinetics k = {k_1, ..., k_m}. starting from the initial value s(0) of the n species, the algorithm computes the values at time t > 0 by iterating the following process:
1. based on the quantities s(t), the rules and their kinetics, compute stochastically at what time each reaction would be triggered, {t_1, ..., t_m}.
2. let r_i be the next reaction: t_i = inf{t_1, ..., t_m}.
3. apply r_i, i.e., update the vector s(t_i) by decreasing the quantities of the substrates of r_i and increasing the quantities of its products.
4. update the time: t ← t_i.

this algorithm gives an exact stochastic trajectory of the system, but can be slow when some reactions are fast: these fast reactions are often triggered, so the time increment at each iteration is small and the number of iterations per simulated second is high. to decrease the computing time, the tau-leaping method uses a fixed timestep, τ. at each iteration, the number of times each reaction is triggered during the time interval τ is stochastically estimated based on the quantities at time t. this method gives an approximation of the stochastic trajectory of the system, which is accurate when τ is small. the value of τ must be chosen large enough to limit the number of iterations, but not so large as to lose precision. the algorithm used in pandaesim, a variant of the tau-leaping gillespie method, is detailed at the end of this section. the population-centred methods such as those presented here share the same constraint: the entities evolving in the environment are considered homogeneously distributed in the environment. in other words, the spatial localisation is not taken into account. the entity-centred approaches, which compute the behaviour of each individual at each timestep, take the spatial localisation of each individual into account, but need much more computing power. to add coarse-grained spatial localisation to our model, we partitioned the territory into sub-regions, in each of which one instance of a population-centred ssa is run. these instances use the same timestep and are synchronised.
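the exact gillespie loop described in steps 1-4 above can be sketched on a single toy infection rule, s + i → 2i, with rate constant k; with one rule, drawing a tentative firing time and taking the earliest reduces to one exponential draw per step. the function name, the rate constant and the populations are illustrative; pandaesim itself uses a tau-leaping variant rather than this exact loop.

```python
import random

# Minimal sketch of the exact gillespie loop from steps 1-4, on a toy
# infection reaction s + i -> 2i with rate constant k (placeholder values).
def gillespie(s, i, k, t_end, rng=random.Random(42)):
    t = 0.0
    while t < t_end and s > 0 and i > 0:
        propensity = k * s * i            # rate of the single rule
        t += rng.expovariate(propensity)  # step 1-2: time of the next firing
        if t >= t_end:
            break
        s, i = s - 1, i + 1               # step 3: apply the reaction
    return s, i                           # step 4 is the t update above

s, i = gillespie(s=1000, i=2, k=0.001, t_end=10.0)
```

each firing moves exactly one individual from s to i, so the total population is conserved; with several rules, a tentative time would be drawn for each and the earliest applied, exactly as in the listed steps.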
the interactions between sub-regions are modelled by taking stochastic samples of the individuals that travel between sub-regions. this is done at a higher time scale, since such travelling is less frequent than the movements inside the original sub-region. most of the individuals that travel go back to their home sub-regions after a variable period of time. thus, the population of each sub-region remains approximately the same, although people enter and leave the sub-region. if this were not taken into account in the model, the populations of the sub-regions would tend to become the same as time goes on. we describe in the next section how this constraint is implemented in our model. the territory studied is partitioned into two levels of geographical organisation: region and sub-region. a region contains at least two sub-regions, a sub-region belongs to only one region, and all the territory is covered (it is a partition). in our case study, france, the first level is the administrative région, each one containing from two to a dozen départements. there are régions and départements in france. of course, this can be applied to any partition of a territory. for example, in england we could use the nine regions for the first level, and the ceremonial counties and greater london for the second level. the population is divided into four age slices: to years old, to years old, to years old and over years old [ ] [ ] [ ] . each of these four sub-populations has its own values for the population parameters (infection immunity, travelling rate, etc.). we used one instance of a population-centred simulation process for each sub-region, with a one-hour timestep. the simulation of the upper level (region) uses a bigger timestep, one day, and mainly processes the people who are travelling to another sub-region. thus, the population distribution is assumed homogeneous inside each sub-region, but can be heterogeneous at the region level and therefore at the level of the entire territory.
depending on their age, and except for ill or hospitalised people, each day people have a probability of travelling from their home place to somewhere else, belonging either to the same region (local travel) or to another region (remote travel). these probabilities are part of the population parameters mentioned earlier. of course, quarantine-type control measures forbid any kind of local or remote travel; people must stay in their respective home sub-regions. the number of people of each age slice leaving their home sub-region is a stochastic sample (or an averaged value for the deterministic continuous solver) of a percentage of the population of this sub-region. for local travel, they are scattered according to the relative population of each sub-region belonging to their region: the more populated sub-regions attract more of the travellers. for remote travel, people go from their home regions to the most populated sub-regions of the other regions, where airports and train stations are. the same method is used to dispatch the travellers according to the relative populations of their destination sub-regions. this way of computing how many individuals travel and where they go is a simple way to keep the population density of each sub-region constant. the sub-region population-centred model is a variant of the widely used susceptible, exposed, infectious and removed (seir) model. we added two states: hospitalised and deceased. the exposed and infectious states have slightly different meanings in our model; they have been renamed asymptomatic and ill ( figure ). unlike ill people, who show symptoms of the disease, recently infected people are asymptomatic hosts, but both are infective. hospitalised patients are also contagious, but to a lesser extent because they are confined inside the hospital. the three red dotted arrows in the figure indicate the potential sources and targets of the infection.
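the dispatch rule described above, scattering travellers over destination sub-regions in proportion to their populations, can be sketched in its deterministic (mean-value) form; the paper's stochastic solver samples these counts instead. the function name and the largest-remainder rounding used to keep the counts integral are assumptions for illustration.

```python
# Sketch of the proportional traveller dispatch described in the text
# (deterministic mean-value variant). Travellers are shared among
# destination sub-regions in proportion to their populations, so the more
# populated sub-regions attract more of them.
def dispatch(n_travellers, destination_populations):
    total = sum(destination_populations)
    shares = [n_travellers * p / total for p in destination_populations]
    counts = [int(s) for s in shares]  # round down first...
    # ...then hand out the remainder to the largest fractional parts
    by_fraction = sorted(range(len(shares)),
                         key=lambda j: shares[j] - counts[j], reverse=True)
    for idx in by_fraction:
        if sum(counts) == n_travellers:
            break
        counts[idx] += 1
    return counts
```

for example, 10 travellers over sub-regions of populations 500, 300 and 200 are split 5/3/2; the stochastic solver would instead draw a sample whose expectation equals these shares.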
we have assumed that people in the recovered state are immune to the virus and therefore cannot be reinfected [ ] . an incubation period of approximately five to six days before the appearance of the first symptoms has been observed [ , ] . consequently, in our model, asymptomatic people are subdivided into six subcategories according to the number of days since contamination. a large majority of cases, around %, present a mild form of the disease, which is probably not even reported. the other cases need hospitalisation, and among them from % [ ] to more than % [ ] present severe forms in which patients need to be admitted to an icu. the duration of the disease, after the incubation period, depends on the age of the patient and on the severity of the form of the disease. in our model it has been set to a maximum of days, and therefore we have subdivided the ill (resp. hospitalised) people into at most subcategories according to the number of days since the appearance of the first symptoms (resp. the date of hospitalisation). people with mild infections recover after a stochastically variable period of time ( to days) that depends on their age. the severe form of the disease is (stochastically) lethal at a rate that also varies with the age of the patient. the deterministic solver uses fixed average values. all these rates, probabilities and average durations are parameters of the model. their values came from, or were inferred from, observed statistics of real cases. as mentioned before, the simulation algorithm uses a one-hour timestep. it mainly computes, in a stochastic way, the state vector, i.e., the number of people in each state and subcategory, at each timestep. there are four state vectors, one for each age slice. of course, these four vectors are not independent, since contagious people can infect susceptible people regardless of age.
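one possible layout for the per-age-slice state vector described above uses day-indexed subcategories for the asymptomatic, ill and hospitalised states. the six-day incubation split follows the text; the maximum disease duration (elided in the source) and the class layout are placeholders for illustration.

```python
from dataclasses import dataclass, field

MAX_DISEASE_DAYS = 17  # placeholder: the paper caps the post-incubation duration

# Sketch of a per-age-slice state vector with day-indexed subcategories,
# as described in the text. There would be one such vector per age slice.
@dataclass
class StateVector:
    susceptible: int
    asymptomatic: list = field(default_factory=lambda: [0] * 6)  # days since contamination
    ill: list = field(default_factory=lambda: [0] * MAX_DISEASE_DAYS)
    hospitalised: list = field(default_factory=lambda: [0] * MAX_DISEASE_DAYS)
    recovered: int = 0
    deceased: int = 0

    def contagious(self):
        # asymptomatic, ill and hospitalised people are all infective
        # (hospitalised patients less so; any weighting is a modelling choice)
        return sum(self.asymptomatic) + sum(self.ill) + sum(self.hospitalised)

sv = StateVector(susceptible=10_000)
sv.asymptomatic[0] = 3  # three freshly contaminated individuals
```

shifting each day-indexed list by one position at the start of a new day implements the "same state, shifted by one day" transitions described later in the text.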
basically, from the value of the state vector at time t, the process computes its new value at time t + τ (here τ = 1 h). thus, starting from a known initial value of the state vector at t = 0, we can obtain its value at any time t_end > 0 by iterating this process until t_end is reached, or until a specific value of the state vector is reached. pandaesim automatically stops the simulation when there are no more infective people. our model assumes that people have uniform daily routines. without specific measures, the daily schedule begins at o'clock in the morning for work (or school, university, etc.), with the use of public transportation for one hour. next come three hours at work, followed by a two-hour midday break, four hours at work in the afternoon, another hour in public transportation to go back home, and the remaining hours at home. we defined four possible environments, each one with its own probability of contagion: home, public transportation, workplace and restaurant. these parameters have default values that reflect the local concentrations of people: very low at home, higher at work and at the restaurant, and much higher in public transportation. to reduce the number of parameters, we used the same value for the workplace and the restaurant. many kinds of measures can be used to slow down the propagation of the epidemic; we implemented two examples of such measures:

1. soft quarantine: people do not use public transportation at all and do not go to restaurants during the midday break.
2. full quarantine: this corresponds to what actually happened in france; people were confined at home except for a one-hour stroll per day in low-populated areas (public parks, forests, etc., were forbidden). again, to reduce the number of parameters, we assumed that the probability of contagion during the stroll was the same as at work.
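the uniform daily routine and per-environment contagion factors described above can be encoded as a small lookup table. the hours (the source elides the starting hour) and the factor values are placeholders, not the paper's calibrated parameters.

```python
# Sketch of the uniform daily routine and per-environment contagion
# factors described in the text; hours and factor values are placeholders.
CONTAGION_FACTOR = {
    "home": 0.1,       # low local concentration of people
    "work": 0.5,       # also used for restaurants, to reduce parameters
    "transport": 1.0,  # much higher concentration
}

# (start_hour, end_hour, environment) for an ordinary working day
DAILY_SCHEDULE = [
    (7, 8, "transport"),   # one hour of public transportation
    (8, 11, "work"),       # three hours at work
    (11, 13, "work"),      # two-hour midday break, same factor as work
    (13, 17, "work"),      # four hours at work
    (17, 18, "transport"), # one hour back home
    (18, 31, "home"),      # remaining hours at home (wraps past midnight)
]

def location_factor(hour):
    h = hour if hour >= 7 else hour + 24  # fold the night into the last slot
    for start, end, env in DAILY_SCHEDULE:
        if start <= h < end:
            return CONTAGION_FACTOR[env]
    return CONTAGION_FACTOR["home"]
```

the soft-quarantine measure would amount to replacing the "transport" slots (and the restaurant break) with "home" entries; full quarantine keeps people at home except for the one-hour stroll, weighted like "work".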
this also allows us to take into account errands made to get food in more populated places such as grocery stores or supermarkets. starting from an initial state (the number of contagious people in each sub-region), the simulation algorithm iterates the following process at each timestep until either the epidemic ends or the maximum duration of the simulation is reached (defaults to days). first, the infection rate at time t, i_rt(t), is computed as the product of the global daily rate of infection, g_dri(t), and the infection factor of the current location (home, workplace, public transportation), l_inf(t). this infection rate i_rt(t) is used the same way the propensity is in the standard ssa. then, for each of the four age slices, the deterministic continuous solver computes the average number of individuals of that age that will go from the susceptible to the asymptomatic state, avnew_asympt, as the product of the population in that state and the infection rate at time t: avnew_asympt = susceptible(t) · i_rt(t). the stochastic discrete solver (sds) computes stochastic integer numbers such that, in the long run, they average to the same values as those of the continuous solver. even when the population is an integer number of individuals, this product, avnew_asympt, is generally a floating-point number, because the infection rate is itself a floating-point number. this number has an integral part (≥ 0) and a fractional part (between 0 and 1). the (discrete) number of new asymptomatic hosts is then computed as the integer part of the average number, plus 1 if a uniform random number drawn in the interval [0, 1] is below the fractional part. as the difference is 0 on average, the higher the value is, the lower the relative impact of this stochastic discretisation becomes, and the result is equivalent to a discrete averaged approach. conversely, the lower the value is, the more important the stochastic discretisation becomes.
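the stochastic rounding described above, which is also what lets the simulator switch between the stochastic discrete and deterministic continuous solvers, can be sketched as follows; the function name is hypothetical and the rate value is a placeholder.

```python
import random

# Sketch of the stochastic rounding described in the text: the continuous
# solver keeps the floating-point mean, while the stochastic solver adds 1
# to the integer part with probability equal to the fractional part, so
# both agree on average.
def new_asymptomatic(susceptible, infection_rate, stochastic, rng):
    av = susceptible * infection_rate  # avnew_asympt, generally a float
    if not stochastic:
        return av                      # deterministic continuous solver
    integral = int(av)
    fraction = av - integral
    return integral + (1 if rng.random() < fraction else 0)

rng = random.Random(0)
draws = [new_asymptomatic(1000, 0.0023, True, rng) for _ in range(10_000)]
mean_draw = sum(draws) / len(draws)    # converges to the continuous value 2.3
```

each stochastic draw is either 2 or 3, but their long-run average matches the continuous value, which is exactly why the relative impact of the discretisation fades as the populations grow.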
this mechanism allows the simulator to automatically choose the best strategy to adapt to the value range of the population [ ] . finally, when the current time indicates the beginning of a new day, t ≡ 0 (mod 24 h), individuals in each state either remain in the same state, shifted by one day, or change to another state. all the state transitions are computed stochastically by the sds (or deterministically by the dcs) using the method described earlier.

• the population in the asymptomatic state that has reached the last day of the incubation period is moved to the first day of the ill state.
• according to the illness-duration-by-age-slice parameter, a proportion of the population in the ill state is moved to the hospitalised or to the recovered state. the others remain in the ill state one more day.
• according to the disease-severity-by-age-slice parameter, a proportion of the population in the hospitalised state is moved to the deceased or recovered state. the others remain in the hospitalised state one more day.

the global daily rate of infection is then simply computed by multiplying the constant of propagation of the virus, k_prop, by the proportion of the total population that is contagious: g_dri(t) = k_prop · contagious(t) / total population. by fitting the simulation results after the beginning of the lockdown to the data gathered from hospital statistics, we empirically found a good estimate of k_prop for the sars-cov- of . . we think that, to use pandaesim to model another type of epidemic, only this constant, along with the severity parameters, needs to be changed. we applied our simulation tool to the sars-cov- epidemic in france. we used the partitions of the country into régions and départements for the regions and sub-regions of our model. most of the parameters we used were gathered from the literature and from statistical data made available by the french government.
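the two rate computations described above, the global daily rate of infection and the location-weighted infection rate used like an ssa propensity, reduce to two one-line functions. the k_prop value used below is a placeholder, not the fitted one.

```python
# Sketch of the rate computations described in the text.
# g_dri(t) = k_prop * contagious(t) / total_population
def global_daily_rate(contagious, total_population, k_prop):
    return k_prop * contagious / total_population

# i_rt(t) = g_dri(t) * l_inf(t), used like a propensity in the standard ssa
def infection_rate(contagious, total_population, k_prop, location_factor):
    return global_daily_rate(contagious, total_population, k_prop) * location_factor
```

for instance, with 100 contagious people in a population of 10,000 and a placeholder k_prop of 0.3, the global daily rate is 0.003, doubled in an environment whose location factor is 2.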
a few others were obtained empirically, mainly the number of contagious people in each région at the beginning of the simulations, and the constant of propagation of the sars-cov- . the per-age values of the percentage of lethality [ ] , illness duration and percentage of local and remote travellers are shown in table a , the various rates of contamination in table a , and the initial number of contagious people in each département in table a in appendix a. in order to test our population-centred algorithm, we first ran simulations without countermeasures and without any travel possibility, either local or remote. these simulations were run using successively the stochastic discrete solver and the deterministic continuous solver. when the initial number of contagious people was relatively high, for example in the val-de-marne sub-region ( ), the results for both solvers were nearly identical: deaths for the average of stochastic runs and deaths for a deterministic run (figures and ) . the standard deviation for these runs went from ≈ at the beginning of the simulations (with a few tens of deaths) to ≈ at the peak of the infection (a few thousand deaths), and then ≈ at the end. the same kind of results appeared for the ill people, with the maximum value of the standard deviation, ≈ , reached on the th day, with , ill people. on the other hand, when the initial number of contagious people was low, as in loiret ( ), the dcs did not find any deaths, whereas runs of the sds showed two distinct behaviours: of these runs showed the same results as the dcs, no deaths at the end of the epidemic; the other runs took another direction, leading to deaths on average with a standard deviation of ≈ ( figure ). the reasons for this apparent inconsistency are explained in the discussion section.
using the countermeasure applied in france (lockdown), the simulations showed us retrospectively that the probable date at which there was a total of contagious people in france (beginning of the simulations) was approximately the end of january . this correlates with the period of time when the first deceased person was reported ( january). the view of the main window of pandaesim shown in figure displays the real numbers of deceased people in each département. the map shown in figure displays the mean values of runs of a stochastic simulation. the overall results are very close: , for the real statistics and , for the mean value of the simulations. the département-by-département results are also fairly close, except for a few départements, but the orders of magnitude are more or less identical. to determine whether there is a form of convergence of the stochastic trajectories to average values, we ran hundreds of simulations and computed the mean value of the number of deaths (and of the other states) at each timestep, in each département. the results showed no unique limit values, but the averages obtained with many runs stayed inside a range of values near the real statistics. we also ran pandaesim using the deterministic continuous solver with the same parameters. the results were completely different: the epidemic ran only for days ( to weeks less) and reported deaths (figure ) , far from the , obtained with the stochastic simulations. the département-by-département results are also very different, with more than half the départements showing no deaths at all. again, probable reasons for this inconsistent behaviour are proposed in the next section. we developed a hybrid model and simulation programme derived from standard models and simulation techniques widely used in the fields of epidemic propagation and biochemistry.
our approach used an original variant of the gillespie ssa with tau-leaping, in which the inner algorithm can easily be switched from stochastic discrete to deterministic continuous. this allowed us to compare these two methods of simulation. to test our approach, we applied it to the sars-cov- epidemic in france, for which relevant data were available. we also tested the consequences and the efficiency of the lockdown countermeasure applied in france for days. in order to gain spatial localisation while keeping an efficient population-centred algorithm, in which the population is assumed homogeneous, we partitioned the territory into relatively small units, for each of which an instance of the population-centred simulation was run. the movements of populations between these units were taken into account at a higher scale, with a larger timestep. we first tested one instance of our population-centred algorithm, where no countermeasure was used. using each method (sds and dcs) with the same parameter values, we compared the results in two different situations: (i) with a moderately high number, and (ii) with a very low number of initially contagious people. when the numbers were relatively high, the results of both methods were very similar. this was not surprising, because at each timestep the absolute value of the increment computed by each method must be significantly higher than 1, and the stochastic rounding to the inferior or superior integer cannot be relatively very far from the floating-point value computed by the continuous method. however, when the numbers are low, the absolute value added at the next timestep is only a little higher than 0, and the stochastic rounding to 0 or to 1 therefore drastically changes the future trajectory. this is particularly important in this very case, where the populations experience exponential growth.
this may look like chaotic behaviour, since a small difference in initial conditions can lead to very different futures, but when the numbers grow, the importance of this switch effect is dampened. we used many simulation batches with initially only two contagious individuals in the sub-region. the results of , , and simulations showed approximately the same proportion of cases, ≈ %, ending with no death at all, while the rest of the batch converged to approximately deaths. the same model using the dcs shows no deaths at all. we think this behaviour is a consequence of a bifurcation due to the high non-linearity of the system. when the number of contagious individuals is below a certain threshold, the contagion tends to fade, but if this number goes over the threshold, a kind of positive feedback increases it until a large enough part of the total population is removed. if we assume that the initial number of contagious individuals in our example ( ) is below the threshold, the result shown by the dcs is therefore correct. due to both its discrete increments and its stochastic behaviour, the sds can sometimes compute a trajectory that goes above the threshold and switches the other way. in order to deepen the study of this bifurcation phenomenon, we tried to find the approximate value of the threshold. first we used the dcs with the initial number of contagious individuals varying from to . no deaths were found up to ; then deaths from to ; and deaths for and above. then we did the same tests with sds runs, counting the number of runs leading to zero deaths and, in the other case, the average number of deaths. with initially to contagious individuals, the number of runs leading to no deaths decreased from to ; with six and above initially contagious individuals, no simulations led to zero deaths any more. for all the runs not leading to zero deaths, the average number of deaths was ≈ . the threshold for the sds is therefore somewhere below .
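the extinction-versus-outbreak bifurcation described here can be reproduced with a toy stochastic s-i-r model run many times from two initial contagious individuals. all parameter values below (r0, recovery rate, fatality ratio, population size) are illustrative placeholders, not the paper's calibrated values.

```python
import math
import random

def sample_poisson(lam, rng):
    # knuth's multiplicative method; adequate for the rates used here
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1

def run_once(i0, rng, r0=2.5, gamma=0.1, ifr=0.01, pop=2000):
    """one stochastic s-i-r trajectory with integer counts, returning
    the final number of deaths.  r0, gamma, ifr and pop are
    illustrative assumptions, not the paper's values."""
    beta = r0 * gamma
    s, i, deaths = pop - i0, i0, 0
    while i > 0:
        new_inf = min(s, sample_poisson(beta * s * i / pop, rng))
        new_rec = min(i, sample_poisson(gamma * i, rng))
        deaths += sum(rng.random() < ifr for _ in range(new_rec))
        s -= new_inf
        i += new_inf - new_rec
    return deaths

rng = random.Random(42)
runs = [run_once(2, rng) for _ in range(300)]
extinct_fraction = sum(d == 0 for d in runs) / len(runs)
# a minority of runs dies out before any death occurs; the rest
# converge to a similar, much larger death count, as in the text
```

starting below the threshold, a fixed proportion of stochastic runs fizzles out with zero deaths while the others take off, mirroring the batch results reported above; a deterministic run from the same state gives a single intermediate answer.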
as expected, this value is very low. then we tested the whole simulator with all the population-centred processes running independently for timesteps in each sub-region and then synchronised by exchanging a portion of each population, either stochastically or deterministically. again, depending on the type of solver chosen, and for the reasons mentioned earlier, the results were different, but not by too much. since the number of people travelling from a given sub-region is a (small) fraction of the total population of this sub-region, the consequences in terms of infection spreading are very dependent on the value itself: below , it is amplified by the stochastic processing, or else smoothed by the continuous calculation. both the global results and the sub-regions' local results were found to be very similar using the two methods. this can be explained by noticing that sub-regions with low initial contagious populations "benefit" from the migration of contagious people from more populated sub-regions, and as no countermeasure is applied, the number of contagious people rapidly grows over the threshold. the main difference appears in the shape of the global curves: the deterministic solver showed a bigger dependency on the propagation effect (figure ). since the dates at which sub-regions had their peaks of contamination were very different, the propagation effect was slower. although the global number of deaths is approximately the same ( , for the dcs, , for the sds), the slope of the curve obtained with the sds is steeper than the one obtained with the dcs (figure ). this can be explained by the relative sequentiality of the infection peaks shown by the continuous solver, whereas with the stochastic solver all the peaks are almost simultaneous and the resulting aggregate curve is therefore steeper. for our last test, we set the simulator with the equivalent of the lockdown countermeasure used in france.
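the synchronisation step used in these whole-simulator tests can be sketched as follows, assuming uniform all-to-all mixing between sub-regions. the paper's actual migration matrix, per-département travel rates and larger synchronisation timestep are not reproduced; this only illustrates the mechanism.

```python
def synchronise(regions, travel_frac):
    """exchange travellers between sub-regions (uniform all-to-all
    mixing; assumes at least two regions).  each region is a dict of
    compartment counts, e.g. {'s': ..., 'i': ...}.  the deterministic
    (fractional) transfer is shown; a stochastic variant would round
    each flow to an integer count, which is where the two solvers
    diverge when flows are below one individual."""
    n = len(regions)
    outflow = [{k: v * travel_frac for k, v in r.items()} for r in regions]
    for idx, r in enumerate(regions):
        for k in r:
            r[k] -= outflow[idx][k]
            # arrivals: an equal share of every other region's outflow
            r[k] += sum(outflow[j][k] for j in range(n) if j != idx) / (n - 1)
    return regions
```

with a small travel fraction, a sub-region with no contagious individuals still receives a fractional (or, stochastically, an occasional integer) inflow of infectious travellers, which is how low-population sub-regions are seeded in the experiments above.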
the effect of this countermeasure was to decrease the number of contagious people, and while the sds gave results that correlate with the real statistics (figure ), the dcs did not work well, mainly because the initial number of contagious people was too low to be taken into account (figure ). more than half the départements did not show any death and the total number of deaths was therefore largely underestimated. we speculate that if we start from an initial state in which there are enough contagious people in most sub-regions, it is very likely that the dcs will yield reliable results. this study gave us the opportunity to compare two different methods of obtaining the trajectory of a complex system. at the beginning we were confident that they would yield very similar results, but the facts proved us wrong. the reasons for the divergent behaviour of the stochastic discrete algorithm on the one hand and of the deterministic continuous algorithm on the other hand lead us to be more confident in the stochastic approach for the simulation of this particular epidemic-spreading model. more generally, with this type of model, an exponential growth phase is very sensitive to any variation, even small, in the initial values, and to artefacts or calculation errors, and can therefore sometimes exhibit chaotic behaviours. nevertheless, this hybrid approach, in which an efficient population-centred process plays the role of an agent in a multi-agent system, seems very promising. the stochastic simulations' results were very similar to the real statistics gathered from hospital data. future work could include improvements to the simulator such as the implementation of other types of countermeasures, the use of more accurate methods to model the behaviour of individuals, and the use of different types of sub-regions to reflect their diversity. in this study we assumed no possible reinfection, so the epidemic effectively stopped after a certain amount of time.
while it simplifies the model, this assumption rules out the possibility of modelling further waves of infection. recent publications discussed the consequences of different transmission scenarios, with and without permanent immunity, that can lead to multiple waves of infection [ ]. an interesting perspective would be to include in our model a probability of reinfection in order to test the effectiveness of countermeasures. funding: this research received no external funding. acknowledgments: many thanks to martin davy at sys diag for the early version of the parameter dialog box and the gathering of information about the sars-cov- . the authors declare no conflict of interest. the following abbreviations are used in this manuscript: in order to fit the simulation results to the real statistics, we estimated the number of asymptomatic hosts in each sub-region (départements) at the beginning of the simulations (table a ). per-age values of the percentage of lethality (extrapolated from [ ]), illness duration, and percentage of local and remote travellers (table a ). rates of contamination according to the location, percentage of hospitalised patients who can infect healing people, and proportion of severe forms of the illness (table a ). first cases of coronavirus disease (covid- ) in france: surveillance, investigations and control measures portant réglementation des déplacements dans le cadre de la lutte contre la propagation du virus covid- .
legifrance the effect of human mobility and control measures on the covid- epidemic in china an investigation of transmission control measures during the first days of the covid- epidemic in china on the definition and the computation of the basic reproduction ratio r in models for infectious diseases in heterogeneous populations preliminary estimation of the basic reproduction number of novel coronavirus ( -ncov) in china, from to : a data-driven analysis in the early phase of the outbreak early estimation of the reproduction number in the presence of imported cases: pandemic influenza h n - in new zealand early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia novel coronavirus -ncov: early estimation of epidemiological parameters and epidemic predictions nowcasting and forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study transmission interval estimates suggest pre-symptomatic spread of covid- coronavirus latest: scientists scramble to study virus samples transmission dynamics of the covid- outbreak and effectiveness of government interventions: a data-driven analysis the effectiveness of quarantine and isolation determine the trend of the covid- epidemics in the final phase of the current outbreak in china centre for the mathematical modelling of infectious diseases covid- working group the effect of control strategies to reduce social mixing on outcomes of the covid- epidemic in wuhan, china: a modelling study the behavior of an sir epidemic model with stochastic perturbation the long time behavior of di sir epidemic model with stochastic perturbation a stochastic sirs epidemic model with infectious force under intervention strategies a stochastic differential equation sis epidemic model a multi-regional epidemic model for controlling the spread of ebola: awareness, treatment, and travel-blocking optimal control approaches a multi-regions sirs discrete 
epidemic model with a travel-blocking vicinity optimal control approach on cells role of media and effects of infodemics and escapes in the spatial spread of epidemics: a stochastic multi-region model with a study on herd immunity of covid- in south korea: using a stochastic economic-epidemiological model epidemic spreading in urban areas using agent-based transportation models an open-data-driven agent-based model to simulate infectious disease outbreaks hsim: an hybrid stochastic simulation system for systems biology a general method for numerically simulating the stochastic time evolution of coupled chemical reactions stiffness in stochastic chemically reacting systems: the implicit tau-leaping method données en santé publiques info coronavirus covid clinical characteristics of coronavirus disease in china critical care utilization for the covid- outbreak in lombardy, italy: early experience and forecast during an emergency response cluster of covid- in northern france: a retrospective closed cohort study the french connection: the first large population-based contact survey in france relevant for the spread of infectious diseases cmmid covid-working group, estimating the infection and case fatality ratio for coronavirus disease (covid- ) using age-adjusted data from the outbreak on the diamond princess cruise ship estimating the asymptomatic proportion of coronavirus disease (covid- ) cases on board the diamond princess cruise ship reinfection could not occur in sars-cov- infected rhesus macaques the incubation period of coronavirus disease (covid- ) from publicly reported confirmed cases: estimation and application serial interval of covid- among publicly reported confirmed cases projecting the transmission dynamics of sars-cov- through the postpandemic period this article is an open access article distributed under the terms and conditions of the creative commons attribution key: cord- -bkkzg h authors: blanco, natalia; stafford, kristen; lavoie, 
marie-claude; brandenburg, axel; gorna, maria w.; merski, matthew (center for international health, education, and biosecurity, institute of human virology, university of maryland school of medicine, baltimore, maryland, usa; department of epidemiology and public health, university of maryland school of medicine; nordita, kth royal institute of technology and stockholm university, stockholm, sweden; biological and chemical research centre, department of chemistry, university of warsaw, warsaw, poland) title: prospective prediction of future sars-cov- infections using empirical data on a national level to gauge response effectiveness date: - - journal: nan doi: nan sha: doc_id: cord_uid: bkkzg h predicting an accurate expected number of future covid- cases is essential to properly evaluate the effectiveness of any treatment or preventive measure. this study aimed to identify the most appropriate mathematical model to prospectively predict the expected number of cases without any intervention. the total number of cases for the covid- epidemic in countries was analyzed and fitted to several simple rate models including the logistic, gompertz, quadratic, simple square, and simple exponential growth models. the resulting model parameters were used to extrapolate predictions for more recent data. while the gompertz growth models (mean r = . ) best fitted the current data, uncertainties in the eventual case limit made future predictions with logistic models prone to errors. of the other models, the quadratic rate model (mean r = . ) fitted the current data best for ( %) countries as determined by r values. the simple square and quadratic models accurately predicted the number of future total cases and days in advance respectively, compared to only days for the simple exponential model. the simple exponential model significantly overpredicted the total number of future cases while the quadratic and simple square models did not.
these results demonstrated that accurate future predictions of the case load in a given country can be made significantly in advance without the need for complicated models of population behavior and generate a reliable assessment of the efficacy of current prescriptive measures against disease spread. on march , the world health organization (who) declared the novel coronavirus outbreak (sars-cov- causing covid- ) as a pandemic more than three months after the first cases of pneumonia were reported in wuhan, china in december, . from wuhan the virus rapidly spread globally, currently leading to ten million confirmed cases and half a million deaths around the world. although coronaviruses have a wide range of hosts and cause disease in many animals, sars-cov- is the seventh named member of the coronaviridae known to infect humans . an infected individual will start presenting symptoms an average of days after exposure but approximately % of infected individuals remain asymptomatic , . furthermore, almost six out of infected patients die globally due to covid- . currently, treatment and vaccine options for covid- are limited . there is currently no effective or approved vaccine for sars-cov- although a report from april noted active vaccine projects, most of them at exploratory or pre-clinical stages . as the virus is transmitted mainly from person to person, prevention measures include social distancing, self-isolation, hand washing, and use of masks. strict measures of quarantine have been shown as the most effective mitigation measures, reducing up to % of expected cases compared to no intervention . nevertheless, to evaluate the actual effectiveness of any mitigation measure it is necessary to accurately predict the expected number of cases in the absence of intervention. while there has been some early concern about the ability of sars-cov- to spread at an apparent near exponential rate , real limitations in available resources (i.e. 
susceptible population) will reduce the spread to a logistic growth rate . logistic growth produces a sigmoidal curve (figure ) where the total number of cases (n) eventually asymptotically approaches the population carrying capacity (nm), which for viral epidemics is analogous to the fraction of the population that will be infected before "herd immunity" is achieved , . this is represented in derivative form by the generalized logistic function (equation ): dn/dt = r·n^α (1 − (n/nm)^β)^γ, where α, β, and γ are mathematical shape parameters that define the shape of the curve, and r is the general rate term, analogous to the standard epidemiological parameter r , the reproductive number, which is a measure of the infectivity of the virus itself , . for a logistic curve where α = ½ and β = γ = 1, one gets quadratic growth with n = (rt/2)^2, while for α = β = γ = 1, this equation can be rearranged to quadratic form (equation ). traditionally, the number of cases that will occur in an epidemic like covid- is modeled with an seir model (susceptible, exposed, infected, recovered/removed), in which the total population is divided into four categories: susceptible - those who can be infected; exposed - those who are in the incubation period but not yet able to transmit the virus to others; infectious - those who are capable of spreading disease to the susceptible population; and recovered/removed - those who have finished the disease course and are not susceptible to re-infection or have died. for a typical epidemic, the ability of infectious individuals to spread the disease is proportional to the fraction of the population in the susceptible category with "herd immunity" , and extinction of the epidemic occurs once a limiting fraction of the population has entered into the recovered/removed category . however, barriers to transmission, either natural before knowing the actual outcome) are preferable to retrospective analysis in which effectiveness is gauged after the results of the prescriptive actions are known , .
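the growth models compared in this study can be written as plain functions of time. the forms below are sketches consistent with the definitions in the text, with placeholder parameter names; they are not the authors' fitting code.

```python
import numpy as np

def simple_exponential(t, n0, k):
    # n = n0 * exp(k t): unbounded growth
    return n0 * np.exp(k * t)

def quadratic(t, a, b, c):
    # n = a t^2 + b t + c: the "quadratic (parabolic)" rate curve
    return a * t**2 + b * t + c

def simple_square(t, a, c):
    # n = a t^2 + c: the user-defined "simple square" model
    return a * t**2 + c

def logistic(t, nm, n0, r):
    # solution of dn/dt = r n (1 - n/nm), the α = β = γ = 1 case;
    # n approaches the carrying capacity nm asymptotically
    return nm / (1.0 + (nm / n0 - 1.0) * np.exp(-r * t))

def gompertz(t, nm, b, k):
    # n = nm * exp(-b exp(-k t)); approaches nm more slowly,
    # giving the long-tailed epidemic described in the figure legend
    return nm * np.exp(-b * np.exp(-k * t))
```

early in the curve (n much smaller than nm) the logistic and exponential forms are nearly indistinguishable, which is why apparent exponential growth at the start of an epidemic is consistent with an underlying logistic process.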
this study aimed to evaluate whether a simple model was able to correctly prospectively predict the total number of cases at a future date. we found that fitting the case data to a quadratic (parabolic) rate curve for the early points in the epidemic curves (before the mitigation efforts began to have effects) was easy, efficient, and made good predictions for the number of cases at future dates despite significant national variation in the start of the infection, mitigation response, or economic condition. data on the number of covid- cases was downloaded from the european centre for disease prevention and control (ecdc) on june , . countries that had reported the highest numbers of cases in mid-march (and russia) were chosen as the focus of our analysis to minimize statistical error due to small numbers. the total number of cases for each country was calculated as a simple sum of that day plus all previous days. days that were missing from the record were assigned values of zero. the early part of the curve was fit and statistical parameters were generated using prism (graphpad), using the non-linear regression module with the program's standard centered second-order polynomial (quadratic), exponential growth, and gompertz growth models as defined by prism, and a user-defined simple square model (n = at^2 + c), where n is the total number of cases, a and c are the fitting constants, and t is the number of days from the beginning of the epidemic curve. the beginning of the curve (si table ) was defined empirically among the first days in which the number of cases began to increase regularly. typically, this occurred when the country had reported fewer than total cases. the early part of the curve was defined by manual examination looking for changes in the curve shape and later confirmed by r values for the quadratic model.
prospective predictions for the number of cases were done by fitting the total number of covid- cases for each day starting with day and then extrapolating the number of cases using the estimated model parameters to predict the number of cases for the final day for which data was available (june , ) or to the last day before significant decrease in the r value for the quadratic fit. fit parameters for the gompertz growth model were not used to make predictions if the fit itself was ambiguous. acceptable predictions were defined as being within a factor of two from the actual number (i.e. predictions within - % of the actual total). a simple exponential growth model is a poor fit for the sars-cov- pandemic: the total number of cases for each of countries was plotted with time and several model equations were fit to the early part of the data before mitigating effects from public health policies began to change the rate of disease spread. in total, ( %) countries showed mitigation of disease spread by june (figure ). when the early, pre-mitigation portion of the data was examined for all countries, the gompertz growth model had the best statistical parameters (mean r = . ± . , table ) although a fit could not be obtained for the data from countries and many of the fit values for nm were unrealistic compared to national populations (e.g. china and india had predicted nm values corresponding to . % and . % of their populations respectively (si table )). fitting was also incomplete for the generalized logistic model for all countries underlining the difficulty in applying this model. on the other hand, the simple models were able to robustly fit all the current data, with the quadratic (parabolic) model performing the best (mean r = . ± . ) and the exponential model the worst (mean r = . ± . )( table ). in only three ( %) countries did the exponential model have the best overall r value among the simple models. 
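the prospective prediction procedure (fit the first d days, extrapolate to the target day, judge the result by the factor-of-two criterion) can be sketched with an ordinary polynomial fit. the data below are synthetic and exactly quadratic for illustration, not the ecdc counts used in the study.

```python
import numpy as np

def predict_final(cases, fit_days, target_day):
    """fit a quadratic to the first `fit_days` daily totals and
    extrapolate to `target_day`; returns (prediction, r_squared).
    a sketch of the procedure, using numpy rather than prism."""
    t = np.arange(fit_days)
    y = np.asarray(cases[:fit_days], dtype=float)
    coefs = np.polyfit(t, y, 2)          # a t^2 + b t + c
    fitted = np.polyval(coefs, t)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return np.polyval(coefs, target_day), r2

# synthetic quadratic-growth case counts (illustrative only)
days = np.arange(60)
cases = 3.0 * days**2 + 50.0
pred, r2 = predict_final(cases, fit_days=20, target_day=59)
# the study's accuracy criterion: within a factor of two
accurate = 0.5 <= pred / cases[-1] <= 2.0
```

tracking the returned r2 value day by day also gives the curve-bending signal discussed later: a sustained drop in the quadratic fit's r2 indicates that the rate of infection is beginning to subside.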
furthermore, the trend of the overall superiority of the gompertz model, followed by the quadratic, was also observed in the standard error of the estimate statistic. the mean standard error of the estimate (sy.x, analogous to the root mean squared error for fits of multiple parameters) value for the countries was for the gompertz model, for the quadratic model, for the simple square model and for the exponential model (table ). likewise, plots of the natural log of the total number of cases in the early parts of the epidemic (ln n) with time are significantly less linear (as determined by r ) than equivalent plots of the square root of the total number of cases (n^(1/2)) (si table , si figs , ). while logistic growth models have been widely used to model epidemics , , uncertainties in estimates of r (and therefore the population carrying capacity nm) make prospective predictions of the course of the epidemic difficult , (figure , table , si table ). here we define predictions as accurate when they are within a factor of two ( - %) of the actual outcome. for most countries, the simple exponential model massively overpredicts the number of future cases. predictions generated more than days prior were more than double the actual number of cases for ( %) countries examined. in fact, for ( %) countries, the exponential model made at least one overprediction by a factor of greater than , -fold, while the quadratic and simple square models made no overprediction by more than a factor of . and . , respectively (i.e. using the first days of data from portugal, the exponential model predicts million cases while the quadratic, simple square, and gompertz growth models predict , and cases respectively; total cases were actually observed, and the total population of portugal in was . million). predictions using the quadratic and simple square models were much more accurate.
only in four ( %) countries does the quadratic model ever overpredict the final number of cases by more than a factor of two, while the simple square model overpredicts by a factor of two for only one ( %) country (si table ). for the quadratic model, the mean maximum daily overprediction was a factor of . -fold (median . -fold), while for the simple square model the mean maximum daily overprediction was . -fold (median . -fold). both of these models produced much more accurate predictions than the simple exponential model (table ). the (table ), and this may account for the conflation of the course of the sars-cov- pandemic with truly exponential growth. that the exponential growth constant term, k, is constantly decreasing after day in ( %) countries (si fig. ) further indicates the overall utility of logistic models, which were explicitly developed to model a constantly decreasing rate of growth due to consumption of the available resource (i.e. the susceptible population pool of the sir model). but, while logistic models are implicitly the correct model, they are difficult to fit accurately during the early portion of an epidemic due to inherent uncertainties in the mathematical shape parameters (equation ) of the curve itself and in the population carrying capacity for sars-cov- , nm, which still has a significant uncertainty as the virus has only recently moved into the human population. herd immunity is defined as 1 − 1/r , and since current estimates for r vary from . to . , this implies that - % of the population will need to have contracted the disease and developed immunity in order to terminate the epidemic. a discrepancy of this size will significantly affect predictions based on logistic growth models. here we note the utility of the quadratic (parabolic) and simple square models in predicting the course of the pandemic more than a month in advance. the simple exponential model vastly overpredicts the number of cases (fig. , table ).
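the herd-immunity arithmetic above is simple enough to state directly. the r0 endpoints below are illustrative, since the exact estimates cited in the text are not preserved here.

```python
def herd_immunity_fraction(r0):
    # fraction of the population that must be immune before each case
    # infects, on average, fewer than one new person: 1 - 1/r0
    return 1.0 - 1.0 / r0

# illustrative bounds only -- published r0 estimates for sars-cov-2
# span roughly this range, and the elided values in the text may differ
low = herd_immunity_fraction(2.0)    # -> 50% of the population
high = herd_immunity_fraction(6.5)   # -> ~85% of the population
```

the spread between the two endpoints illustrates why uncertainty in r0 translates directly into a large uncertainty in the carrying capacity nm, and hence into unreliable logistic extrapolations.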
the gompertz growth model, while often making largely correct predictions, often generates wildly inaccurate estimates of the population carrying capacity nm (si table ), and the generalized logistic model simply fails to produce a statistically reliable result with the currently available data. overestimation of the future number of cases will cause problems because the failure of the predicted number of cases to materialize may be erroneously used as evidence that poorly implemented and ineffective policy prescriptions are reducing the spread of sars-cov- , which may lead to political pressure for premature cessation of all prescriptive measures and, inevitably, an increase in the number of cases and excess, unnecessary morbidities. fortunately, the quadratic model produces accurate, prospective predictions of the number of cases (fig. , table ). use of this model is simple, as it is directly implemented in common spreadsheet programs and can be applied without much difficulty or technical modeling expertise. in theory, this model can also be applied to smaller, sub-national populations, although the smaller number of total cases in these regions will undoubtedly give rise to larger statistical errors. in no way does the empirical agreement between the quadratic model and empirical data negate the fact that the growth of the sars-cov- epidemic is logistic in nature in all countries (table , si table ). we expect that the suitability of these empirical quadratic fits is related either to the quadratic form of the slope of the generalized logistic function, or to the limitation of the virus to a physical radius of infectivity around infectious individuals, or to the fact that it is still early in the pandemic, as no country has yet officially logged even % of its population as having been infected, or to all three. of course, the true number of covid- cases is a matter of debate, as there is speculation that a significant fraction of infections are not being identified.
however, because this method is focused on the rate of case growth over time, the errors that lead to any undercounting within a given country are likely to remain largely unchanged over the short time periods observed here and still provide a reasonable estimate of the number of positively identified cases. despite their similar predictive power, we largely focus on the quadratic model rather than the simple square model for the aforementioned reasons; we must also note that quadratic curve fitting is natively implemented in most common spreadsheet software while the simple square model is not. by monitoring the r values for the quadratic models, it is a simple task to identify when the epidemic is beginning to subside within a country (i.e. "bending the curve"). here we recommend the use of an r value of . for identifying when the rate of infection is beginning to subside, but more conservative estimates can also be made by lowering this threshold. examination of the data collected here suggests that early, aggressive measures have been most effective at reducing disease burden within a country. countries that initially adopted less stringent measures (such as the us, uk, russia, and brazil) are currently more heavily burdened than those countries that started with more intense prescriptions (such as china, south korea, australia, denmark, and vietnam). figure : while the initial part of both logistic curves is the same, the gompertz curve reaches the population carrying capacity more slowly, resulting in a long-tailed epidemic. the initial part of the gompertz curve (including time points until % of the population has been infected) was fit to the simple exponential (red dashes), quadratic (blue dashes) and simple square (green dashes) models.
it is apparent from these curves how quickly the exponential curve overestimates the rate of growth of the epidemic compared to the quadratic and simple square fit curves, and how the quadratic model more closely follows the gompertz growth curve, as evidenced by the smaller sy.x value for the quadratic fit in table . for each model, only data up to that day are used to predict the number of expected cases for the last day for which data is available (or the last day before significant curve deviation is observed, see figure ). days on which the fit was not statistically sound were omitted from the graph. table : the fit parameters for the development of the early portion of the sars-cov- epidemic in countries for the exponential, quadratic, simple square, simple exponential, and gompertz growth models as calculated for each individual day during the early portion of the epidemic. a the fit equations for each are as follows: simple exponential: , where n is the total number of cases, t is the time in days, n is the initial seeding population of the epidemic, nm is the population carrying capacity (the amount of the population that must be infected to achieve herd immunity), and a, b & c are the standard quadratic terms (or for the simple square model equation). additionally, the number of days of data used in the fitting, the r , sum of squares, and sy.x statistical values are given. for the gompertz growth model, an adequate fit could not be achieved for brazil or denmark, and this is indicated by figure : the change of the exponential rate term (k) over time for each of the countries. it can be clearly seen that k is generally decreasing over time, often on each day but sometimes after an initial period of increase.
this indicates that the exponential rate is regularly decreasing, as expected in a situation where the growth resource is diminishing, and as expected for the logistic family of models, including the generalized logistic and gompertz growth models. coronavirus disease (covid- ) situation report - : world health organization covid- : epidemiology, evolution, and cross-disciplinary perspectives the epidemiology and pathogenesis of coronavirus disease (covid- ) outbreak estimating the asymptomatic proportion of coronavirus disease (covid- ) cases on board the diamond princess cruise ship estimation of the asymptomatic ratio of novel coronavirus infections (covid- ) real estimates of mortality following covid- infection impact assessment of non-pharmaceutical interventions against coronavirus disease and influenza in hong kong: an observational study the covid- vaccine development landscape interventions to mitigate early spread of sars-cov- in singapore: a modelling study real-time forecasts of the covid- epidemic in china from february th to february th herd-immunity to helminth infection and implications for parasite control "herd immunity": a rough guide the reproductive number of covid- is higher compared to sars coronavirus piecewise quadratic growth during the novel coronavirus epidemic the use of gompertz models in growth analyses, and new gompertz-model approach: an addition to the unified-richards family dynamics of tumor growth the impact of a physical geographic barrier on the dynamics of measles can china's covid- strategy work elsewhere? effective containment explains subexponential growth in recent confirmed covid- cases in china covid- national emergency response center, epidemiology & case management team, korea centers for disease control and prevention.
contact transmission of covid- in south korea: novel investigation techniques for tracing contacts using social and behavioural science to support covid- pandemic response image data collection: prospective predictions are the future cohort studies: prospective versus retrospective rational evaluation of various epidemic models based on the covid- data ofchina model selection and evaluation based on emerging infectious disease data sets including a/h n and ebola national response to covid- in the republic of korea and lessons learned for other countries the french response to covid- : intrinsic difficulties at the interface of science, public health, and policy covid- healthcare demand and mortality in sweden in response to non-pharmaceutical (npis) mitigation and suppression scenarios what policy makers need to know about covid- protective immunity high population densities catalyse the spread of covid- high temperature and high humidity reduce the transmission of covid- substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov- ) whose coronavirus strategy worked best? scientists hunt most effective policies and gompertz growth models based on days of good predictions before the target, last day of observed data, inclusive. b b good predictions are defined as the predicted result being within a factor of , predictions from - % of the actual total number of cases). thus, the quadratic model was able to predict the total number of cases in the united states in each of the days before that day, while the exponential model was only within the defined good range for the days preceding that day. 
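For reference, the candidate growth models compared above can be written down and probed numerically. The forms below are standard textbook versions (the table's exact equations did not survive extraction), and the parameter values are invented for illustration, not fitted to the paper's data:

```python
import math

# Standard textbook forms of the candidate models (assumed here; the
# table's exact equations did not survive extraction):
#   simple exponential: N(t) = N0 * exp(k * t)
#   quadratic:          N(t) = a*t**2 + b*t + c
#   Gompertz:           N(t) = Nm * exp(-b * exp(-k * t))

def gompertz(t, nm=1e5, b=8.0, k=0.1):
    """Gompertz growth with illustrative (not fitted) parameters."""
    return nm * math.exp(-b * math.exp(-k * t))

def instantaneous_rate(f, t, h=1e-5):
    """Numerical d(ln N)/dt: the effective exponential rate term k(t)."""
    return (math.log(f(t + h)) - math.log(f(t - h))) / (2 * h)

# Under Gompertz growth the effective rate is b*k*exp(-k*t), which decays
# monotonically -- matching the figure's observation that k decreases
# over the early part of the epidemic.
rates = [instantaneous_rate(gompertz, t) for t in range(1, 60, 7)]
```

Fitting these forms to real case counts would normally use a nonlinear least-squares routine; the point of the sketch is only the qualitative behavior of k(t) that distinguishes the logistic family from pure exponential growth.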
The range of minimum predictions (underpredictions) and maximum predictions (overpredictions) is also given.

SI Figure : Plots of the square root of the total number of cases (√N) for the early portion of the COVID-19 epidemic.
SI Figure : Plots of the natural log of the total number of cases (ln N) for the early portion of the COVID-19 epidemic.

key: cord- -pw ghz m authors: July, Julius; Pranata, Raymond title: Impact of the coronavirus disease pandemic on the number of strokes and mechanical thrombectomies: a systematic review and meta-analysis: COVID-19 and stroke care date: - - journal: J Stroke Cerebrovasc Dis doi: . /j.jstrokecerebrovasdis. . sha: doc_id: cord_uid: pw ghz m

Background: This systematic review and meta-analysis aimed to evaluate the impact of the coronavirus disease 2019 (COVID-19) pandemic on stroke care, including the number of stroke alerts/codes, the number of reperfusions, and the number of thrombectomies during the pandemic compared with the pre-pandemic period. Methods: A systematic literature search was performed using the PubMed, EuropePMC, and Cochrane CENTRAL databases. The data of interest were the numbers of strokes, reperfusions, and mechanical thrombectomies during the COVID-19 pandemic versus those during the pre-pandemic period (a historical comparator group over a specified period of the same length). Results: The study included , subjects from studies. Meta-analysis showed that the number of stroke alerts during the pandemic was % ( – %) of that during the pre-pandemic period. The number of reperfusion therapies during the pandemic was % ( – %) of that during the pre-pandemic period. Pooled analysis showed that the number of mechanical thrombectomies performed during the pandemic was % ( – %) of that during the pre-pandemic period. The number of mechanical thrombectomies per stroke patient was higher during the pandemic (OR . [ . – . ], p < . ; I²: %, p = . ).
Conclusion: This meta-analysis showed that the numbers of stroke alerts, reperfusions, and mechanical thrombectomies were reduced by %, %, and %, respectively, during the pandemic. However, the number of patients receiving mechanical thrombectomy per stroke increased.

As of July , there have been , , cases of coronavirus disease 2019 (COVID-19), which have resulted in , deaths. Although most COVID-19 patients are asymptomatic, a significant proportion develops severe manifestations that may lead to death. Patients of advanced age and with preexisting comorbidities are more likely to develop severe disease. The incidence of acute conditions such as myocardial infarction was shown to be "reduced" during the pandemic. Because of the looming possibility of in-hospital COVID-19 transmission, people are more reluctant to visit the hospital. This was reflected in a study on patients with myocardial infarction, which showed that % of patients avoided hospitals because of fear of COVID-19 transmission. Similarly, a study showed that the COVID-19 pandemic disrupted prehospital and in-hospital care, resulting in a significant drop in admissions, thrombolysis, and thrombectomy. However, other factors, such as decreased pollution during the pandemic, may decrease myocardial infarction or stroke. Delays in the prehospital and in-hospital chain may prolong the time to presentation beyond the golden hour for reperfusion, which may reduce the numbers of reperfusions and thrombectomies. Fewer reperfusions may in turn result in increased morbidity and mortality. This systematic review and meta-analysis aimed to evaluate the impact of the pandemic on stroke care, including the number of stroke alerts/codes, reperfusions, and thrombectomies during the COVID-19 pandemic compared with the pre-pandemic period.
A systematic literature search was performed using the keywords "stroke," "cerebrovascular diseases," "cerebral infarction," "brain ischemia," "stroke alert," or "thrombectomy," combined with "COVID-19," "coronavirus disease 2019," "2019-nCoV," "SARS-CoV-2," or "pandemic," in the PubMed, EuropePMC, and Cochrane CENTRAL databases. Hand-searching of potential articles cited by other studies was also used to identify studies published from January to June . This initial search was performed by two independent researchers, and discrepancies were resolved by discussion. Inclusion and exclusion criteria were then applied to the retrieved records. The inclusion criteria for this systematic review and meta-analysis were original articles, research letters, short reports, and case series containing primary data. The data of interest were the numbers of strokes, reperfusions, and mechanical thrombectomies during the COVID-19 pandemic versus the comparator period. The comparator group was a historical control over a specified period of time (of the same length) before the pandemic. The exclusion criteria were preprints, case reports, review articles, and articles in a non-English language. Data extraction and quality assessment were performed by two independent researchers using extraction forms capturing the first author, year of publication, sample size, study design, number of stroke alerts/codes, number of reperfusions, and number of mechanical thrombectomies. Quality assessment was performed using the Oxford CEBM critical appraisal tool. The number of reperfusions was defined as the number of patients with acute ischemic stroke receiving thrombolysis or mechanical thrombectomy, alone or with intra-arterial thrombolysis. Mechanical thrombectomy was defined as revascularization for acute ischemic stroke with either mechanical thrombectomy alone or with the addition of intra-arterial thrombolysis.
Meta-analysis of proportions was used to determine the numbers of stroke alerts/codes, reperfusions, and mechanical thrombectomies during the pandemic compared with those during the historical pre-pandemic control period. Stata (StataCorp LLC) was used to perform the meta-analysis with a random-effects model, regardless of heterogeneity. Heterogeneity was calculated using the I² statistic and Cochran's Q test, with I² > % or p < . indicating significant heterogeneity. Assessment of publication bias and meta-regression analysis were not performed because of the limited number of studies (< studies).

There were a total of , records, of which , remained after removal of duplicates. A total of , records were excluded after screening of titles and abstracts, leaving eligible studies. After the full texts were screened for eligibility, we excluded further articles for the following reasons: (1) no data on the outcome of interest (n = ), (2) preprint (n = ), or (3) the report was per , inhabitants (n = ). We included studies in our qualitative and quantitative analyses [Figure ]. There were a total of , subjects from studies. Characteristics of the included studies are presented in Table . The meta-analysis showed that the number of stroke alerts during the pandemic was % ( – %) of that during the pre-pandemic period [Figure ]. The number of reperfusion therapies undertaken during the pandemic was % ( – %) of that during the pre-pandemic period [Figure ]. Pooled analysis showed that the number of mechanical thrombectomies performed during the pandemic was % ( – %) of that during the pre-pandemic period [Figure ]. However, the number of mechanical thrombectomies per stroke patient was higher during the pandemic (OR . [ . – . ], p < . ; I²: %, p = . ). A meta-analysis of studies showed that the numbers of stroke alerts/codes, reperfusions, and mechanical thrombectomies were lower during the pandemic period than during the pre-pandemic period.
However, the number of patients receiving mechanical thrombectomy per stroke increased. The number of strokes was lower during the pandemic; this might be explained by hospital avoidance due to fear of contracting COVID-19. Nevertheless, further research is needed to better understand the reasons for not seeking care. It should be noted that COVID-19 is associated with coagulopathy and may increase the risk of stroke. Although the number of COVID-19-related stroke cases is unclear, given non-compliance/non-adherence to chronic medications (due to accessibility problems or unproven drug rumours) and more sedentary lifestyles, the incidence of stroke might actually be higher than usual. The reduction in the number of strokes in this pooled analysis therefore reflects an unmet need for medical attention. However, reduced pollution due to lockdown and other restrictions, and shifts in dietary patterns such as decreased consumption of high-sodium fast food, may lead to a decrease in acute cardio-cerebrovascular events related to pollution. Disruption of prehospital and in-hospital care may delay onset-to-door time. Teo et al. showed that only . % of patients were attended to within . hours during the pandemic, compared with . % during the pre-pandemic period. Delayed presentation may reduce the amount of salvageable tissue and the number of patients eligible for thrombolysis. Our meta-analysis showed that the number of reperfusion episodes during the pandemic was only two-thirds of that during the pre-pandemic period. Our result strengthens the conclusion of prior observations that showed a marked fall in stroke presentations and services due to COVID-19 in April . Mechanical thrombectomy, which carries a risk of transmission to health care workers, has become the procedure of choice in late presenters.
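The pooling step described in the methods (a random-effects meta-analysis with I² heterogeneity) can be sketched generically. The paper's analysis was done in Stata; the snippet below is an illustrative DerSimonian–Laird implementation on invented pandemic vs. pre-pandemic counts, not the authors' code or data:

```python
import math

# Hypothetical per-study counts (pandemic, pre-pandemic); the real
# study-level data are in the paper's table.
studies = [(90, 120), (60, 100), (110, 130), (45, 80)]

# Effect size: log ratio of pandemic to pre-pandemic counts, with an
# approximate variance of 1/a + 1/b for a ratio of two Poisson counts.
effects = [(math.log(a / b), 1 / a + 1 / b) for a, b in studies]

# Fixed-effect weights, Cochran's Q, and the DerSimonian-Laird tau^2.
w = [1 / v for _, v in effects]
y_fixed = sum(wi * yi for wi, (yi, _) in zip(w, effects)) / sum(w)
q = sum(wi * (yi - y_fixed) ** 2 for wi, (yi, _) in zip(w, effects))
df = len(studies) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooled estimate and I^2 heterogeneity.
w_re = [1 / (v + tau2) for _, v in effects]
y_re = sum(wi * yi for wi, (yi, _) in zip(w_re, effects)) / sum(w_re)
i2 = max(0.0, 100 * (q - df) / q) if q > 0 else 0.0

# Pandemic counts as a fraction of pre-pandemic counts.
pooled_ratio = math.exp(y_re)
```

With these made-up counts the pooled ratio comes out around 0.7, i.e., pandemic volumes at roughly two-thirds of the pre-pandemic level, which is the kind of summary the paper reports for reperfusion therapies.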
The absolute number of mechanical thrombectomies performed was reduced; however, the number of patients receiving thrombectomy per patient with a stroke actually increased. This is possibly because of an increase in late presenters. Another possibility is a difference in the proportion of patients with large-vessel occlusion eligible for thrombectomy, or a greater tendency for more severe strokes (which are also candidates for thrombectomy) to present to the emergency department during the pandemic compared with less severe strokes. Regardless of stroke severity, people experiencing symptoms of cardio-cerebrovascular disease should be encouraged to seek emergency medical care. A limitation of this systematic review and meta-analysis is that the limited data on clinical outcomes prevented us from calculating the impact of the pandemic in terms of morbidity and mortality. Out-of-hospital deaths in specific regions, which might provide insight into patients with unmet medical needs, could not be addressed in this meta-analysis. There might be an increasing trend in mechanical thrombectomy due to advances in technology, and an increase or decrease in the real stroke incidence. There are no data on other possible factors, such as pollution and dietary patterns, that may affect the incidence of stroke. Data on the proportion of patients with large-vessel occlusion eligible for thrombectomy during the pandemic compared with the pre-pandemic period are not thoroughly available. This meta-analysis showed that the numbers of stroke alerts/codes, reperfusions, and mechanical thrombectomies were reduced by %, %, and %, respectively, during the pandemic. However, the number of patients receiving mechanical thrombectomy per stroke increased. *Data were not presented by grouping of the pandemic period vs. the pre-pandemic period.

References:
World Health Organization.
Coronavirus disease (COVID-19) situation report.
Impact of cerebrovascular and cardiovascular diseases on mortality and severity of COVID-19: systematic review, meta-analysis, and meta-regression.
Effect of chronic obstructive pulmonary disease and smoking on the outcome of COVID-19.
Hypertension is associated with increased mortality and severity of disease in COVID-19 pneumonia: a systematic review, meta-analysis and meta-regression.
Diabetes mellitus is associated with increased mortality and severity of disease in COVID-19 pneumonia: a systematic review, meta-analysis, and meta-regression.
Lymphopenia in severe coronavirus disease 2019 (COVID-19): systematic review and meta-analysis.
Elevated N-terminal pro-brain natriuretic peptide is associated with increased mortality in patients with COVID-19: systematic review and meta-analysis.
The COVID-19 pandemic and the incidence of acute myocardial infarction.
Impact of COVID-19 pandemic on ST-elevation myocardial infarction in a non-COVID-19 epicenter.
Impact of the COVID-19 epidemic on stroke care and potential solutions.
Fewer hospitalizations for acute cardiovascular conditions during the COVID-19 pandemic.
A time-to-event analysis on air pollutants with the risk of cardiovascular disease and mortality: a systematic review and meta-analysis of cohort studies.
Mechanical thrombectomy for acute ischemic stroke amid the COVID-19 outbreak.
Has COVID-19 played an unexpected "stroke" on the chain of survival?
Impact of the COVID-19 outbreak on acute stroke pathways: insights from the Alsace region in France.
Acute stroke care is at risk in the era of COVID-19: experience at a comprehensive stroke center in Barcelona.
Delayed presentation of acute ischemic strokes during the COVID-19 crisis.
Delays in stroke onset to hospital arrival time during COVID-19.
Impact of the COVID-19 pandemic on the process and outcome of thrombectomy for acute ischemic stroke.
Need for ensuring care for neuro-emergencies: lessons learned from the COVID-19 pandemic.
C-reactive protein, procalcitonin, D-dimer, and ferritin in severe coronavirus disease 2019: a meta-analysis.
Multiorgan failure with emphasis on acute kidney injury and severity of COVID-19: systematic review and meta-analysis.
The use of renin-angiotensin system inhibitors and mortality in patients with coronavirus disease 2019 (COVID-19): a systematic review and meta-analysis.
A wave of non-communicable diseases following the COVID-19 pandemic.
Acute stroke intervention.
Approaches to global stroke care during the COVID-19 pandemic.
COVID-19 and stroke: a global World Stroke Organization perspective.
Mechanical thrombectomy in the era of the COVID-19 pandemic: emergency preparedness for neuroscience teams.
What has caused the fall in stroke admissions during the COVID-19 pandemic?

The authors declare that they have no competing interests.

key: cord- -bpv p zo authors: Pequeno, Pedro; Mendel, Bruna; Rosa, Clarissa; Bosholn, Mariane; Souza, Jorge Luiz; Baccaro, Fabricio; Barbosa, Reinaldo; Magnusson, William title: Air transportation, population density and temperature predict the spread of COVID-19 in Brazil date: - - journal: PeerJ doi: . /peerj. sha: doc_id: cord_uid: bpv p zo

There is evidence that COVID-19, the disease caused by the betacoronavirus SARS-CoV-2, is sensitive to environmental conditions. However, such conditions often correlate with demographic and socioeconomic factors at larger spatial extents, which could confound this inference.
We evaluated the effect of meteorological conditions (temperature, solar radiation, air humidity, and precipitation) on daily records of the cumulative number of confirmed COVID-19 cases across the Brazilian capital cities during the first month of the outbreak, while controlling for an indicator of the number of tests, the number of arriving flights, population density, the proportion of elderly people, and average income. Apart from increasing with time, the number of confirmed cases was mainly related to the number of arriving flights and population density, increasing with both factors. However, after accounting for these effects, the disease was shown to be temperature-sensitive: there were more cases in colder cities and on colder days, and cases accumulated faster at lower temperatures. Our best estimate indicates that a °C increase in temperature has been associated with a % decrease in confirmed cases. The quality of the data and remaining unknowns limit the analysis, but the study reveals an urgent need to understand more about the environmental sensitivity of the disease in order to predict demands on health services in different regions and seasons.

The disease COVID-19, caused by the betacoronavirus SARS-CoV-2, has caused panic throughout the world by overwhelming medical services in many countries, leading to deaths that might have been avoided had patients had access to intensive-care units (ICUs). This has led to an unprecedented collaboration within and among countries to slow the spread of the disease, principally through social distancing (Ebrahim et al., ; Wilder-Smith & Freedman, ). While it is not clear how much present policies will reduce overall infection rates of SARS-CoV-2, there is consensus that slowing the spread of the disease will save lives by matching patient demand to the capacity of health systems (Walker et al., ).
The strategies of social isolation applied in countries on all continents have given authorities time to undertake interventions to strengthen their health systems, and one of the main actions is to estimate the number of COVID-19 cases in each region (Anderson et al., ; Walker et al., ). This information is essential for scaling the number of ICUs to the number of critically ill patients, who normally require supportive lung ventilation. Brazil has a large per-capita number of ICUs in comparison with Europe, but those units are not evenly distributed among regions, with more ICUs per capita in southern states than in northern regions, leaving many Brazilians far from the nearest ICU (Rhodes & Moreno, ). Moreover, the country's large demographic and socioeconomic discrepancies create significant variation in susceptibility to infectious diseases (Barreto et al., ). One of the problems in predicting the demand for hospital services is that the disease is new, so its behavior is still poorly understood, and the virus may be evolving rapidly (Zhao et al., ; Yang et al., ; Morais et al., ). Therefore, models developed in one country may give poor predictions in another. Habitat-specificity modeling suggests that the spread of SARS-CoV-2 may be related to environmental conditions, especially temperature and humidity (Sajadi et al., ; Wang et al., ). Further, at the host level, there is circumstantial evidence that COVID-19 is related to a shortage of vitamin D, which could result from limited exposure to solar radiation (Grant et al., ). Indeed, it has been suggested that solar radiation might deactivate the virus (Poole, ). Although preliminary, these results suggest a plethora of mechanistic processes linking weather and virus spread that need to be better understood. Brazil is one of the largest countries in the world, spanning both hemispheres, with latitudes varying from °N to °S.
This means that climatic conditions vary greatly, and simple models that do not take into account the possible environmental sensitivity of COVID-19 might not adequately predict when and where there will be the greatest demand for health services in Brazil (Fig. ). One difficulty in quantifying this sensitivity is that climate is likely to correlate with demographic and socioeconomic factors across larger spatial extents. Thus, environmental effects could be confounded unless risk factors for viral spread are taken into account, such as population density, transport connectivity, and economic status (Poole, ; Wang et al., ; Ribeiro et al., ). In an attempt to determine whether environmental variables have significant effects on the propagation of COVID-19, we modeled the daily cumulative number of confirmed cases among Brazilian capital cities in relation to meteorological variables during the first month of the disease in the country, while controlling for several demographic and socioeconomic factors. We used only capital cities because they are presently the only reliable sources of COVID-19 case data and represent much of the climatic variation within Brazil. Data on connectivity and case frequency are not presently adequate to model the spread of the disease at the municipal level, but they should become available in the future and can then be used to test our hypotheses.

We obtained daily cumulative counts of confirmed COVID-19 cases for each of the Brazilian state capital cities, as reported by state health secretariats and compiled by volunteers (Table S ; Secretarias de Saúde das Unidades Federativas, ).

Figure : Confirmed COVID-19 cases as of March (n = ) (Secretarias de Saúde das Unidades Federativas, ), superimposed on the country's thermal variability. Temperature data represent means for March over – (Fick & Hijmans, ) and were used only in the map; the actual analyses used current, daily meteorological data.
We focused on the month following the first confirmed case, from February to March , for which there were reports of daily counts across cities. We considered several potential predictors of the number of confirmed cases. First, it was important to account for the number of tests for COVID-19, as performing more tests tends to reveal more positive cases (Roser et al., ). The Brazilian government has not been systematically reporting the number of tests performed, but it has recommended testing of all suspected patients with severe symptoms, and the Ministry of Health reported the number of suspected cases per state until March . Therefore, we used the number of suspected cases per state on that date as a proxy for the number of tests, under the reasonable assumption that states with more suspected cases performed more tests. Further, we considered the following predictors: (1) time in days, to account for the exponential growth in case numbers during this period (Fig. ); (2) the number of arriving flights in the city's metropolitan area, as airline connections can facilitate the spread of the virus (Ribeiro et al., ); (3) city population density, to account for facilitation of transmission at higher densities (Poole, ); (4) the proportion of elderly people (≥ years old) in the population, assuming that the elderly may be more likely to show severe symptoms of SARS-CoV-2 infection and, thus, to be diagnosed with COVID-19; (5) mean citizen income, which may affect the likelihood of being infected, for example through limited access to basic sanitation or limited capability for social isolation; and (6) the following meteorological variables: mean daily temperature (°C), mean daily solar radiation (kJ/m²), mean daily relative humidity (%), and mean daily precipitation (mm). The number of suspected cases and the socioeconomic variables varied only across cities, whereas the meteorological variables varied both between and within cities.
Data on population density, the elderly, and income were obtained for the last quarter of from the Brazilian Institute for Geography and Statistics (IBGE), which samples Brazilian households quarterly for socioeconomic indicators (SIDRA, ) (Table S ). Flight data were obtained from the current statistical annuary of the Brazilian agency for civilian aviation (ANAC) (Table S ; ANAC, ). Hourly meteorological data were obtained from the automatic stations maintained by the Brazilian institute for meteorology (INMET) in the capital cities (Table S ; INMET, ).

We investigated the response of case counts to the putative predictors using a generalized linear mixed model (GLMM) assuming Poisson-distributed errors and a log link, with capital-city identity as a random factor to account for autocorrelated errors within cities. This formulation induces a compound-symmetry correlation structure on residuals within cities, which is mathematically equivalent to that of classical "repeated measures" linear models (Zuur et al., ). The numbers of suspected cases and arriving flights were log-transformed to account for their highly skewed distributions, and all predictors were scaled to zero mean and unit standard deviation to facilitate parameter estimation. Consequently, the estimated coefficients were scaled, providing a measure of the predictors' relative importance. Specifically, given the log link, a change of one unit in the original (unscaled) predictor implies a mean percent change of (exp(coefficient/SD) − 1) × 100 in the number of confirmed cases, where "coefficient" and "SD" are the predictor's model coefficient and standard deviation, respectively.

We considered time lags in the effect of meteorological conditions. The incubation time of COVID-19 averages days (Lauer et al., ), and case confirmation in Brazil has taken from several days to weeks owing to the overload of test laboratories. Therefore, the time between infection and case confirmation is likely to be longer than a week.
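The percent-change interpretation of scaled, log-link coefficients can be wrapped in a small helper. The paper's analysis was done in R; this Python check is illustrative only, and the numbers used are hypothetical, not the paper's estimates:

```python
import math

def percent_change(coef, sd):
    """Mean % change in expected counts per one unit of the *unscaled*
    predictor, for a log-link model fitted on a predictor scaled to
    unit standard deviation: (exp(coef / sd) - 1) * 100."""
    return (math.exp(coef / sd) - 1) * 100

# E.g., a hypothetical scaled temperature coefficient of -0.5 with an
# original SD of 3 degrees C implies roughly a 15% drop in expected
# cases per additional degree.
effect = percent_change(-0.5, 3.0)
```

As a sanity check, a coefficient of ln 2 on a predictor with SD 1 gives exactly +100%, i.e., a doubling of expected counts per unit.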
Accordingly, we considered a set of models including all predictors but varying the number of days by which the meteorological predictors were lagged relative to case counts, ranging from to days in daily steps. The models were then compared with Akaike's information criterion (AIC), a standard measure of relative model support, and the model with the lowest AIC was judged the most supported. The precipitation time series had missing intervals for some capital cities. Therefore, we performed two versions of the analysis: one including all predictors but excluding days for which precipitation was lacking (n = ), and another excluding precipitation as a predictor and using all counts of confirmed COVID-19 cases (n = ). Because both analyses produced largely similar results, with precipitation having a negligible model coefficient (Figs. S and S ; Table S ), we present the analysis using the larger sample. Lastly, we considered possible interactions between time and the other predictors, on the assumption that some factors could accelerate the temporal increase in the number of confirmed cases. By definition, GLMMs assuming a log link implicitly account for interactive effects to some degree, as log-linear models imply multiplicative effects. Still, we ran a separate GLMM that explicitly included interaction terms between time and the remaining predictors. To avoid model overparameterization, we used only the significant predictors identified in the previous analysis. For all models, we computed the conditional predictive power (R²c), which indicates the variance explained jointly by the predictors and the random factor, and the marginal predictive power (R²m), which considers only predictor effects. In these calculations, only significant predictors were retained in the model, to avoid inflation of the explained variance by spurious parameters.
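The lag-selection step (one model per candidate lag, keeping the lowest AIC) can be illustrated on toy data. This is not the paper's GLMM: for simplicity, a fixed-effects Poisson log-likelihood is evaluated at the generating coefficients rather than refitted, which is enough to show AIC ranking the candidate lags. All numbers are invented:

```python
import math

# Toy data: expected daily counts driven by temperature lagged 5 days.
temp = [20 + 5 * math.sin(t / 7) for t in range(60)]
true_lag = 5
counts = [round(math.exp(11.0 - 0.5 * temp[t - true_lag]))
          for t in range(10, 60)]

def poisson_loglik(beta0, beta1, lag):
    """Poisson log-likelihood of the counts under a log-linear model
    with temperature lagged by `lag` days."""
    ll = 0.0
    for i, t in enumerate(range(10, 60)):
        mu = math.exp(beta0 + beta1 * temp[t - lag])
        y = counts[i]
        ll += y * math.log(mu) - mu - math.lgamma(y + 1)
    return ll

# AIC = 2k - 2 ln L, with k = 2 parameters for every candidate, so the
# ranking reduces to the likelihood; the model at the true lag wins.
aic = {lag: 2 * 2 - 2 * poisson_loglik(11.0, -0.5, lag)
       for lag in range(11)}
best_lag = min(aic, key=aic.get)
```

The design choice mirrors the paper's procedure: because every candidate model has the same number of parameters, AIC differences here come entirely from how well the lagged predictor explains the counts.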
We acknowledge that it would be better to model the spread of SARS-CoV-2 directly, but we cannot do that without making assumptions about the relationship between infection by the virus and the appearance of symptoms of the disease, which may itself be related to the factors we are investigating. All analyses were performed in R (R Core Team, ), with the aid of the packages "coronabr" (Mortara, Sánchez-Tapia & Martins, ) and "covid br" (Paterno, ) for assessing counts of suspected cases by state, "lme " for the GLMM (Bates et al., ), "MuMIn" for the AIC and R² calculations (Bartoń, ), and "visreg" for visualization of predictor effects (Breheny & Burchett, ).

There was strong support for the model whose meteorological predictors were lagged by days, as indicated by its much lower AIC (Fig. ). According to this model, the only significant predictors of the number of confirmed COVID-19 cases were time, the number of arriving flights, population density, and temperature (Table ). The number of confirmed cases increased with time (Fig. A), the number of arriving flights (Fig. B), and population density (Fig. C), whereas it decreased with temperature (Fig. D). Because the model coefficients were scaled, comparing their values indicates the relative importance of each predictor. Accordingly, time, the number of arriving flights, and population density had the strongest effects (i.e., the largest coefficients), followed by temperature (Table ). Nevertheless, a change of °C predicted a decrease in the number of confirmed cases of (exp(− . / . ) − 1) × 100 = %, independently of other factors (Table ). The significant predictors explained % of the variance in the daily counts of confirmed COVID-19 cases across capital cities in Brazil.
Explicitly accounting for interactions between time and the remaining predictors identified in the previous analysis suggested significant interactions between time and the number of arriving flights, and between time and temperature, although the magnitudes of the interaction coefficients were low (Table ). On average, the temporal increase in confirmed COVID-19 cases began earlier in cities with more flights, causing a leftward shift in the relationship between confirmed cases and time (Fig. A). In parallel, the number of confirmed cases increased faster at lower temperatures, causing a steeper slope in the relationship between confirmed cases and time (Fig. B). However, these effects were relatively weak, and there was no improvement in predictive power (Table ). Thus, the previous, simpler model captured the main patterns in the data very well.

Table : Results of the most supported generalized linear mixed model (GLMM) testing for independent effects on daily cumulative counts of confirmed COVID-19 cases across the capital cities in Brazil (n = ; R²c = . ; R²m = . ). The model assumed Poisson-distributed errors and a log link, and used capital-city identity as a random factor to account for autocorrelated errors of the time series within cities. All predictors were scaled to zero mean and unit standard deviation. SD indicates the predictor's standard deviation; numbers in bold represent statistically significant effects (p < . ). The variables were as follows: time, time elapsed in days; log suspected, log-transformed number of suspected COVID-19 cases on March ; log flights, log-transformed number of arriving flights; density, inhabitants per km²; elderly, proportion of elderly people (≥ years old); income, mean citizen income (R$); temperature, mean daily temperature (°C) with a -day lag; radiation, mean daily solar radiation (kJ/m²) with a -day lag; humidity, mean daily air humidity (%) with a -day lag.
(table columns: sd, coefficient, z, p.) our results indicate that the number of confirmed covid-19 cases in brazil has been higher and has begun to increase earlier in cities receiving more flights, consistent with the expected role of air-transport connections in spreading the virus across the country (ribeiro et al., ) . further, there have been more cases in cities with higher population density, consistent with the expected role of host density (poole, ) . we also have uncovered a temperature response with a lag of days: there have been more confirmed cases in colder cities, and confirmed cases have accumulated faster on lower-temperature days. although correlative, these patterns were independent of several demographic and socioeconomic factors and, thus, are unlikely to be confounded by them. this is disturbing because there is little that authorities can do about this relationship, whereas the number of arriving flights and population density can be manipulated indirectly by isolation strategies.
figure caption (y-axis: log number of confirmed covid-19 cases, partial residuals): see table . the model assumed poisson-distributed errors and a log link, and included capital city identity as a random factor to account for autocorrelated errors in time series within cities. each point represents a daily observation in a given city (n = ); lines represent predicted means for each group of observations, as indicated by the legends. group medians were chosen based on their respective predictor ranges (see fig. ). plots use partial residuals of the response variable and thus show the effect of a given interaction while controlling for the effects of the remaining predictors.
table: results of the generalized linear mixed model (glmm) testing for interaction effects on daily cumulative counts of confirmed covid-19 cases in brazil (n = ; r²c = . ; r²m = . ). the model assumed poisson-distributed errors and a log link, and used capital city identity as a random factor to account for autocorrelated errors of time series within cities. only statistically significant predictors in table were used, all of which were scaled to zero mean and unit standard deviation. sd indicates predictor standard deviation; numbers in bold represent statistically significant effects (p < . ). variables were as follows: time: time elapsed in days; log flights: log-transformed number of arriving flights in ; density: inhabitants per km²; temperature: mean daily temperature (°c) with a -day lag.
the temperature dependence of covid-19 in brazil agrees with data from china, where warmer weather also seemed to limit the spread of covid-19 while controlling for population density and per capita gdp. it also agrees with the thermal dependence of viability and transmission demonstrated experimentally for better-studied viruses, such as influenza (lowen et al., ) and other betacoronaviruses, for example, sars-cov (chan et al., ) and mers-cov (van doremalen, bushmaker & munster, ) . while the precise mechanism underlying this pattern requires further study, it may be related to the lipid bilayer of coronaviruses, which becomes increasingly unstable as temperature increases (schoeman & fielding, ) . recognizing that we are talking about the rate of spread and not necessarily final mortality rates, the information is still important for authorities trying to predict demands on health services. our best estimate is that a rise of about °c in mean daily temperature reduces the number of covid-19 cases by about %, independently of other factors. thus, for instance, our results indicate that cuiabá, with a mean july temperature of about . °c, and porto alegre, with a mean july temperature of . °c, may differ by up to % in the number of covid-19 cases, all else being equal. it also means that the spread in porto alegre might be % lower in the middle of march (mean daily temperature of
°c) than it will be in the middle of july (mean daily temperature of . °c). at the same time, and contrary to some suggestions (poole, ; wang et al., ; grant et al., ) , we found no evidence for effects of solar radiation or humidity. perhaps such conditions are not limiting for the virus or the disease under the climatic conditions of brazil. also, rapid evolution of the climatic niche of sars-cov-2 could have a similar effect. although the mutation rate of sars-cov seems to be moderate compared to that of other rna viruses (zhao et al., ; yang et al., ) , clustering of hundreds of worldwide sars-cov-2 genomes based on widely shared polymorphisms suggests six subtypes, all of which harbor amino acid replacements that may have phenotypic effects (morais et al., ) . whether the temperature effect is related to the rate of spread of sars-cov-2 or to the proportion of persons who suffer reportable symptoms cannot be answered with the data being provided at the moment (fasina, ) . that would require universal testing for the presence of the virus, which is not presently viable but may become a necessity in the following months. also, one of the main difficulties we encountered was the lack of systematization of current information, since much of the data generated daily is still scattered and difficult to access. for instance, the number of tests performed, which is key to estimating the rate of infection and the number of infected patients, is not available. the large number of publications on the subject in the last days shows that the scientific community is prepared for a quick response, as long as there is systematization and transparency of information regarding the number of tests being performed, the number of suspected cases, the number of infected, the number of deaths, etc. our models are necessarily simple and have limitations.
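the percentage comparisons above (whose numeric values were lost in extraction here) follow directly from the poisson log link: a temperature coefficient b implies a multiplicative change exp(b·ΔT) in expected counts for a ΔT rise. a small sketch with a purely hypothetical coefficient:

```python
import math

def pct_change_per_delta(b, delta_T):
    # percent change in expected case counts for a delta_T (°C) rise, given a
    # log-link temperature coefficient b per °C (hypothetical value)
    return (math.exp(b * delta_T) - 1.0) * 100.0

b = -0.05  # hypothetical: each +1 °C multiplies expected counts by exp(-0.05)
print(round(pct_change_per_delta(b, 1.0), 1))   # about -4.9 (% for +1 °C)
print(round(pct_change_per_delta(b, 10.0), 1))  # about -39.3 (% for +10 °C)
```

note that the effect compounds multiplicatively, so a ten-degree difference between cities is far larger than ten times the one-degree effect would suggest on a linear scale.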
most importantly, we need city and state administrations to provide the number of performed tests on a regular basis, so that this variable can be explicitly accounted for in the model. the data do not allow us to investigate complex nonlinear effects, which likely would require data on temperatures beyond those observed in brazilian cities in march. also, it is currently not possible to account for potential interactions between covid-19 and other diseases, particularly influenza, which is seasonal. it may not be necessary to worry about this in the northern hemisphere, because the peaks in covid-19 will occur after the peaks in seasonal influenza. however, the predicted peaks in covid-19 in the southern hemisphere will occur concomitantly with peaks in seasonal influenza (nelson et al., ; viboud, alonso & simonsen, ) . the effects may just be additive, but it is not known whether simultaneous infection will increase the severity of covid-19 and therefore the demand for icus, and this interaction also could be temperature sensitive. further, air pollution is known to increase susceptibility to viral respiratory infections (ciencewicki & jaspers, ) , but the extent to which it affects the prevalence of covid-19 is unclear. as covid-19 consolidates in different cities, it will be possible to reduce uncertainties in relation to the role of temperature and other factors. nonetheless, our models still performed well as judged by their predictive power, even when ignoring interactions between predictors. we stress that the temperature effect does not mean that the northern, warmer regions of brazil should expect fewer complications in their health care systems, because such regions also have poorer socioeconomic and sanitary conditions (barreto et al., ) , and icus are concentrated in southern regions (rhodes & moreno, ) .
although we found no evidence for an effect of income on the number of confirmed covid-19 cases, this variable is related to the capacity of cities to respond to the pandemic. furthermore, apart from elapsed time, the predictors with the largest standardized coefficients were the number of arriving flights and population density (table ) . indeed, manaus, the largest city in northern brazil, was the first brazilian city to declare the collapse of its health system, early in april , which is consistent with its large number of arriving flights and high population density but relatively low number of icus (rhodes & moreno, ) . by contrast, although the southern, colder regions have a higher density of icus, their situation could be aggravated if social isolation measures are not effectively adopted before and maintained throughout winter in those regions (from june to september). this should be especially important for "favelas", that is, poorer, highly populated neighborhoods with deficient infrastructure, which are presumably at high risk of infection. thus, we do not present our results as an indication of how hospital demand should be calculated, but as a warning that models for brazil need to take predicted temperatures into account. declared a pandemic by the world health organization (who), the covid-19 disease has changed human behavior and strongly affected health systems and the economy worldwide. in an extremely demanding scenario, optimizing the distribution of resources is an essential task. brazil and other countries are starting to discuss the flexibilization of social distancing policies, as the latter could have important economic costs. however, we need to understand how and when to implement such decisions in order to prevent new, uncontrolled disease outbreaks that may overcrowd the health care system again and generate even higher economic costs in the near future.
our results suggest that, along with arriving flights and population density, temperature should be taken into account to estimate the number of cases of covid-19, especially with winter approaching in the southern hemisphere.
agência nacional de aviação civil
how will country-based mitigation measures influence the course of the covid-19 epidemic?
successes and failures in the control of infectious diseases in brazil: social and environmental context, policies, interventions, and research needs
mumin: multi-model inference. r package version . .
fitting linear mixed-effects models using lme4
visualization of regression models using visreg
the effects of temperature and relative humidity on the viability of the sars coronavirus
air pollution and respiratory viral infection
covid-19 and community mitigation strategies in a pandemic
novel coronavirus (2019-ncov) update: what we know and what is unknown
impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand
worldclim 2: new 1-km spatial resolution climate surfaces for global land areas
evidence that vitamin d supplementation could reduce risk of influenza and covid-19 infections and deaths
the incubation period of coronavirus disease 2019 (covid-19) from publicly reported confirmed cases: estimation and application
influenza virus transmission is dependent on relative humidity and temperature
the global population of sars-cov-2 is composed of six major subtypes
coronabr: download de dados do coronavírus
stochastic processes are key determinants of short-term evolution in influenza a virus
covid19br: an r-package with updated data on the number of coronavirus (covid-19) cases in brazil
seasonal influences on the spread of sars-cov-2 (covid-19), causality, and forecastability
r: a language and environment for statistical computing. vienna: r foundation for statistical computing
severe airport sanitarian control could slow down the spreading of covid-19 pandemics in brazil
intensive care provision: a global problem
coronavirus disease (covid-19): statistics and research
temperature, humidity, and latitude analysis to predict potential spread and seasonality for covid-19
coronavirus envelope protein: current knowledge
dados diários mais recentes do coronavírus por município brasileiro
pesquisa nacional por amostra de domicílios contínua trimestral: tabela : população por grupo de idade
stability of middle east respiratory syndrome coronavirus (mers-cov) under different environmental conditions
influenza in tropical regions
the global impact of covid-19 and strategies for mitigation and suppression
high temperature and high humidity reduce the transmission of covid-19
isolation, quarantine, social distancing and community containment: pivotal role for old-style public health measures in the novel coronavirus (2019-ncov) outbreak
pathological findings of covid-19 associated with acute respiratory distress syndrome
the deadly coronaviruses: the sars pandemic and the novel coronavirus epidemic in china
moderate mutation rate in the sars coronavirus genome and its implications
mixed effects models and extensions in ecology with r
we are grateful to Álvaro justen and his collaborators for compiling the daily records on covid-19 cases for brazilian cities, and sara mortara, andrea sánchez-tapia, karlo martins and gustavo paterno for developing r packages to facilitate obtaining covid-19 data from the brazilian ministry of health. we thank the biodiversity research program (ppbio brasil) of the ministry of science, technology, innovation and communication (mctic) of brazil for providing the contact network that enabled quick collaboration among the researchers involved in this manuscript.
we also thank alexandre almeida, daniel pimenta and lucas bandeira for useful suggestions on preliminary analyses, and elizabeth franklin, daniela bôlla, fabíola wieckert, sérvio ribeiro and two anonymous reviewers for useful comments on earlier versions of this manuscript. the following grant information was disclosed by the authors: ministry of science, technology, innovation and communication (mctic). cnpq. pedro pequeno conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. bruna mendel conceived and designed the experiments, performed the experiments, authored or reviewed drafts of the paper, and approved the final draft. clarissa rosa conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft. mariane bosholn conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft. jorge luiz souza conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft. fabricio baccaro conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft. reinaldo barbosa conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft. william magnusson conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft. the following information was supplied regarding data availability:the raw measurements are available in the supplemental files. supplemental information for this article can be found online at http://dx.doi.org/ . / peerj. #supplemental-information. the authors declare that they have no competing interests. key: cord- -ujh nmh authors: ben miled, s.; kebir, a. 
title: simulations of the spread of covid-19 and control policies in tunisia date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: ujh nmh we develop and analyze in this work an epidemiological model for covid-19 using tunisian data. our aims are first to evaluate tunisian control policies for covid-19, and secondly to understand the effect of different screening, quarantine and containment strategies, and the role of asymptomatic patients, on the spread of the virus in the tunisian population. with this work, we show that tunisian control policies are efficient in screening infected and asymptomatic individuals and that, if containment and curfew are maintained, the epidemic will be quickly contained. on march , , who announced that the covid-19 epidemic had passed the pandemic stage, indicating its autonomous spread over several continents. since march , tunisia has experienced a turning point and general health containment has begun. tunisia's strategy of containment and targeted screening corresponds to the first who guidelines, the aim being to detect clusters by diagnosing only suspicious persons and then to trace the people who came into contact with the positive cases. this method is now showing its limitations. the mass screening carried out in some countries shows that asymptomatic patients, or those who develop only a mild form of the disease, may exist in significant numbers. so what is the role of asymptomatic patients in the spread of the virus in the tunisian population, and are containment and mass screening strategies sufficient to control the spread of the virus? in this work, a mathematical epidemiological model for covid-19 is developed to study and predict the effect of different screening, quarantine, and containment strategies on the spread of the virus in the tunisian population. this model is more detailed than the classical sir model, but it remains very simple in its structure.
indeed, all individuals are assumed to react on average in the same way to the infection. in what follows, we present the model and its assumptions. then we calibrate the different parameters of the model based on the tunisian data and calculate the expression of the basic reproduction number r0 as a function of the model parameters. finally, we carry out simulations of interventions and compare different strategies for suppressing and controlling the epidemic. covid-19 is a respiratory disease that spreads mainly through the respiratory droplets expelled by people who cough, so transmission is usually direct from person to person. infection is considered possible even when in contact with a person with mild symptoms; in fact, in the early stages of the disease, many people with the disease have only mild symptoms. it is, therefore, possible to contract covid-19 through contact with a person who does not feel sick. subsequently, in this work, we consider susceptible individuals, noted S, who upon infection first go through a stage where they are infected but asymptomatic, noted AS for unreported asymptomatic infectious. this stage appears to be particularly important in the spread of covid-19. the individuals then develop symptoms and become symptomatic infectious: they either enter directly into a quarantine stage, noted Q, which corresponds to reported symptomatic infectious individuals, or go through a moderate or severe infectious stage, noted I for unreported symptomatic infectious, after which they may or may not enter the quarantine stage. finally, the infection ends and the individuals are either immunized, denoted R, or dead, denoted D. this life cycle can be represented using the flow chart ( ) and table , which lists the model parameters. quarantine is assumed to act on the AS and I stages: indeed, we assume that the state can detect asymptomatic individuals, for example by random screening, and that those testing positive go into quarantine.
let τ1, respectively τ2, be the quarantine rate for the AS class, respectively the I class. we assume that asymptomatic individuals, AS, progress to I at a rate β. we further assume that quarantined individuals, Q, and infected individuals, I, either die at a rate µ per unit of time or become recovered/immune, R, at a rate γ per unit of time. finally, we assume that each healthy individual is infected proportionally to the AS asymptomatic individuals, with a rate α1, and to the I infected individuals, with a rate α2. as the I state comprises the moderate or severe cases, these individuals are more contagious than those in the AS state; therefore, α1 < α2. our model then consists of a system of ordinary differential equations with an initial condition at time t = t0. one of the advantages of the basic reproduction number r0 concept is that it can be calculated from the moment the life cycle of the infectious agent is known. we calculate r0 for our model using the next-generation theorem [ ] (see section a.1):
r0 = α1 s0 / (β + τ1) + β α2 s0 / ((β + τ1)(τ2 + γ + µ)).
(this preprint is made available under a cc-by-nc-nd international license; the author/funder, who has granted medrxiv a license to display the preprint in perpetuity, is the copyright holder; this version posted may , ; doi: medrxiv preprint.) the first term of this expression corresponds to infections generated by asymptomatic types (healthy carrier to mild symptoms). the second term corresponds to secondary infections caused by moderate or severe symptomatic infections. with this expression, we can see that there are several ways to lower r0 and thus control the epidemic. for example, we can reduce the number of susceptible people (decreasing α1 and α2) by confining the population, reducing contacts, and wearing masks.
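the compartment model described above can be sketched numerically. the system below is reconstructed from the verbal description (S, AS, I, Q, R, D with rates α1, α2, β, τ1, τ2, γ, µ); all parameter values are illustrative assumptions, not the paper's fitted ones:

```python
import numpy as np
from scipy.integrate import solve_ivp

# illustrative parameters (NOT the paper's fitted values)
alpha1, alpha2 = 3e-8, 6e-8   # infection rates by AS and by I (alpha1 < alpha2)
beta = 0.2                    # AS -> I progression rate
tau1, tau2 = 0.1, 0.3         # quarantine rates of AS and I
gamma, mu = 0.1, 0.005        # recovery and death rates of I and Q

def rhs(t, y):
    S, AS, I, Q, R, D = y
    new_inf = S * (alpha1 * AS + alpha2 * I)   # new infections per unit time
    return [-new_inf,
            new_inf - (beta + tau1) * AS,       # AS: unreported asymptomatic
            beta * AS - (tau2 + gamma + mu) * I,
            tau1 * AS + tau2 * I - (gamma + mu) * Q,
            gamma * (I + Q),                    # recovered
            mu * (I + Q)]                       # dead

S0 = 11.7e6                    # roughly the population of Tunisia
y0 = [S0, 100.0, 10.0, 0.0, 0.0, 0.0]
sol = solve_ivp(rhs, (0.0, 300.0), y0, rtol=1e-8, max_step=1.0)
Q_series = sol.y[3]
print(f"peak of Q at day {sol.t[np.argmax(Q_series)]:.0f}, "
      f"final deaths {sol.y[5, -1]:.0f}")
```

the right-hand side conserves total population (all terms cancel pairwise), which gives a useful sanity check on the integration.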
we can also reduce the rate of contact with an infected person by increasing the quarantine rates (τ1 and τ2), i.e., by isolating asymptomatic or symptomatic infected persons through mass screening. the estimation of the different parameters of the model is done in three steps (see section a.2). in the first step, we estimate the start date of the epidemic, t0, the initial states AS(t0) and I(t0), as well as the infection rates α1 and α2. in the second step, we estimate the mortality rate, µ, and the recovery rate, γ. in the third step, we evaluate the parameters τ1 and τ2 by an optimization method. the program is available for download (https://github.com/mayaralatrech/covid _sasymodel.git). we used the tunisian health commission data-set of reported data (https://covid- .tn/) to model the epidemic in tunisia. it represents the daily new cases, deaths, and recoveries in tunisia. the first case was detected on march , . to estimate the initial conditions AS(t0) and I(t0) and the parameters α1 and α2, we fix s0 = , which corresponds to the total population of tunisia, and assume that the variation in S(t) is small during the period considered. we also fix the parameters β, γ, τ1 and τ2. for this estimation, we adapt the method developed by [ ] to our case. let CR(t) denote the cumulative number of reported infectious cases at time t. let us assume that CR(t) = χ1 exp(χ2 t) − χ3, with χ1, χ2 and χ3 three positive parameters that we estimate using log-linear regression on the case data (see figure and table ). we obtain the model starting time of the epidemic t0 by assuming that CR(t0) = 0, which implies t0 = (1/χ2)(ln(χ3) − ln(χ1)). for now, we assume that α2 = f α1, with f a fixed parameter bigger than 1, and let τ̃ = τ2/τ1.
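step 1 above, fitting CR(t) = χ1 exp(χ2 t) − χ3 to cumulative reported cases and reading off t0 from CR(t0) = 0, can be sketched on synthetic data; the values below are illustrative, not the tunisian counts:

```python
import numpy as np
from scipy.optimize import curve_fit

def CR(t, chi1, chi2, chi3):
    # cumulative reported cases: CR(t) = chi1 * exp(chi2 * t) - chi3
    return chi1 * np.exp(chi2 * t) - chi3

# synthetic cumulative case counts (illustrative, not the tunisian data)
rng = np.random.default_rng(1)
t = np.arange(30, dtype=float)
data = CR(t, 2.0, 0.25, 5.0) + rng.normal(scale=1.0, size=t.size)

(chi1, chi2, chi3), _ = curve_fit(CR, t, data, p0=(1.0, 0.2, 1.0))
t0 = (np.log(chi3) - np.log(chi1)) / chi2   # from CR(t0) = 0
print(round(chi2, 3), round(t0, 2))
```

with the true parameters above, t0 = (ln 5 − ln 2)/0.25, roughly day 3.7, illustrating how the extrapolated start of the epidemic precedes the first reported case.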
then, following the approach described in step 1 of section a.2, we proceed as follows. in figure , we plotted the number of individuals in quarantine without curfew predicted by the ode model, Q(t), i.e. the number of diagnosed cases per day, compared to the tunisian data. we observe that from the th day onwards, the simulated curve deviates from the observed data. this deviation is due to the epidemic control policies put in place between and march (closure of cafés and shops and the introduction of a curfew). we study the effect of the curfew installed by the tunisian state since march , , on the number of infected people reported by the state. figure shows the effect of two curfew strategies on the dynamics of the epidemic: a -hour curfew (the chosen tunisian policy) and an -hour curfew (a more restrictive policy). during the period of curfew, the rate of infection α is divided by . in figure (a), it can be seen that, for the chosen policy maintaining a -hour curfew for days, the epidemiological peak in terms of the number of reported infected is reached after about days, with a value equal to . after the peak, we observe a slow decrease in the number of reported infected persons. on the other hand, in the more restrictive case of an -hour curfew, the peak would be reached more quickly, after days, with a more rapid decrease. these values should be compared with the cases given by [ ] and with the fact that the epidemiological peak was reached around april , , after about days, with a reported infected number equal to .
in figure (b) we plot the number of deaths over time. the peak of deaths is shifted relative to the peak of infections by about days; this shift corresponds to the hypothesis that we made on the duration between the onset of symptoms and death (i.e. days). we note that the simulated death curve overestimates the observed curve. this may be due to the mortality of unreported infected individuals, so the surplus corresponds to unreported covid-19 mortality, which makes the optimization of the model's variables for deaths imprecise. moreover, the model predicts deaths at the end of the epidemic in the current case ( -hour curfew); in the case of an -hour curfew, the number of deaths would be . this information should be taken with caution, because at the time the simulations were made the number of deaths was low. finally, we notice that the proportion of undeclared cases (asymptomatic and symptomatic) ranges from % at the beginning of the epidemic to less than % at the end of the epidemic (see figure ). we study in this section the effect of more intensive mass screening, i.e. of increasing τ1 and τ2, on the basic reproduction number, r0, and on the number of declared infected, Q (see figures , ). in figure , we can see that no matter how intensively we screen, we have r0 > 1. moreover, we note that to minimize the basic reproduction number, it is necessary to increase the massive screening of the class of asymptomatic infections.
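the dependence of r0 on the screening rates τ1 and τ2 can be explored directly with the closed-form expression (a reconstruction consistent with the next-generation computation sketched in the appendix), cross-checked against the leading eigenvalue of the next-generation matrix; parameter values are illustrative:

```python
import numpy as np

def R0(alpha1, alpha2, beta, tau1, tau2, gamma, mu, S0):
    # closed form: asymptomatic term + symptomatic term
    return alpha1*S0/(beta + tau1) + beta*alpha2*S0/((beta + tau1)*(tau2 + gamma + mu))

def R0_ngm(alpha1, alpha2, beta, tau1, tau2, gamma, mu, S0):
    # largest eigenvalue of -F T^{-1} for the infected compartments (AS, I)
    F = np.array([[alpha1*S0, alpha2*S0], [0.0, 0.0]])
    T = np.array([[-(beta + tau1), 0.0], [beta, -(tau2 + gamma + mu)]])
    return np.max(np.abs(np.linalg.eigvals(-F @ np.linalg.inv(T))))

p = dict(alpha1=3e-8, alpha2=6e-8, beta=0.2, gamma=0.1, mu=0.005, S0=11.7e6)
for tau in (0.05, 0.2, 0.5):  # screen both classes harder and harder
    print(tau, round(R0(tau1=tau, tau2=tau, **p), 3))
```

with these illustrative rates, r0 decreases monotonically as the quarantine rates grow, and for the largest screening effort it drops below 1; whether that threshold is reachable in practice depends on the fitted infection rates, which is the point of the figure discussed above.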
similarly, in figure , we can see that an increase in the mass screening effort allows a more rapid decrease in the number of cases (see figure (a)). this is probably because cases that are detected earlier do not contribute to the contamination of healthy individuals. indeed, mass screening has an indirect effect on recruitment into the infected compartment. more specifically, we assume that β + τ1 does not vary when τ1 changes. consequently, the mass screening effort on asymptomatic patients cannot exceed this value, and any additional effort beyond β + τ1 will be passed on to healthy patients and therefore will be useless. we observe that the calibration of the model on the tunisian data using the metropolis-hastings (mh) algorithm gives a value of τ1 = . , which represents a very important screening effort. this would suggest that the screening strategy is very efficient. however, we did not have access to the testing campaign methodology that would have allowed us to adjust our estimates. moreover, we observe that the number of deaths at the end of the epidemic varies from to , i.e. a % decrease in the number of cases (see figure (b)). it can be noted that since april , , tunisia has succeeded in slowing down the speed of propagation thanks to containment and a curfew. with this work, we suggest that if containment and curfew are maintained, short-term projections could be more optimistic. the fact that the epidemic is quickly contained tends to show that the number of undeclared infected is low, which may suggest that our model is efficient for the evaluation of undeclared cases. in fact, we show that at the time of the epidemiological peak, the number of unreported infected persons constitutes at most / of the infected population. however, this will need to be confirmed by a field evaluation. moreover, using the tunisian data, the optimization algorithm fixes the rate at which asymptomatic individuals enter
quarantine, τ1, at a high value. this expresses the good performance of the control policy of the tunisian government. indeed, in tunisia, the control policy consists of an intense isolation campaign targeting sick individuals and their relatives. a testing effort was carried out in a targeted manner, similar to snowball sampling. this approach made it possible to achieve a major screening effort on infected and asymptomatic individuals. finally, the model was successfully able to predict the time of the peak at the end of april.
a.1 computation of the basic reproduction number r0. we use the next-generation matrix to derive the basic reproduction number r0 [ ] . in the system ( ) we have two infected compartments, represented by the second and third equations of the system. therefore, at the infection-free steady state, i.e. for small (AS, I) and s = s0, we obtain the linear epidemic subsystem. if we set x = (AS, I)^T as the vector of infected, F the matrix that represents the production of new infections and T the matrix of transfer into and out of the compartments by transmission, mortality, quarantine, and recovery, then the matrix form of the linear epidemic subsystem is dx/dt = (F + T)x, with
F = [ α1 s0, α2 s0 ; 0, 0 ],  T = [ −(β + τ1), 0 ; β, −(τ2 + γ + µ) ].
therefore, the next-generation matrix is −F T^(−1). knowing that the basic reproduction number r0 is the largest eigenvalue of the next-generation matrix, we obtain
r0 = α1 s0 / (β + τ1) + β α2 s0 / ((β + τ1)(τ2 + γ + µ)).
a.2 parameter estimation. step 1: in this part, to estimate the initial conditions AS(t0) and I(t0) and the parameters α1 and α2, we adapt the method developed by [ ] . let γ' = γ + µ and CR(t) the cumulative number of reported infectious cases at time t. let us assume that CR(t) = χ1 exp(χ2 t) − χ3, with χ1, χ2 and χ3 three positive parameters. by assuming that CR(t0) = 0, equation ( ) implies that χ1 exp(χ2 t0) = χ3, and then t0 = (1/χ2)(ln(χ3) − ln(χ1)). let us note τ̃ = τ2/τ1 and h(t) = AS(t) + τ̃ I(t). in order to simplify the calculus, we will use the normalized quantities AS/h and I/h. rewriting the third equation of ( ) in terms of h,
dI/dt = β h − (τ2 + β τ̃ + γ') I.
by assuming that I(t) = I(t0) exp(χ2 (t − t0)) and substituting in the equation above, we obtain
χ2 I(t0) = β h(t0) − (τ2 + β τ̃ + γ') I(t0),
which implies
I(t0)/h(t0) = β / (χ2 + τ2 + β τ̃ + γ').
let us assume that α2 = f α1, with f a fixed parameter bigger than 1. the parameter α1 is evaluated using AS(t) = AS(t0) exp(χ2 (t − t0)) and the second equation of ( ) at t0:
χ2 AS(t0)/h(t0) = α1 s0 (f I(t0)/h(t0) + AS(t0)/h(t0)) − (β + τ1) AS(t0)/h(t0),
so that
α1 = (χ2 + β + τ1) / ((f I(t0)/AS(t0) + 1) s0),
and therefore, using the expressions above,
α1 = (χ2 + β + τ1) / ((f β / (χ2 + τ2 + γ') + 1) s0).
step 2: we hereby propose to estimate γ and µ. we notice that r(t) = (γ/µ) d(t) for all t > t0. let ρ = µ/γ; ρ is estimated using the death and recovery data. let p be the fraction of infectious (quarantined or not) that become reported dead (i.e. 1 − p become reported recovered). thus ρ = pμ̂/((1 − p)γ̂), with 1/μ̂ the average time to death and 1/γ̂ the average time to recovery. step 3: the parameters τ1 and τ2 were estimated using the metropolis-hastings (mh) algorithm implemented in the pymcmcstat python package.
estimation of tunisia covid-19 infected cases based on mortality rate
infectious diseases of humans: dynamics and control
on the definition and the computation of the basic reproduction ratio r0 in models for infectious diseases in heterogeneous populations
understanding unreported cases in the covid-19 epidemic outbreak in wuhan, china, and the importance of major public health interventions
pymcmcstat: a python package for bayesian inference using delayed rejection adaptive metropolis
key: cord- -ncz b dl authors: caldwell, allen; hafych, vasyl; schulz, oliver; shtembari, lolian title: infections and identified cases of covid-19 from random testing data date: - - journal: nan doi: nan sha: doc_id: cord_uid: ncz b dl there are many hard-to-reconcile numbers circulating concerning covid-19. using reports from random testing, the fatality ratio per infection is evaluated and used to extract further information on the actual fraction of infections and the success of their identification for different countries. there are many hard-to-reconcile numbers circulating concerning the new corona virus. a wide range for the fraction of the population with positive tests for covid-19 has been reported, with orders-of-magnitude differences in the case fatality ratio (cfr) in different countries. getting a better understanding of the fraction of the population that has been infected with covid-19 and of its lethality is of utmost importance for guiding further actions. this paper presents an analysis of publicly available random testing data and is aimed at providing information on the number of infections and the lethality of the novel corona virus; the latter is quantified as an infected fatality ratio (ifr). we perform this study as an exercise in data analysis, without attempting to interpret or modify reported numbers. the data that we use is the data we found up until may , . one use of the analysis carried out in this note is to evaluate the probability that infected individuals were identified. this probability is important to gauge the effectiveness of proposed contact-tracing measures to control the spread of the virus. we use the extracted ifr results from the random testing data to evaluate this probability in a large sample of countries and find a wide variation in the reported fraction of infected individuals. the data from random testing that is available in some form to date (in several cases as reports in newspaper articles) are employed to address these various questions. this is not a fine-grained analysis: only the extraction of probable ranges for our quantities of interest for large populations is attempted. it is also quite preliminary, as there is limited random testing data available and the level at which this data is truly representative of the broader population is not clear. nevertheless, we find that the reported data are largely self-consistent and therefore offer some guidance. the data considered for the extraction of the ifr of covid-19 is the available data as of may , from random tests for antibodies. we assume that the presence of antibodies in previously infected people is ascertainable with high probability three weeks after infection (typical numbers reported are about %). we then take for our estimate of the ifr the number of deaths approximately one week after the reported dates of the random testing for antibodies. the choice of using one week is based on the estimate that the time lag between developing severe symptoms and death is on average approximately days (see e.g. [ ] ) and the assumption that the development of significant antibodies occurs on average some time after the development of the symptoms. our results will depend on the correctness of this assumption, and we estimate how our results could have changed if we had picked a different time lag in the discussion on systematic uncertainties in section . random testing has also been performed to detect the presence of the virus (see [ , , , ] ), but these data are more difficult to use in extracting the ifr. the reason is that the detection of the virus is limited to a time window when the virus is present in the location being tested (throat, lungs) and that only the deaths related to this time period should then be used to evaluate the ifr. we leave this analysis to a later study. seven random testing samples for antibodies were found, three of which have been well documented. these data sets are given in table and are the basis for this analysis. it is intended that this analysis will be updated as improved data becomes available.
table . data used in the analysis (see text for references). the symbol n is used to represent the number of people tested for antibodies, while k is the number of positive results. the numbers on population are from readily accessible web resources. the numbers of deaths are either from the reports summarizing the studies or from web data made available by the local authorities. locations: santa clara, chelsea, kreis heinsberg, la county, new york, miami, geneva.
data treatment. the data were taken at face value where possible. in several cases, further analysis of the reported numbers was discussed in the publications, leading to revisions of the raw numbers. we do not apply these corrections here but note the size of the effect for reference and discuss them in the section on systematic uncertainties later in the paper. in some cases, estimates were necessary since not all required information was available. the following should be taken into account: • the raw number of positive tests in santa clara was out of [ ] , which is about . %.
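the basic extraction described above, from a random antibody sample to a prevalence and then an ifr, can be sketched as follows, assuming simple binomial sampling and ignoring test sensitivity/specificity and the reweighting corrections applied in the studies; all numbers are hypothetical placeholders, not the (elided) values of table 1:

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical random-testing summary (illustrative, NOT the table's elided values)
n_tested, k_positive = 3300, 50        # antibody tests performed and positives
population, deaths = 1_900_000, 100    # region population; deaths ~1 week later

# flat Beta(1, 1) prior => Beta(k + 1, n - k + 1) posterior for seroprevalence
prev = rng.beta(k_positive + 1, n_tested - k_positive + 1, size=100_000)
infected = prev * population           # implied number of past infections
ifr = deaths / infected                # infected fatality ratio draws

lo, mid, hi = np.percentile(ifr, [2.5, 50.0, 97.5])
print(f"ifr ~ {mid:.2%} (95% interval {lo:.2%} - {hi:.2%})")
```

dividing a country's reported case count by the implied number of infections then gives the identification probability discussed above; the sampling uncertainty on the prevalence propagates directly into both quantities.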
The data that we use is the data we found up until May , . One use of the analysis carried out in this note is to evaluate the probability that infected individuals were identified. This probability is important to gauge the effectiveness of proposed contact-tracing measures to control the spread of the virus. We use the extracted IFR results from the random testing data to evaluate this probability in a large sample of countries and find a wide variation in the reported fraction of infected individuals. The data from random testing that is available in some form to date (in several cases as reports in newspaper articles) are employed to address these various questions. This is not a fine-grained analysis; only the extraction of probable ranges for our quantities of interest for large populations is attempted. It is also quite preliminary, as there is limited random testing data available and the level at which these data are truly representative of the broader population is not clear. Nevertheless, we find that the reported data are largely self-consistent and therefore offer some guidance. The data considered for the extraction of the IFR of COVID-19 is the available data as of May , from random tests for antibodies. We assume that the presence of antibodies in previously infected people is ascertainable with high probability three weeks after infection (typical numbers reported are about %). We then take for our estimate of the IFR the number of deaths approximately one week after the reported dates of the random testing for antibodies. The choice of using one week is based on the estimate that the time lag between developing severe symptoms and death is on average approximately days (see e.g. [ ]) and the assumption that the development of significant antibodies occurs on average some time after the development of the symptoms.
Our results will depend on the correctness of this assumption, and we estimate how our results could have changed if we had picked a different time lag in the discussion on systematic uncertainties in Section . Random testing has also been performed to detect the presence of the virus (see [ , , , ]), but these data are more difficult to use in extracting the IFR. The reason is that the detection of the virus is limited to a time window when the virus is present in the location being tested (throat, lungs) and that only the deaths related to this time period should then be used to evaluate the IFR. We leave this analysis to a later study. Seven random testing samples for antibodies were found, three of which have been well documented. These data sets are given in Table and are the basis for this analysis. It is intended that this analysis will be updated as improved data become available.

Table . Data used in the analysis (see text for references). The symbol n is used to represent the number of people tested for antibodies, while k is the number of positive results. The numbers on population are from readily accessible web resources. The numbers of deaths are either from the reports summarizing the studies or from web data made available by the local authorities. Rows: Santa Clara, Chelsea, Kreis Heinsberg, LA County, New York, Miami, Geneva (columns: n, k, population, deaths).

Data treatment. The data were taken at face value where possible. In several cases, further analyses of the reported numbers were discussed in the publications, leading to revisions of the raw numbers. We do not apply these corrections here but note the size of the effect for reference and discuss these in the section on systematic uncertainties later in the paper. In some cases, estimates were necessary since not all required information was available. The following should be taken into account:
• The raw number of positive tests in Santa Clara was out of [ ], which is about . %.
The authors of the study corrected the observed fraction to account for locality, sex and ethnicity and found a result of . %; i.e., a factor of two higher than the raw observed value. (We have recently become aware of a study undertaken in Spain that would result in a much higher IFR value than the values we find. At the moment, we do not have the information needed, but the results can be added to our study once a more complete set of numbers is reported.) This would lead to a lower IFR if this value were used. This correction was not taken into account in the central values extracted for this analysis but was considered as a systematic variation. The tests were performed on April , . The number of deaths one week later was , while deaths had been recorded two weeks after performing the tests. It is relevant to note that the infected fraction, being quite low, is very sensitive to the specificity of the test, which can introduce a large systematic uncertainty.
• The dates at which the tests were performed in Chelsea, Massachusetts [ ] could not be found but are assumed to be in the first week of April. The tests excluded patients who had tested positive for the virus from nasal swabs. Of the people tested, tested positive for antibodies. This large fraction of positive tests relaxes the uncertainty due to the specificity. The number of deaths was taken on April th. This could possibly be too late, resulting in an overestimate of the IFR.
• The Kreis-Heinsberg results were taken from the latest values reported in [ ]. The 'effective' number of positive tests was taken as . % of the number of individuals tested, as this is the fraction of positive tests reported in the reference. We note that this value is the result of a detailed analysis by the authors. The number of deaths in Gangelt, the town in which the tests were performed, was at the time of the tests and increased by in a two-week follow-up period.
using this larger number of deaths would result in a % increase in the ifr value from this particular study. however, since the numbers are quite small this change is already accounted for in the large uncertainties associated with our extracted ifr (see next section). • the number of positive tests in los angeles county were estimated based on the reported fraction of positive tests in [ ], where . % of the tested cases were reported to have the antibody to the virus. we estimated the number of positive tests based on this value. the tests were performed on april , . this study is purported to represent all of los angeles county, which had reported deaths on april th and ten days later. for our analysis, we took the number of deaths to be , corresponding to the approximate value one week after the tests. • two different sets of numbers have been reported for new york -one for new york city and one for new york state. here the statewide numbers are taken. the number of positive tests in new york state was estimated based on the reported fraction of positive tests in [ ] . the testing was started on april and the fraction of positive tests for the antibodies was reported as 'nearly . %' from tests. the number chosen for the analysis was positive cases. the number of deaths on april th was approximately . the number of deaths around this date was increasing by about per day, or % per day. this data is the most statistically important in the study. as we will see, it also results in the highest ifr value of the data sets. • the number of positive tests in miami (dade county) was estimated based on the reported fraction of positive tests in [ ] . there, it is said that 'nearly individuals have participated in the program' and ' %' tested positive for antibodies. the tests were carried out in the time window from april - , and the number of deaths reported as due to covid- is taken from april th -i.e., it is likely an overestimate of the ifr for the miami area. 
• in studies conducted in geneva, random samples of the population were tested on consecutive weeks and early results of the study have been made available [ ] . we use the results for the testing performed in the third week (the week of april , ), where individuals were tested, resulting in positive tests. as of april , deaths had been reported for a population of nearly , . the specificity of the antibody tests was very high, with no false positives in a sample of known negative cases, and a true positive rate of . %. the authors used these values to evaluate a somewhat higher infected fraction than seen in the raw numbers ( . % versus . %). as in the other cases, we do not correct for this difference. the basic assumption made in this article is that, for a large population base and for populations with similar levels of medical care, the ifr from covid- should not differ widely once the population has been infected across all classes of individuals evenly (e.g., across age groups). there are clearly very significant differences in ifr amongst age groups, and many other factors play crucial roles in individual cases such as medical preconditions or the availability of top-rate medical services. it is assumed that these will tend to average out for large (say , or more) population groups in many countries so that it is meaningful to talk about the average lethality of the disease for these areas. this averaging has likely not yet occured in many areas as we are still in the relatively early days of the pandemic, so that our results will not be truly representative in individual cases. however, it should be the case that by taking a number of different studies these effects tend to average out. more discussion on the biases that can result can be found in section . given our basic assumption, it becomes possible to track the average number of infected people based on the number of recorded deaths due to covid- . 
to arrive at the population averaged ifr results, the data in table were analyzed as follows. first, the probability of a positive test in a random sample was taken to follow a binomial distribution: where f represents the probability that a person in the area of interest was infected, n is the number of random tests performed, and k is the number of positive tests. the value of f clearly varies widely across the different areas as seen in table . we note that the value of k also depends on the efficiency of the test in positive cases, and can also be overestimated by false positive cases (specificity of the tests). we do not attempt to correct for these effects and assume here that they do not significantly bias the results. expert knowledge in these issues would be needed for this, which we do not possess. we discuss the possible variations in the results in section . the probability of having a number of deaths d given a population size p is then taken to follow a poisson probability distribution as the ifr is known to be a small number. we have that the expected number of infected people in the sample is: and the expected number of deaths due to covid- from this number of infected people will be (after the appropriate lag time): where here r is the ifr value for the particular sample. we note again that the number of deaths is counted significantly later than the number of infected people (typically days later in our analysis). the probability of observing d deaths is then: in terms of the parameters that we use in the analysis, we write this as: the number of reported deaths due to covid- is subject to different definitions and should also be considered as uncertain. again, we do not attempt a correction for these variations as we do not have the competence for this. possible variations in our results due to this uncertainty in section . . . probability distribution for the ifr. 
to implement the condition that the ifr should not differ too broadly in different areas, the individual ifr values, r i in the different regions were assumed to come from a 'parent distribution' that is meant to represent the range of possible ifr results. this parent distribution should clearly have zero probability for an ifr value of zero, and cannot be larger than a few % based on known results. we therefore choose distributions that can represent this behavior. our primary results are derived using a log-normal distribution, and the weibull distribution is used as an alternative to test the systematic uncertainties pertaining to our choice. it is important to realize that the distribution that we extract for the ifr is the main result of the analysis. i.e., we are not expecting a single value for the ifr in all regions, but expect that all values are within roughly a factor - of each other, with variations depending on the particular conditions in the area under consideration. the extracted distribution should represent in some approximate way the distribution of ifr across a wide range of conditions. once we extract the parameters of this distribution, we then use it to define a range of infections for different countries in the next section. the log-normal probability distribution that is used to describe the ifr probability in different regions is defined as: this distribution introduces two parameters in the analysis, µ and σ, that control the log-normal distribution. the parameter µ is the mean of the log-normal distribution while σ controls the width of the distribution. the possible shapes of the log-normal distribution considered in our analysis are shown in fig. . the analysis is carried out using bayes theorem to yield the probability distributions on the parameters of interest: where the bold-faced symbols represent vectors of numbers. the analysis was carried out using the bayesian analysis toolkit (see [ ] ). 
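The two sampling distributions described above, a binomial term for the random-test counts and a Poisson term for the deaths with expectation r·f·P, can be combined into a joint likelihood and scanned for the most probable (f, r) pair. The sketch below uses made-up inputs (not the table values) and a crude grid scan in place of the full Bayesian fit; all names are illustrative.

```python
import math

def log_binomial(k, n, f):
    # log P(k | n, f): k positive tests out of n, infected fraction f
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(f) + (n - k) * math.log(1.0 - f))

def log_poisson(d, lam):
    # log P(d | lam): d observed deaths, expected number of deaths lam
    return d * math.log(lam) - lam - math.lgamma(d + 1)

def log_likelihood(f, r, n, k, pop, deaths):
    # joint log-likelihood: expected deaths are r * f * pop (r is the IFR)
    return log_binomial(k, n, f) + log_poisson(deaths, r * f * pop)

# illustrative inputs, not the values from the table
n, k, pop, deaths = 3300, 50, 1_900_000, 94

# coarse grid scan for the most probable (f, r) pair
grid = ((fi / 10_000.0, ri / 100_000.0)
        for fi in range(100, 300) for ri in range(200, 500))
f_best, r_best = max(grid, key=lambda p: log_likelihood(p[0], p[1], n, k, pop, deaths))
```

As expected, the binomial term pins f near k/n, and the Poisson term then pins r near deaths/(f·pop).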
the individual terms in our expression are given as follows: we have that . ifr distribution from the random testing data. the log-normal distributions for the extremes of the parameter ranges allowed in the analysis are shown in the dotted and dashed curves. all log-normal distributions with shape between these extremes are allowed in the analysis. the shape of the distribution from the fit to the data in table using the most probable values of the probability distributions for µ and σ is shown as the solid curve. where the two terms on the right hand side of the equation are products of terms of the kind given in equations and . the second term in the numerator is: the prior probabilities (those with subscript ) are all taken as flat prior probabilities and p (r|µ, σ) is given by the product of expressions of the form given in eq. . fitting the data in table yields the results shown in fig. and . the shape of the log-normal distribution using the best-fit parameter values is shown in fig. . the best-fit distribution is comfortably between the limits set in the analysis. as is seen in the plot, the peak of the distribution is for an ifr around . %, with a tail out to approximately . %. the ifr values extracted from the random testing data and the total number of deaths recorded one week after the tests therefore lead to the conclusion that the ifr is generally below %. the results from the ifr analysis from the individual studies are displayed in fig. . the overall range of values of ifr from the log-normal distribution is shown together with the individual r i results extracted from the fit. for the individual fits, best fit values are shown together with the % uncertainties as horizontal bands. the median and central % probability range of the ifr from the log-normal distribution are also shown in the plot, and the shaded background indicates the probability density of the ifr distribution. 
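The log-normal parent distribution for the IFR described above can be evaluated numerically; the sketch below uses illustrative shape parameters µ and σ (not the fitted values) chosen so that the density peaks below 1% with a tail toward larger values, matching the qualitative shape discussed in the text.

```python
import math

def lognormal_pdf(x, mu, sigma):
    # parent density for the IFR: ln(IFR) ~ Normal(mu, sigma**2), x > 0
    return (math.exp(-(math.log(x) - mu) ** 2 / (2.0 * sigma ** 2))
            / (x * sigma * math.sqrt(2.0 * math.pi)))

# illustrative shape parameters: peak below 1%, tail toward larger IFR values
mu, sigma = math.log(0.004), 0.5
mode = math.exp(mu - sigma ** 2)    # most probable IFR under this shape
median = math.exp(mu)               # half of the regions lie below this IFR
```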
while there is some spread in the results derived from different studies, the range is not so large as that seen in the cfr values, with individual ifr values ranging from approximately . % to . %. in addition to these results, the simple scaling result (deaths resulting from covid- ) divided by (observed positive test fraction multiplied by the population of the area under study) are also shown. as a next step, we use these results to evaluate the infected fractions in different countries and also evaluate the effectiveness of diagnosing positive cases using currently used methods. we now take the results from the ifr analysis and apply it to different countries to evaluate both the degree of infections as well as the effectiveness of recognizing infected and infectious cases. for this, we use the data given in table . the data were taken from [ ] and the population numbers are from [ ]. . . estimating the delay between infection reports and death reports. as a first step, we extract the relationship between the number of recorded deaths due to covid- and the number of recorded infections. the number of recorded deaths follow a similar development as the number of recorded infections, but is clearly delayed in time. this can be seen in fig. where we show the results for germany as an example. the number of deaths can be predicted from the number of infections by forecasting the number of deaths as a fraction of the number of infected cases and distributing these over a number of days. mathematically, we predict the number of deaths on day t as: where i(t) is the number of recorded infections on day t, s is a scale factor to predict the number of deaths (the cfr), ∆ quantifies a shift between the recorded infections and the recorded deaths, and σ d describes how the deaths are distributed and is given in days. 
Note that we use lower-case letters (d, i) to represent the daily death and recorded-infection values, while the upper-case letters are reserved for the integrated quantities; i.e., the symbol d̂(T) is our estimate for the number of deaths on day T according to our model. A truncated normal distribution is used for p(t|t′ + ∆, σ_d) to model the spread of the infected cases forward in time. The expression is evaluated at each date T. In order to extract the relevant parameter values, a χ² minimization is performed. The result for Germany is shown in Fig. and led to the parameters s = . , ∆ = . days and σ_d = . days. Recall that s is the CFR, so we have a case fatality ratio of about % for Germany. The results for the parameter ∆ are shown in Fig. for all countries given in Table . The figure shows the extracted probability density for ∆ as the blue shaded band, and also the most probable values as well as one-standard-deviation intervals. As is seen, the results can be divided roughly into three categories: countries with very small delay, a set of countries with a delay of − days (these are countries where typically a large number of infections were reported), and countries such as Germany where the delay is closer to weeks. The latter set of countries typically did not have a surge of infections that stressed the medical resources of the country. We use these values of ∆ to extract the date at which we will report the fraction of the infected population in the following. The comparison of the predicted values of deaths using the expression in Eq. and the observed values is shown in Fig. for all countries in Table . As can be seen in the figure, a reasonable description is achieved in most cases. Some countries show spikes in the reported death distributions due to changes in counting procedures. These spikes cannot be modeled by our simple equation and are not real effects, but due to re-evaluation of the standards for the counting.
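The forecasting model described above, with deaths predicted as a scaled, delayed spread of the daily infections under a discretized truncated-normal kernel, can be sketched as follows. The values of s, ∆, σ_d and the flat infection series are illustrative stand-ins, not the fitted values for any country.

```python
import math

def truncated_normal_weights(delta, sigma_d, horizon):
    """Discretized forward-in-time kernel: weight of day t+j for day-t infections."""
    w = [math.exp(-(j - delta) ** 2 / (2.0 * sigma_d ** 2)) for j in range(horizon)]
    total = sum(w)
    return [x / total for x in w]     # truncated at 0..horizon-1 and renormalized

def predict_deaths(daily_infections, s, delta, sigma_d, horizon=60):
    """d_hat(t) = s * sum_j i(t - j) * p(j); s plays the role of the CFR."""
    w = truncated_normal_weights(delta, sigma_d, horizon)
    n = len(daily_infections)
    d_hat = [0.0] * n
    for t, i_t in enumerate(daily_infections):
        for j, wj in enumerate(w):
            if t + j < n:
                d_hat[t + j] += s * i_t * wj
    return d_hat

# synthetic check: a flat infection curve should give s * i deaths/day after burn-in
infections = [1000.0] * 120
pred = predict_deaths(infections, s=0.04, delta=15.0, sigma_d=5.0)
```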
estimating the fraction of the population infected with covid- . the results from the delay analysis allow us to estimate the fraction of the population that has been infected at a fixed date by using the number of reported deaths and scaling up using the ifr range from the log-normal distribution: is the number of reported deaths at date t . the value of the ifr, r, follows the log-normal distribution whose parameters are extracted from the most likely values obtained from the fit of µ and σ we performed in the previous sections . since the ifr value is taken from a probability distribution, we obtain a probability distribution forÎ, from which we can extract the % central interval limits. the results of the analysis for our selection of countries are reported in table . the median infected fractions range from nearly for china to nearly % for belgium. the % ranges span a factor of . i.e., there is a large uncertainty in the results coming from the fact that the ifr distribution is quite broad. for most countries, we can conclude however that the percentage of the population that has been infected is in the single digits. the limits that we extracted for the infected population of each country were based on the most recent number of deaths. as we have discussed earlier, the number of deaths is related to the number of infected individuals, but it also contains a time delay, quantified by ∆. this means that if we extrapolate the infected population from the current number of deaths, we will obtain an estimate that is representative of the past number of infected people, specifically ∆ days in the past. it is possible to estimateÎ at the current date by using the number of deaths we expect in the future, d(t + ∆). we can estimate the cumulative distribution of d(t + ∆) by propagating forward in time the number of deaths using the fitted model we described above and the current number of reported individuals. 
substituting these new estimates in equation we obtain: is the number of reported cases at the date (t − ∆), d(t ) is the reported number of deaths on the date t and r is the ifr value from the log-normal distribution. in extracting our estimate for the reported fraction of infections,f r (t ∆ ), we allowed the value of ∆ and r to vary according to the probability distributions from the fit yielding ∆ and from the log-normal distribution, respectively. this leads to a rather broad distribution forf r (t ∆ ). the results of the analysis are summarized in table and shown graphically in fig. . the reported percentages of diagnosed cases are typically also estimated to be in the single digit range. note that the same ifr range was used for all countries. if a given country is believed to have a lower ifr than the average, then for that country the reporting fraction will tend to be smaller, and vice-versa. table that have been infected with covid- . the blue shaded bands indicate the probability density as a function of the estimated infected fraction. the solid green line indicates the median estimate at the date (may , -∆) where the delay is country dependent and is shown in fig. a number of assumptions were made at various points in this analysis, and we estimate how our results vary when we take different assumptions into account. . . false positives in antibody tests. as noted in section , the number of reported infections in the random testing data is modified by false positive results and also by inefficiency of the tests to recognize genuine infections. we have that, for a single test, the probability to have a positive result is where p (+) is the probability of a positive test result, p (+|i) is the probability of a positive test result given that a person was infected, p (+|Ī) is the false positive test result and f is the probability that a person has been infected. 
turning the expression around, we have the denominator is taken to be close to , so that if the specificity of the test is high, meaning p (+|Ī) is very small, and p (+|i) ≈ , then we can assume that f ≈ p (+). for the santa clara study, we have that p (+) ≈ . . a . % false positive rate would lead to a correction of f downwards by approximately this value, or . %, which would increase the ifr value for santa clara by %. a false positive rate of % would triple the ifr value, bringing it close to the new york result. at this point, we can also recall that the authors of the santa clara study argued that the correct rate is closer to . % than . % due to biases in the random sampling population, so the two corrections would go in opposite directions. for the other studies, the values of the observed positive testing fraction is high enough that false positive rates below about % will not significantly affect the results. other reliability concerns of the random testing. in order to extract an ifr that should represent an average over the full population, it is not sufficient that the random sampling represents the different population subgroups. indeed, it is also important that the distribution of those infected in different population categories (age group, health conditions, ...) is known since there are strong correlations with these aspects of the population. if, e.g., the fraction of those infected in the upper age brackets is disproportionately high in a region related to one of the studies, then the extracted ifr from a random sampling of that population will also be high. this type of biasing must also be taken into account in extracting the final results. some of the studies mentioned (see e.g. [ , ] ) attempt to account for this type of biasing. we have not done so here, so that our results could potentially be biased due to this type of effect. 
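The inversion just described, recovering the true infected fraction f from the observed positive-test probability, the sensitivity P(+|I) and the false-positive rate P(+|not I), is a one-line correction. The numbers below are illustrative, not taken from any of the studies.

```python
def corrected_infected_fraction(p_pos, sens, fpr):
    """Invert P(+) = sens * f + fpr * (1 - f) for the true infected fraction f."""
    return (p_pos - fpr) / (sens - fpr)

raw = 0.015                                   # illustrative observed positive fraction
f_corr = corrected_infected_fraction(raw, sens=1.0, fpr=0.005)
ifr_inflation = raw / f_corr                  # factor by which the implied IFR grows
```

With a low raw positive fraction, even a half-percent false-positive rate shifts f downwards noticeably and inflates the implied IFR by the same factor, as the text notes.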
mathematically, we can represent the situation as follows: is the expected number of deaths, p(a, h, o) is the population density as a function of age (a), of health status, modeled here as a continuous quantity (h) and of possibility other relevant characteristics (o). this population density has associated an infected fraction f (a, h, o) that can depend on all of these characteristics. our assumption in the paper is that the random sampling data follows this product of f · p approximately correctly so that we can estimate the ifr result that we use in the analysis is rather broad and allows for factors of - differences and hopefully cover the remaining sampling uncertainties. . . how deaths are counted. there are clearly differences in how deaths due to covid- are counted in different regions. even within a single region, the definition can change as is clearly seen in fig. , where spikes in the death count appear. these uncertainties should be propagated in our analysis. one mitigating factor is that we are using integrated values, such that short term fluctuations are averaged out. however, revised death counts are typically in the upward direction, so an upward shift of the median ifr value (currently below . % from the random testing data) is likely. for the extraction of our probability intervals, we have used the log-normal distribution for the ifr, which has a broad range and a tail to larger values. we believe that this adequately accounts for this expected upward correction in the number of recorded deaths due to covid- . in this analysis, we have used a time delay between the date of the tests for antibodies and the date at which the number of deaths assigned to covid- were counted of one week. this delay was assumed based on modeling data [ ] that indicates on average days between hospitalization and death. 
it is not clear at which point antibodies will show up in significant amounts in the course of the infection, and the one week estimate in counting the development of antibodies and the number of deaths could be flawed. taking the usa as an example, a one day change in the date at which deaths are counted would change the results by approximately . %. for cases where the infection was developing quickly, the number could be considerably higher. we therefore assume that changes in the number of deaths could certainly be wrong by %. however, the range that we assign to the ifr from the log-normal distribution is much broader than this, so that this uncertainty is not expected to significantly affect the results. . . choice of the log-normal distribution. the log-normal distribution was chosen as the 'parent' distribution due to its well-motivated shape. other functional forms can also have similar shapes, one of which is the weibull distribution: requiring k > leads to a distribution starting at for x = and with a similar shape as the log-normal distribution. extraction the range of ifr values from this distribution led to minimal changes in the results, as can be seen in fig. . table using the most probable values of the parameters for (µ, σ) and (λ, k) are shown. in red the fit for the log-normal distribution and in blue the fit for the weibull distribution. . . constancy of ∆. in order to test the stability of the predicted number of deaths due to different choices of their distribution in the future, we performed our analysis adopting a log-normal distribution instead of a truncated normal in equations . in the case of germany, the results of this comparison are presented in fig. (top). we notice that the two prediction curves are practically indistinguishable from one another. 
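The Weibull alternative used for the systematic check can be sketched in the same way as the log-normal; λ and k below are illustrative values, chosen only so that k > 1 gives a density that vanishes at zero with a single interior mode, as the text requires.

```python
import math

def weibull_pdf(x, lam, k):
    # Weibull density; for k > 1 the density is 0 at x = 0, as for the log-normal
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

# illustrative shape and scale, not fitted values
lam, k = 0.006, 2.0
mode = lam * ((k - 1.0) / k) ** (1.0 / k)   # interior peak, analogous to the log-normal mode
```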
Additionally, in order to test the stability of the estimated delay between infection reports and death reports, we investigate the evolution of the parameter ∆ by performing the analysis with the time range limited to a specific day T. The results of this study are shown in Fig. (bottom). As we increase T, we notice that our estimate for the mean of ∆ approaches a constant value, and the error of this measure, estimated as the standard deviation of ∆, is also reduced in time.

An analysis of available random testing data has been performed in order to extract information on the lethality of the COVID-19 virus. Assuming this lethality does not vary widely across large population groups, the information was used to extract the fraction of infected individuals in a number of countries and also to estimate the fraction of infected individuals that have been reported as positively identified. A number of caveats apply:
• The random test samples used in many of the IFR data sets are quite small.
• The data sets, although quoted as random, are likely biased samples of the population groups targeted and may not represent the distribution of the infected population.
• The results from this paper rely on the time lag between infection (reporting) and death; the assumptions used may not be accurate.
In spite of these caveats, we find that there is good consistency in the extracted values for the infected fatality ratio, IFR, across the different random data sets. We use the extracted IFR to estimate the fraction of the population that has been infected, and the fraction of those infected that have been positively identified. We find a broad range of results for a sample of countries. For the infected fraction, the typical percentages are in the single digits. The reporting fraction tends to be closer to %. The results presented in this simple analysis contain both positive and negative messages. The positive message is that the most likely value of the IFR is below %.
It should however be remarked that the value has been extracted from wealthy countries with state-of-the-art medical care. It is quite possible that higher rates will be found in other regions. However, the number of cases that have been reported is a small (typically less than %) fraction of the infected population, and this will make contact tracing more difficult. Indeed, a higher identification fraction would be necessary for maximum effectiveness of contact tracing (see [ ]). However, there is no need to assume that contact tracing alone will be used to slow or stop the epidemic, and it will form just one of the approaches to limiting the spread of COVID-19 and further novel viruses. This is clearly foreseen, and effective measures such as wearing face masks and adhering to more social distancing will also surely bring a significant positive effect. Finally, it should be clear that there are still very large uncertainties on the parameters extracted in this analysis. A much greater level of random testing will make the parameter inference much more reliable and is strongly supported! We would like to thank a number of colleagues for informative discussions. In particular, the following colleagues from the Technical University, Munich significantly helped in our evaluations: Elisa Resconi, Stefan Schoenert, Ulrike Protzer and Rudi Zagst. We would also like to thank Richard Nisius for taking a critical look at an early version of this manuscript and for his valuable comments, and Xiaoyue Li, Olaf Behnke and Michele Giannelli for valuable discussions.

Reference: The Bayesian Analysis Toolkit repository.

key: cord- -nhcrbnfu authors: vollmer, robin title: understanding the dynamics of covid- date: - - journal: am j clin pathol doi: . /ajcp/aqaa sha: doc_id: cord_uid: nhcrbnfu

Understanding the dynamics of an epidemic such as that caused by coronavirus disease (COVID-19) requires some mathematics and some data.
for example, in a population of n persons, let y symbolize the number of infected persons and x symbolize the number of susceptible persons. then the change in the number of infected persons per unit time is given as:

dy/dt = c · p · x · y / n − v · y

here, c symbolizes the rate of contact, p the transmission probability, and v the recovery rate (ie, 1 − the case fatality rate). by letting x = n − y, the equation can be rewritten in the form:

dy/dt = a · y − b · y²

with a = c*p − v and with b = (a + v)/n. the solution to this differential equation is:

y(t) = a · y(0) · e^(a·t) / (a + b · y(0) · (e^(a·t) − 1))

with y(0) denoting the number infected at time zero and y(t) the number infected at time t. in what follows i illustrate this approach with data reported by jondavid klipp (jondavid@laboratoryecomics.ccsen.com) regarding the cruise ship diamond princess, which took on a single infected passenger from hong kong. subsequently, all on board were tested for covid-19, and of the , persons on board tested positive after being forced to remain on board for a month. ten died. thus, values for n, y, x, t, and v were known. using a value of approximately . for the product c*p, the prediction for y at days was , which equaled that observed, and figure shows a plot of predicted values of y over a course of days (the point is for the time of days when the passengers left the ship). the logic used in the equation clearly implies that the product c*p relates closely to how fast a population becomes infected, and this in turn depends on a number of factors including patient ages, comorbidities, geographical population densities and, of course, the virus. the value of approximately . for c*p in the diamond princess population may be high for covid-19, because the geography of this population was so restricted and because many passengers were older. only follow-up studies as the epidemic matures will provide additional estimates of c*p. the diamond princess data also suggest that many cases are asymptomatic.
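as a numerical check, the closed-form solution above can be evaluated directly. the parameter values below are placeholders chosen for illustration, not the paper's fitted values.

```python
import math

def logistic_infections(t, y0, n, cp, v):
    """Closed-form solution y(t) of dy/dt = a*y - b*y^2,
    with a = c*p - v and b = (a + v)/n = c*p/n."""
    a = cp - v
    b = (a + v) / n
    return a * y0 * math.exp(a * t) / (a + b * y0 * (math.exp(a * t) - 1.0))

# Hypothetical closed population seeded by one case (all values illustrative):
y_30 = logistic_infections(t=30, y0=1, n=3700, cp=0.5, v=0.1)
# the curve saturates near the equilibrium a/b = n*(c*p - v)/(c*p) = 2960
```

the saturation level a/b, rather than the full population n, is the limit the text refers to when it says the number of infected may eventually reach a plateau.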
thus, the widely reported numbers of new cases could omit cases that were asymptomatic and therefore not tested. this in turn could yield higher case fatality rates and therefore smaller estimates of the recovery rate, v, than are realistic. regardless, the success of the logistic growth model applied here to covid-19 suggests that for many populations the number of infected may eventually reach a limit.
figure : plot of the predicted number of infected persons vs days on board the diamond princess. the predictions come from the equation.
concepts of infectious disease epidemiology
differential equations and their applications

key: cord- -q de r p authors: griette, p.; magal, p. title: clarifying predictions for covid-19 from testing data: the example of new-york state date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: q de r p

in this article, we use testing data as an input of a new epidemic model. we obtain a good concordance between the best fit of the model and the reported case data for new-york state. we also obtain a good concordance between the testing dynamic and the epidemic's dynamic in the cumulative cases. finally, we investigate the effect of multiplying the number of tests by , , , and on the reduction of the number of reported cases. the epidemic of novel coronavirus infections began in china in december and rapidly spread worldwide in . since the early beginning of the epidemic, mathematicians and epidemiologists have developed models to analyze the data, characterize the spread of the virus, and attempt to project the future evolution of the epidemic. many of those models are based on the sir or seir model, which is classical in the context of epidemics. we refer to [ , ] for the earliest articles devoted to such a question, and we refer to [ , - , , , , , ] for a rather complete overview of sir and seir models in general.
in the course of the covid-19 outbreak, it became clear to the scientific community that covert cases (asymptomatic or unreported infectious cases) play an important role. an early description of an asymptomatic transmission in germany was reported by rothe et al. [ ]. it was also observed on the diamond princess cruise ship in yokohama, japan, by mizumoto et al. [ ] that many of the passengers tested positive for the virus but never presented any symptoms. we also refer to qiu [ ] for more information about this problem. at the early stage of the covid-19 outbreak, a new class of epidemic models was proposed in liu et al. [ ] to take into account the contamination of susceptible individuals by contact with unreported infectious individuals. actually, this class of model was presented earlier in arino et al. [ ]. in [ ], a new method to use the number of reported cases in sir models was also proposed. this method and model were extended in several directions by the same group in [ ] [ ] [ ] to include non-constant transmission rates and a period of exposure. more recently the method was extended and successfully applied to a japanese age-structured dataset in [ ]. the method was also extended to investigate the predictability of the outbreak in several countries (china, south korea, italy, france, germany and the united kingdom) in [ ]. the application of the bayesian method was also considered in [ ]. in parallel with these modeling ideas, bayesian methods have been widely used to identify the parameters in the models used for the covid-19 pandemic (see e.g. roques et al. [ , ], where an estimate of the fatality ratio has been developed). a remarkable feature of those methods is to provide mechanisms to correct some of the known biases in the observation of cases, such as the daily number of tests. here we will embed the data for the daily number of tests into an epidemic model, and we will compare the number of reported cases produced by the model with the data.
our goal is to understand the relationship between the data for the daily number of tests (which will be an input of our model) and the data for the daily number of reported cases (which will be an output of our model). the plan of the paper is the following. in section , we present a model involving the daily number of tests. in section , we apply the method presented in [ ] to our new model. in section , we present some numerical simulations, and we compare the model with the data. the last section is devoted to the discussion. let n(t) be the number of tests per unit of time. throughout this paper, we use one day as the unit of time. therefore n(t) can be regarded as the daily number of tests at time t. the function n(t) is actually coming from a database for the new-york state [ ]. let N(t) be the cumulative number of tests from the beginning of the epidemic; then N'(t) = n(t) for t ≥ t , and N(t ) = N . ( . ) remark . section is devoted to numerical simulations. we will use n(t) as a piecewise constant function that varies day by day. each day, n(t) will be equal to the number of tests that were performed that day. so n(t) should be understood as the black curve in figure . the model consists of the following system of ordinary differential equations. this system is supplemented by initial data (which are all non-negative), thereby assuming that the disease was introduced by an individual incubating the disease at some time before t . the time t corresponds to the time when the tests started to be used constantly. therefore the epidemic started before t . here t ≥ t is the time in days. s(t) is the number of individuals susceptible to infection. e(t) is the number of exposed individuals (i.e., who are incubating the disease but not infectious). i(t) is the number of individuals incubating the disease, but already infectious. u(t) is the number of undetected infectious individuals (i.e.,
who are expressing mild or no symptoms), together with the infectious individuals that have been tested with a false negative result and are therefore not candidates for testing. d(t) is the number of individuals who express severe symptoms and are candidates for testing. r(t) is the number of individuals who have been tested positive for the disease. the flux diagram of our model is presented in figure . susceptible individuals s(t) become infected by contact with an infectious individual i(t), u(t) or d(t). when they get infected, susceptibles are first classified as exposed individuals e(t), that is to say that they are incubating the disease but not yet infectious. the average length of this exposed period (or non-infectious incubation period) is 1/α days. after the exposure period, individuals become asymptomatic infectious i(t). the average length of the asymptomatic infectious period is 1/ν days. after this period, individuals become either mildly symptomatic individuals u(t) or individuals with severe symptoms d(t). the average length of this infectious period is 1/η days. some of the u-individuals may show no symptoms at all. in our model, the transmission can occur between an s-individual and an i-, u- or d-individual. transmissions of sars-cov-2 are described in the model by the term τ s(t)[i(t) + u(t) + d(t)], where τ is the transmission rate. here, even though a transmission from an r-individual to an s-individual is possible in theory (e.g. if a tested patient infects their medical doctor), we consider that such a case is rare and we neglect it.
figure: key time periods of covid-19 infection: the latent or exposed period before the onset of symptoms and transmissibility, the incubation period before symptoms appear, the symptomatic period, and the transmissibility period, which may overlap the asymptomatic period.
the last part of the model is devoted to the testing. the parameter σ is the fraction of true positive tests and (1 − σ) is the fraction of false negative tests.
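the compartment flows described above can be sketched as a single explicit-euler update. this is a minimal sketch under stated assumptions: the split of newly infectious individuals between u and d is written with a hypothetical fraction f, and tested-positive individuals are removed at rate η; neither value is spelled out numerically in this excerpt.

```python
def seiudr_step(state, dt, n_t, tau, alpha, nu, eta, sigma, g, f):
    """One explicit-Euler step of the test-driven model sketched in the text.
    state = (S, E, I, U, D, R); n_t is the daily number of tests at this step.
    f is a placeholder for the fraction of infectious who become detectable (D)."""
    S, E, I, U, D, R = state
    new_infections = tau * S * (I + U + D)        # contacts with I, U and D
    dS = -new_infections
    dE = new_infections - alpha * E               # 1/alpha: exposed period
    dI = alpha * E - nu * I                       # 1/nu: asymptomatic infectious period
    dU = (1 - f) * nu * I + n_t * (1 - sigma) * g * D - eta * U   # false negatives join U
    dD = f * nu * I - n_t * g * D - eta * D       # D-individuals tested at rate n(t)*g
    dR = n_t * sigma * g * D - eta * R            # true positives, assumed isolated
    return tuple(x + dt * dx for x, dx in zip(state, (dS, dE, dI, dU, dD, dR)))
```

with η = 0 and no testing, the step conserves the total population, which is a quick sanity check on the signs of the fluxes.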
the quantity σ has been estimated at σ = . in the case of nasal or pharyngeal swabs for sars-cov-2 [ ]. among the detectable infectious, we assume that only a fraction g are tested per unit of time. this fraction corresponds to individuals with symptoms suggesting a potential infection with sars-cov-2. the fraction g is the frequency of testable individuals in the population of new-york state. we can rewrite g as follows, where p is the total number of individuals in the population of the state of new-york and 0 ≤ κ ≤ 1 is the fraction of the total population with mild or severe symptoms that may induce a test. individuals who were tested positive, r(t), are infectious on average during a period of 1/η days, but we assume that they become immediately isolated and do not contribute to the epidemic anymore. in this model we focus on the testing of the d-individuals. the quantity n(t) σ g d is the flux of successfully tested d-individuals which become r-individuals. the flux of tested d-individuals which are false negatives is n(t) (1 − σ) g d, which go from the class of d-individuals to the u-individuals. the parameters of the model and the initial conditions of the model are listed in table . before describing our method we need to introduce a few useful identities. the cumulative number of reported cases is obtained by using the following equation ( . )
all rights reserved. no reuse allowed without permission. the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity; this version posted october , . https://doi.org/ . / . . . doi: medrxiv preprint
the daily number of reported cases dr(t) is given by ( .
) the cumulative number of detectable cases and the cumulative number of undetectable cases are given by ( . ).
notation: t, time (in days); r(t), number of reported (tested infectious) cases at time t; cr(t), cumulative number of reported (tested infectious) cases at time t; dr(t), daily number of reported (tested infectious) cases at time t; cu(t), cumulative number of undetectable infectious at time t.
in order to deal with data, we need to understand how to set the parameters as well as some components of the initial conditions. in order to do so, we extend the method presented first in [ ]. the main novelty here concerns the cumulative number of tests, which is assumed to grow linearly at the beginning. this property is satisfied for the new-york state data, as we can see in figure . this means that we can find a pair of numbers a and N such that N(t) ≈ a (t − t ) + N , where a is the daily number of tests and N is the cumulative number of tests on day t . by using the fact that N'(t) = n(t), we deduce that n(t) ≈ a from mid-march to mid-april. figure shows that the linear growth assumption is reasonable for the new-york state cumulative testing data. phenomenological models for the reported cases: at the early stage of the epidemic, we assume that all the infected components of the system grow exponentially, while the number of susceptibles remains unchanged during a relatively short period of time t ∈ [t , t ]. therefore, we deduce that the cumulative number of reported cases satisfies the corresponding equation. hence, by replacing d(t) by the exponential formula ( . ), it makes sense to assume that cr(t) − cr(t ) has the following form. by identifying ( . ) and ( . ) we deduce that
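the linear-growth assumption on the cumulative number of tests can be checked with an ordinary least-squares fit. this is a generic illustration on synthetic numbers, not the new-york data.

```python
def fit_line(ts, Ns):
    """Least-squares fit of cumulative tests N(t) = a*(t - t0) + N0,
    with t0 taken as the first day in ts. Returns (a, N0)."""
    t0 = ts[0]
    xs = [t - t0 for t in ts]
    k = len(xs)
    mx = sum(xs) / k
    mN = sum(Ns) / k
    a = sum((x - mx) * (N - mN) for x, N in zip(xs, Ns)) / \
        sum((x - mx) ** 2 for x in xs)
    N0 = mN - a * mx
    return a, N0

# Synthetic early-epidemic data: a constant 5000 tests/day on top of a stock of 20000.
days = list(range(60, 70))
cumulative = [20000 + 5000 * (d - 60) for d in days]
a, N0 = fit_line(days, cumulative)  # a = 5000 tests/day, N0 = 20000
```

the fitted slope a is the constant daily testing rate used in the identification of the initial conditions.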
by using ( . ) we obtain ( . ). finally, by using ( . ), d = χ χ σ a g ( . ), and by using ( . ) we obtain ( . ). we assume that the transmission coefficient takes the form where τ > 0 is the initial transmission coefficient, t m > 0 is the time at which the social distancing starts in the population, and µ > 0 serves to modulate the speed at which this social distancing takes place. to take into account the effect of social distancing and public measures, we assume that the transmission coefficient τ(t) can be modulated by γ. indeed, by the closing of schools and non-essential shops and by imposing social distancing on the population of the new-york state, the number of contacts per day is reduced. this effect was visible on the news during the first wave of the covid-19 epidemic in new-york city, since the streets were almost empty at some point. the parameter γ > 0 is the percentage of the number of transmissions that remain after a transition period (depending on µ), compared to a normal situation. a similar non-constant transmission rate was considered by chowell et al. [ ]. in figure we consider a constant transmission rate τ(t) ≡ τ, which corresponds to γ = 1 in ( . ). in order to evaluate the distance between the model and the data, we compare the distance between the cumulative number of cases cr produced by the model and the data (see the orange dots and orange curve in figure (a)). in figure (c) we can observe that the cumulative number of cases increases up to more than millions of people, which indeed is not realistic. nevertheless, by choosing the parameter g = . × − = / s in figure (d), we can see that the orange dots and the blue curve match very well.
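the decaying transmission coefficient described here can be sketched as follows. since the excerpt does not reproduce the formula itself, the exponential-relaxation form below (constant before t m, decaying toward the fraction γ afterwards) is a plausible reading of the text rather than the paper's exact expression.

```python
import math

def transmission_rate(t, tau0, gamma, mu, tm):
    """Sketch of a time-dependent transmission coefficient tau(t):
    equal to tau0 before social distancing starts at tm, then relaxing
    exponentially (speed mu) toward gamma*tau0, where gamma is the
    fraction of transmissions remaining after the transition period."""
    if t < tm:
        return tau0
    return tau0 * (gamma + (1.0 - gamma) * math.exp(-mu * (t - tm)))
```

note that gamma = 1 recovers the constant rate τ(t) ≡ τ used in the no-intervention scenario, consistent with the remark in the text.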
we plot the number of cases obtained from the model. we can observe that most of the cases are unreported. in figure (d) we plot the daily number of tests (black dots), the daily number of positive cases (red dots) for the state of new-york, and the daily number of cases dd(t) obtained from the data. in the rest of this section, we focus on the model with confinement (or social distancing) measures. we assume that such social distancing measures have a strong impact on the transmission rate by assuming that γ = . < 1. it means that only % of the transmissions will remain after a transition period. in figure (c) we can observe that the cumulative number of cases increases up to (blue curve) while the cumulative number of reported cases goes up to . in figure (d) we can see that the orange dots and the blue curve match very well again. in order to get this fit we fix the parameter g = − . in figures (a) and (b), we aim at understanding the connection between the daily fluctuations of the number of reported cases (epidemic dynamics) and the daily number of tests (testing dynamics). the combination of both the testing dynamics and the infection dynamics indeed gives a very complex curve parametrized by time. it seems that the only reasonable comparison that we can make is between the cumulative number of reported cases and the cumulative number of tests. in figure , all the curves are time-dependent parametrized curves. the abscissa is the number of tests (horizontal axis) and the ordinate is the number of reported cases (vertical axis). this corresponds (with our notations) to the parametric functions t → (n data (t), dr(t)) in figures (a) and (b) and their cumulative equivalent t → (n data (t), cr(t)) in figures (c) and (d).
in figures (a) and (c) we use only the data, that is to say that we plot t → (n data (t), dr data (t)) and t → (n data (t), cr data (t)). in figures (b) and (d) we use only the model for the number of reported cases, that is to say that we plot t → (n data (t), dr model (t)) and t → (n data (t), cr model (t)).
figure : in figure (a) we plot the daily number of cases coming from the data as a function of the daily number of tests. in figure (b) we plot the daily number of cases given by the model as a function of the cumulative number of cases coming from the data. in figure (c) we plot the cumulative number of cases coming from the data as a function of the cumulative number of tests. in figure (d) we plot the cumulative number of cases coming from the model as a function of the cumulative number of tests from the data.
in figure , our goal is to investigate the effect of a change in the testing policy in the new-york state. we are particularly interested in estimating the effect of an increase of the number of tests on the epidemic. indeed, people commonly say that increasing the number of tests will be beneficial to reduce the number of cases. so here, we try to quantify this idea by using our model. in figure , we replace the daily number of tests n data (t) (coming from the data for new-york's state) in the model by either × n data (t), × n data (t), × n data (t) or × n data (t). as expected, an increase of the number of tests helps to reduce the number of cases. however, after increasing the number of tests times, there is no significant difference (in the number of reported cases) between times and times more tests.
therefore there must be an optimum between increasing the number of tests (which costs money and other limited resources) and being efficient to slow down the epidemic.
parameter values are the same as in figure . in figure (a) we plot the cumulative number of cases cr(t) as a function of time. in figure (b) we plot the cumulative number of undetectable cases cu(t) as a function of time. in figure (c) we plot the cumulative number of cases (including covert cases) cd(t) as a function of time. note that the total number of cases (including covert cases) is reduced by % when the number of tests is multiplied by .
in this article, we propose a new epidemic model involving the daily number of tests as an input of the model. the model itself extends our previous models presented in [ , [ ] [ ] [ ] [ ] [ ]]. we propose a new method to use the data in such a context, based on the fact that the cumulative number of tests grows linearly at the early stage of the epidemic. figure shows that this is a reasonable assumption for the new-york state data from mid-march to mid-april. our numerical simulations show a very good concordance between the number of reported cases produced by the model and the data in two very different situations. indeed, figures and correspond respectively to an epidemic without and with public intervention to limit the number of transmissions. this is an important observation, since it shows that testing data and reported cases are not sufficient to evaluate the real amplitude of the epidemic. to solve this problem, the only solution seems to be to include a different kind of data in the models. this could be done by studying statistically representative samples in the population. otherwise, biases can always be suspected.
such a question is of particular interest in order to evaluate the fraction of the population that has been infected by the virus and their possible immunity. in figure , we compare the testing dynamics (day-by-day variation in the number of tests) and the reported-case dynamics (day-by-day variation in the number of reported cases). the daily curves are extremely complex, but we obtain relatively robust curves for the cumulative numbers, and our model gives a good fit for these cumulative cases. in figure , we compare multiple testing strategies. by increasing the number of tests by factors of , , and , we observe that this is efficient up to some point, but that increasing the number of tests times does not make a big difference. therefore, it is useless to test too many people, and there must be an optimum between the cost of the tests and the efficiency of the evaluation of the number of cases.
references:
infectious diseases of humans: dynamics and control
simple models for containment of a pandemic
the mathematical theory of epidemics
mathematical epidemiology
mathematical models in epidemiology
vertically transmitted diseases: models and dynamics
the basic reproductive number of ebola and the effects of public health measures: the cases of congo and uganda
modelling the covid-19 epidemics in brasil: parametric identification and public health measures influence
mathematical tools for understanding infectious disease dynamics
unreported cases for age dependent covid-19 outbreak in japan
the mathematics of infectious diseases
modeling infectious diseases in humans and animals
understanding unreported cases in the 2019-ncov epidemic outbreak in wuhan, china, and the importance of major public health interventions
predicting the cumulative number of cases for the covid-19 epidemic in china from early data
a covid-19 epidemic model with latency period
a model to predict covid-19 epidemics with applications to south korea
predicting the number of reported and unreported cases for the covid-19 epidemics in china
estimating
the asymptomatic proportion of coronavirus disease (covid-19) cases on board the diamond princess cruise ship
covert coronavirus infections could be seeding new outbreaks
using early data to estimate the actual infection fatality ratio from covid-19 in france
effect of a one-month lockdown on the epidemic dynamics of covid-19 in france
transmission of 2019-ncov infection from an asymptomatic contact in germany
mathematics in population biology
estimation of the transmission risk of the 2019-ncov and its implication for public health interventions
evolving epidemiology and impact of non-pharmaceutical interventions on the outbreak of coronavirus disease
nowcasting and forecasting the potential domestic and international spread of the 2019-ncov outbreak originating in wuhan, china: a modelling study

key: cord- -pjoqahhk authors: li, x.; cai, y.; ding, y.; li, j.-d.; huang, g.; liang, y.; xu, l. title: discrete simulation analysis of covid-19 and prediction of isolation bed numbers date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: pjoqahhk

background. the outbreak of covid-19 has been defined by the world health organization as a pandemic, and containment depends on traditional public health measures. however, the explosive growth of the number of infected cases in a short period of time has caused tremendous pressure on medical systems. adequate isolation facilities are essential to control outbreaks, so this study aims to quickly estimate the demand for and number of isolation beds. methods. we established a discrete simulation model for epidemiology. by adjusting or fitting the necessary epidemic parameters, the effects of the following indicators on the development of the epidemic and the occupation of medical resources were explained: (1) incubation period, (2) response speed and detection capacity of the hospital, (3) disease cure time, and (4) population mobility.
finally, a method for predicting the reasonable number of isolation beds was summarized through multiple linear regression. results. through simulation, we show that the incubation period, response speed and detection capacity of the hospital, disease cure time, degree of population mobility, and infectivity of cured patients have different effects on the infectivity, scale, and duration of the epidemic. among them, (1) incubation period, (2) response speed and detection capacity of the hospital, (3) disease cure time, and (4) population mobility have a significant impact on the demand for and number of isolation beds (p < . ), which agrees with the following regression equation: n = p * (- . + . i + . m + . t + . t ) * ( + v). sars-cov-2 is a novel coronavirus that has the ability of human-to-human transmission , . covid-19 has been declared by the world health organization (who) a pandemic (the worldwide spread of a new disease) . as of april , , more than cases of covid-19 have been reported in different countries and territories . currently, researchers around the world are making every effort to clarify the biological and epidemiological characteristics of sars-cov-2 and strive to explore effective coping strategies [ ] [ ] [ ]. covid-19 is extremely contagious, and its explosive growth in a short space of time has caused tremendous pressure on medical resources . conventional medical systems have difficulty meeting the needs for detection capacity for suspected cases and for the number of isolation beds for treatment and isolation [ ] [ ] [ ]. the number of isolation beds is crucial to reduce the scale of infection and the number of fatalities. too few isolation beds can lead to the continuation of the epidemic, and too many isolation beds may cause waste and environmental damage [ ] [ ] [ ]. to explore a reasonable number of isolation beds, we established a discrete simulation model of epidemics based on covid-19.
by setting different epidemic indicators (incubation period, hospital response time, healing time, population mobility rate), we analyzed the changing patterns, peak value, and scale of the epidemic in different situations. in particular, we paid attention to the occupation of medical resources during the outbreak. we summarized the epidemic indicators related to the number of isolation beds through multiple linear regression and estimated the number of isolation beds through these indicators. the conclusions are practical and can provide support for the reasonable scheduling of medical resources and the search for effective solutions in the current outbreak or in similar future outbreaks. the incubation period is an asymptomatic stage in the early phase of disease development, at which point patients themselves will not suspect that they have been infected. we compared the infection situation for different incubation periods. anova showed that the mean values of all indicators among groups were not exactly the same (p < . ). the detailed differences among groups are shown in table . further analysis was performed using multiple comparisons. the maximum number of incubation cases, the sum of infected cases and the corresponding date of the peak number of inpatients were significantly different between any two groups (p < . ). a longer incubation period increased these epidemic indicators. the maximum number of newly confirmed cases and their corresponding dates, the corresponding date of peak incubation cases, the maximum value of rt, the duration of the epidemic, and the maximum number of inpatients were not exactly equal among the groups for different incubation periods, and there were significant differences among some groups (p < . ). the above indicators increased with the increase in the incubation period (figure ).
(figure anova showed that there were no significant differences in the corresponding date of the maximum number of incubation cases among different response time groups (p> . ), and the mean of the other indicators among the groups was not exactly equal (p< . ). the detailed differences among the groups are shown in table . the multiple comparisons showed that the sum of infected cases between any two groups was significantly different (p< . ). the sum of the infected cases increased with the extension of response time. the maximum number of newly confirmed cases and its corresponding date, the maximum number of incubation cases, the maximum of rt, the duration of the epidemic, the maximum number of inpatients and its corresponding date were not exactly equal among the groups at different response times, and there were significant differences among some groups (p< . ). the above indicators increased with the extension of the response time ( figure ). table . further multiple comparisons showed that the maximum number of inpatients between any two groups was significantly different (p< . ). the maximum number of inpatients increased with the extension of the healing time. the maximum value of rt and the corresponding date of peak inpatient number were not exactly equal among the groups at different healing times, and there were significant differences among some groups (p< . ). the extension of the healing period promoted the increase in the above indicators ( figure ). the population mobility rate refers to the proportion of people in motion to the total population. we compared the infection situation of different population mobility rates. anova showed that there were no significant differences in the maximum value of rt and the duration of the epidemic among the groups (p> . ), and the mean of other indicators among the groups was not exactly equal (p< . ). the detailed differences among the groups are shown in table . 
the multiple comparisons showed that the sum of infected cases between any two groups was significantly different (p < . ). an increase in the population mobility rate caused a higher sum of infected cases. the maximum number of newly confirmed cases and its corresponding date, the maximum number of incubation cases and its corresponding date, and the maximum number of inpatients and its corresponding date were not exactly equal among the groups for different population mobility rates, and there were significant differences among some groups (p < . ). among them, the maximum number of newly confirmed cases, the maximum number of incubation cases and the maximum number of inpatients increased with the increase in the population mobility rate. at the extreme value of %, that is, when everyone was inactive, the corresponding date of peak incubation cases and the corresponding date of the peak inpatient number were significantly advanced, which was significantly different from the other groups (figure ).
it is made available under a cc-by-nc-nd . international license. the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity; this version posted july , .
the hospital isolation capacity is defined as the proportion of the actual number of isolation beds to the demanded number of isolation beds. anova showed that there were no significant differences in the maximum number of newly confirmed cases, the maximum number of incubation cases and the maximum value of rt among the different isolation capacity groups (p > . ), and the means of the other indicators among the groups were not exactly equal (p < . ).
The detailed differences among groups are shown in the table. Multiple comparisons showed that the corresponding date of the peak inpatient number and the duration for which isolation facilities ran at full capacity differed significantly between any two groups (p < . ). The date of the inpatient peak was delayed as isolation capacity decreased, and the duration at full capacity increased with the shortage of isolation beds, following a quadratic relationship (Supplementary Material). In addition, the corresponding date of peak newly confirmed cases, the corresponding date of incubation cases, the sum of infected cases, and the duration of the epidemic were not all equal among the groups, with significant differences between some groups (p < . ). In the severely inadequate isolation-capacity group ( %), the corresponding date of incubation cases, the mean value of Rt, the sum of infected cases, and the duration of the epidemic were all significantly increased (p < . ) (Figure).

To further explore the rational setting of isolation beds in the medical system under multifactor epidemic conditions, we analysed the relationship between the epidemic indicators and a reasonable number of isolation beds by multiple regression analysis. The t-tests showed that the independent variables incubation period, population mobility rate, hospital response time, and
healing time significantly affected the reasonable number of isolation beds (p < . ) (Table). We finally obtained the following regression equation (R² = . ):

N = P × (− . + . I + . M + . T1 + . T2) × (1 + V)

where N is the reasonable number of isolation beds; P is the population, i.e. the total population of the corresponding area; I is the incubation period of the epidemic; M is the population mobility rate, i.e. the proportion of people in motion to the total population; T1 is the hospital response time, i.e. the time from a patient's first symptom until a clear diagnosis is obtained; and T2 is the healing time, i.e. the average time from admission to discharge. T1 and T2 can be estimated from a certain number of cases. Because some isolation beds should be held in reserve to cope with emergencies, V denotes the reserve amount; based on the simulation results, this article suggests reserving %.

Table . Multiple linear regression analysis of the reasonable number of isolation beds.

The prediction method was applied to an example (Figure): the predicted numbers of isolation beds were broadly consistent with the model results, with the same parameter settings as Model .

According to the results above, a longer incubation period significantly increased the infectivity, scale, and duration of the epidemic.
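The regression equation N = P × (b0 + b1·I + b2·M + b3·T1 + b4·T2) × (1 + V) can be evaluated directly. In the sketch below the coefficient values are purely hypothetical placeholders (the actual fitted coefficients are elided in the extracted text), and the function name is our own.

```python
def isolation_beds(population, incubation, mobility, response_time, healing_time,
                   coeffs=(-0.5, 0.02, 0.4, 0.03, 0.01), reserve=0.1):
    """Evaluate N = P * (b0 + b1*I + b2*M + b3*T1 + b4*T2) * (1 + V).

    The default coefficients and the 10% reserve are illustrative
    placeholders only, not the paper's fitted values.
    """
    b0, b1, b2, b3, b4 = coeffs
    demand = population * (b0 + b1 * incubation + b2 * mobility
                           + b3 * response_time + b4 * healing_time)
    # A negative fitted demand is clamped to zero before adding the reserve.
    return max(0.0, demand) * (1 + reserve)
```

The reserve V enters multiplicatively, so a 10% reserve inflates the fitted demand by a factor of 1.1, matching the form of the equation above.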
By tracking this phenomenon, we found that patients who were in the incubation period and not subject to mobility restrictions caused a high level of transmission before displaying symptoms, through contact with others, which is similar to the views of Li P, Jiang X, et al. Early detection of suspected cases is critical to containing an outbreak. Some studies have described a longer incubation period as beneficial for epidemic control, since it gives the Centers for Disease Control and Prevention (CDC) more time to deal with the overall epidemic. That conclusion may apply to some known diseases, but for unknown diseases we believe a longer incubation period is a more dangerous signal that can make the development of the epidemic uncontrollable: because the scale of the outbreak becomes less predictable, patients are harder to track, and the disease may spread to a wider range of people and become difficult to control.

Our results showed that increasing the hospital response speed reduces the infectivity and scale of an outbreak, which is consistent with previous research. Shortening the hospital response time depends on the public's awareness of epidemic prevention and on the level of medical technology: on the one hand, the public needs to pay attention to the epidemic and cooperate actively with early detection; on the other hand, medical technology determines how long a detection method takes to give an accurate result. In addition, if sufficient isolation facilities are available, centralised isolation of all suspected patients who cannot be ruled out can also help reduce the hospital response time.

Compared with the hospital response time, the impact of the population mobility rate on the duration of the epidemic is not significant when medical resources are abundant. However, unrestricted population mobility can impose a large medical load, consume medical supplies, and generate a large number of infected cases, resulting in adverse socioeconomic impacts. In reality, medical resources are not only limited but often lacking, so population mobility cannot be considered to have no effect on the duration of the outbreak; the detailed consequences of inadequate medical resources are discussed below. In the extreme case in which all activity stops, i.e. a mobility rate of %, the overall scale of the epidemic is dramatically reduced and the date of the epidemic peak advances sharply. We speculate that under such an extreme prevention and control measure the disease saturates within a small area and transmission is completely blocked; since there is no new generation of infections, a different transmission law is displayed.

We summarised the factors affecting the demanded number of isolation beds and used multivariate regression analysis to estimate a reasonable number:

N = P × (− . + . I + . M + . T1 + . T2) × (1 + V)

The regression equation shows that the population mobility rate is the variable with the highest weight, indicating that restricting population mobility is the critical factor for containing outbreaks and effectively reducing the load on the medical system. We believe that reducing the epidemic scale by restricting population mobility can also help buy time for establishing temporary isolation facilities.
In practice, the incubation period (I) can be estimated from the time between a traceable harmful exposure and the first symptom. The population mobility rate (M) can be roughly estimated as the ratio of the population with unrestricted mobility, including medical and administrative personnel, to the total population. The hospital response time (T1) and healing time (T2) can be estimated from a certain number of cases. In addition, we recommend V = % as a reserve to address emergency situations under actual conditions. These indicators are easy to obtain and estimate, which makes it feasible to use this method to estimate a reasonable number of isolation facilities. More importantly, estimating the number of isolation facilities from the epidemic situation and the relevant parameters of the medical system helps predict the pressure on the medical system in different areas in advance, providing decision-making support for the rational arrangement of medical resources and epidemic control.

Method

In this discrete simulation model, we use the Java language for object-oriented programming. The eight states of people are defined as normal, shadow, supershadow, suspected, confirmed, isolated, cured, and dead.
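The eight-state scheme might be sketched as follows. This is a hypothetical rendering for illustration only (the paper's implementation is in Java; Python is used here for brevity), and the truncation of sampled durations at a one-day minimum is our own assumption.

```python
import random
from dataclasses import dataclass
from enum import Enum, auto

class State(Enum):
    """The eight states a simulated person can be in."""
    NORMAL = auto()
    SHADOW = auto()        # infected, still incubating
    SUPERSHADOW = auto()
    SUSPECTED = auto()
    CONFIRMED = auto()
    ISOLATED = auto()
    CURED = auto()
    DEAD = auto()

def sampled_days(mean, sd, rng, minimum=1.0):
    """Durations (incubation, healing, dead time, ...) are drawn from a normal
    distribution; truncating below at `minimum` days is our own assumption."""
    return max(minimum, rng.gauss(mean, sd))

@dataclass
class Person:
    """Each individual carries its state plus independently sampled durations."""
    state: State = State.NORMAL
    incubation_days: float = 0.0
    healing_days: float = 0.0
```

A per-person record of sampled durations is what lets the program, as described below, compute statistics for every simulated individual at every moment.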
The times of being infected, suspected, confirmed, isolated, cured, and dying are configured as independent attributes so that the simulated process mirrors the real-world process; the program can therefore produce every attribute at each moment and record, analyse, and compute statistics for each simulated individual. In the model, the length of the incubation period, the time from being a suspected case to being diagnosed, the length of isolation, the population mobility rate, the probability of infection, and the probability of death after infection can be adjusted or fitted as the necessary simulation parameters. Most of the time, parameters follow a normal distribution whose mean and standard deviation are set as parameters; probabilities follow a random-number model and are set by the probability value. The state-transition logic of a simulated individual is shown in the figure.

c. There is a fixed response-time interval between the onset of a patient's symptoms and the moment of hospital diagnosis; the hospital always has enough resources to make a diagnosis once the response time is reached. d. When exposed to an infected person within a dangerous distance, the probability of infection, and of death after infection, is constant. e.
After being cured, a patient is discharged by the hospital and released from isolation at once. f. Unless otherwise specified, these characteristics do not change over time during disease transmission. g. The simulation model ends when all patients have been discharged. h. A patient cannot be contagious, or infected again, after being cured.

Population mobility rate: the percentage of the population willing to move.
Healing time: the mean time between entering isolation and being discharged.
Incubation period: the time from infection to self-detection of suspected symptoms.
Fatality rate: the probability of death after diagnosis.
Dead time: the mean time from diagnosis to death.
Hospital response time: the time from a patient's suspected symptoms to a definitive diagnosis.
Transmission rate: the probability of being infected by contact with an infected person within an unsafe distance.

To prevent deviations in the simulation process, each set of parameters was run repeatedly; after removing outliers, the mean was used to draw the curves and for analysis.

Model . This model examines the influence of the disease incubation period on epidemic infection and the occupation of medical resources. Here we assume the hospital's isolation capacity is strong enough to admit all patients. To isolate a single variable, the following parameters remained the same during the simulation: total population = , number of initially infected persons = , population mobility rate = , healing time = days (standard deviation = days), hospital response time = day, fatality rate = . , and dead time = days (standard deviation = days). The experimental groups set the mean incubation period at , , , and days.

Model . This model examines the impact of the hospital response time on epidemic infection and the occupation of medical resources; again, the hospital's isolation capacity is strong enough to admit all patients.
To isolate a single variable, the following parameters remained the same during the simulation: total population = , number of initially infected persons = , incubation period = days (standard deviation = days), population mobility rate = , healing time = days (standard deviation = days), fatality rate = . , and dead time = days (standard deviation = days). The experimental groups set the hospital response time at , , , and days.

Model . This model examines the influence of the hospital healing (cure) time on epidemic infection and the occupation of medical resources; the hospital's isolation capacity is strong enough to admit all patients. To isolate a single variable, the following parameters remained the same during the simulation: total population = , number of initially infected persons = , incubation period = days (standard deviation = days), population mobility rate = , hospital response time = day, fatality rate = . , and dead time = days (standard deviation = days). The experimental groups set the hospital healing time at , , , and days.

Model . This model examines the influence of the population mobility rate on epidemic infection and the occupation of medical resources; the hospital's isolation capacity is strong enough to admit all patients. To isolate a single variable, the following parameters remained the same during the simulation: total population = , number of initially infected persons = , incubation period = days (standard deviation = days), hospital response time = day, healing time = days (standard deviation = days), fatality rate = . , and dead time = days (standard deviation = days).
The experimental groups set the population flow rate at , %, %, %, and %.

Model . This model examines the influence of the hospital isolation capacity on epidemic infection. To isolate a single variable, the following parameters remained the same during the simulation: total population = , number of initially infected persons = , incubation period = days (standard deviation = days), population mobility rate = , hospital response time = day, healing time = days (standard deviation = days), fatality rate = . , and dead time = days (standard deviation = days). The experimental groups set the hospital isolation capacity at %, %, %, and %.

Multiple linear regression (MLR) analysis was used to evaluate the impact of the simulation parameters on the dependent variable, the demanded number of isolation beds; the prediction method was then applied to an example under different healing times to test its accuracy. Continuous variables are compared using their averages. Data were analysed with SPSS v (IBM authorized Central South University to use it). Analysis of variance (ANOVA) was used to assess the significance of differences between groups. When variances were homogeneous, the least significant difference (LSD) method was used for multiple comparisons between any two groups; when variances were not homogeneous, Tamhane's T tests were used to compare means between any two groups. The Grubbs method was used to address outliers. For all statistical analyses, the test level was α = . .
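The statistical pipeline (one-way ANOVA across simulation groups, Grubbs screening for outliers) was run in SPSS; a rough standard-library-only re-implementation might look like the sketch below. This is our own illustration, not the authors' code, and the critical values in the table are approximate two-sided 5% Grubbs values for small samples.

```python
import math

# Approximate two-sided Grubbs critical values at alpha = 0.05 for n = 3..10.
GRUBBS_CRIT_05 = {3: 1.155, 4: 1.481, 5: 1.715, 6: 1.887,
                  7: 2.020, 8: 2.127, 9: 2.215, 10: 2.290}

def grubbs_outlier_index(x, crit_table=GRUBBS_CRIT_05):
    """Return the index of the most extreme value if it is a significant
    outlier under Grubbs' test, otherwise None."""
    n = len(x)
    if n not in crit_table:
        return None
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    if sd == 0:
        return None
    devs = [abs(v - mean) for v in x]
    g = max(devs) / sd
    return devs.index(max(devs)) if g > crit_table[n] else None

def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))
```

Repeated simulation runs per parameter set would first be screened with `grubbs_outlier_index`, then the cleaned group means compared with `anova_f` (the F statistic is referred to an F distribution with k−1 and n−k degrees of freedom to obtain the p value).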
R² represents the goodness of fit of the multiple regression, measuring how well the estimated model fits the observed values. The regression analysis table lists the significance tests of the independent variables (t-tests) and their p values, indicating whether each independent variable has a significant influence on the dependent variable.

The code for the current study is available in the GitHub repository https://github.com/coolleafly/cov_sim/. The average data generated by the simulations for each model are included in this published article (and its Supplementary Information files); other data are available from the corresponding author upon reasonable request.

Figure legends. The eight states of people were defined as normal, shadow, supershadow, suspected, confirmed, isolated, cured, and dead; the times of being infected, suspected, confirmed, isolated, cured, and dying were configured as independent attributes. The length of the incubation period, the time from being a suspected case to being diagnosed, the length of isolation, the population mobility rate, the probability of infection, and the probability of death after infection can be adjusted or fitted as simulation parameters. The data for each curve are the average of the simulations. In this model, the hospital's treatment capacity had no effect on the numbers of cumulative and newly confirmed cases,
but the maximum number of inpatients increased, and its peak date was delayed, as the healing time lengthened.

Panels (a) to (d) respectively show the time-varying curves of the cumulative number of confirmed cases, the number of newly confirmed cases, the number of inpatient cases, and the daily effective reproduction number under different population mobility rates; the data for each curve are the average of the simulations. Restrictions on population mobility lowered the peak numbers of cumulative confirmed cases, newly confirmed cases, and inpatient cases. The population mobility rate had no effect on the value of Rt, although a different transmission law is displayed in the extreme case.

A further figure shows the daily effective reproduction number under different isolation capacities; the data for each curve are the average of the simulations. The peak numbers of cumulative confirmed cases and inpatient cases increased as isolation capacity decreased; in addition, lower isolation capacity delayed the date of the peak inpatient number and lengthened the duration for which isolation facilities ran at full capacity. Isolation capacity had no effect on the newly confirmed cases or on the value of Rt.
References:
- Epidemiological and clinical characteristics of cases of novel coronavirus pneumonia in Wuhan, China: a descriptive study
- SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor
- A simple prediction model for the development trend of 2019-nCoV epidemics based on medical observations
- Challenges of treating adenovirus infection: application of a deployable rapid-assembly shelter hospital
- Novel coronavirus, poor quarantine, and the risk of pandemic
- The response of Milan's emergency medical system to the COVID-19 outbreak in Italy
- COVID-19 and Italy: what next? The Lancet
- COVID-19: the current situation in Afghanistan. The Lancet Global Health
- COVID-19 pandemic in West Africa. The Lancet Global Health
- COVID-19: towards controlling of a pandemic
- Transmission of COVID-19 in the terminal stage of the incubation period: a familial cluster
- Identifying locations with possible undetected imported severe acute respiratory syndrome coronavirus cases by using importation predictions
- Incubation periods impact the spatial predictability of cholera and Ebola outbreaks in Sierra Leone
- The incubation period of coronavirus disease (COVID-19) from publicly reported confirmed cases: estimation and application
- Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
- Improved molecular diagnosis of COVID-19 by the novel, highly sensitive and specific COVID-19-RdRp/Hel real-time reverse transcription-polymerase chain reaction assay validated in vitro and with clinical specimens
- Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts. The Lancet
- Fangcang shelter hospitals: a novel concept for responding to public health emergencies
- An investigation of transmission control measures during the first days of the COVID-19 epidemic in China
- The effect of human mobility and control measures on the COVID-19 epidemic in China
- Spatio-temporal patterns of the 2019-nCoV epidemic at the county level in Hubei Province

Acknowledgements. We thank Bruce Yong for providing a prototype of the program: we referred to his open-source virus broadcast simulation project and used it as the prototype of our simulation design, and with his consent we redesigned some programs according to the epidemic situation. XL and YL conceptualized the study, analyzed the data, and drafted and finalized the manuscript. YC and LX contributed to the concept, draft, and finalization of the paper. YD, GH, and YL read drafts and provided input. All authors approved the final version of the manuscript. We declare no competing interests.

key: cord- -g bsul u
authors: voinson, marina; alvergne, alexandra; billiard, sylvain; smadi, charline
title: stochastic dynamics of an epidemic with recurrent spillovers from an endemic reservoir
journal: journal of theoretical biology
doi: . /j.jtbi. . .
cord_uid: g bsul u

Abstract. Most emerging human infectious diseases have an animal origin. While zoonotic diseases originate from a reservoir, most theoretical studies have principally focused on single-host processes, either exclusively humans or exclusively animals, without considering the importance of animal-to-human (i.e. spillover) transmission for understanding the dynamics of emerging infectious diseases. Here we aim to investigate the importance of spillover transmission for explaining the number and the size of outbreaks. We propose a simple continuous-time stochastic susceptible-infected-recovered model with recurrent infection of an incidental host from a reservoir (e.g.
humans infected by a zoonotic species), considering two modes of transmission: (1) animal-to-human and (2) human-to-human. The model assumes that (i) epidemiological processes are faster than other processes such as demography or pathogen evolution, and that (ii) an epidemic occurs until there are no susceptible individuals left. The results show that during an epidemic, even when the pathogen is barely contagious, multiple outbreaks are observed due to spillover transmission. Overall, the findings demonstrate that considering only direct transmission between individuals is not sufficient to explain the dynamics of zoonotic pathogens in an incidental host.

Recent decades have seen a surge of emerging infectious diseases (EIDs), with up to forty new diseases recorded over that period (Jones et al.). Sixty percent of emerging human infectious diseases are zoonotic, i.e. caused by pathogens of animal origin (Jones et al.; Taylor et al.). The World Health Organization defines zoonotic pathogens as "pathogens that are naturally transmitted to humans via vertebrate animals". The epidemics caused by EIDs disturb the societal and economic equilibria of countries by causing unexpected deaths and illnesses, increasing the need for health-care infrastructure, and interfering with travel (Morens and Fauci). Moreover, the risk of EIDs being transmitted to humans from wildlife is increasing because of the recent growth and geographic expansion of human populations, climate change, and deforestation, which all increase the number of contacts between humans and potential new pathogens (Jones et al.; Keesing et al.; Murray and Daszak). Given that most EIDs have an animal origin, it is crucially important to understand how infections spread from animal to human populations, i.e. by spillover transmission.
There is abundant empirical evidence that the epidemiological dynamics of infectious diseases depends strongly on transmission from the reservoir (the reservoir will be defined following Ashford's definition, i.e. the pathogen persists in the environment of the incidental host; see the table below for details). The start of an outbreak is promoted by a primary contact between the reservoir and the incidental host (i.e. the host that becomes infected but is not part of the reservoir), potentially leading to transmission of the infection to the host population. Moreover, multiple outbreaks are commonly observed during an epidemic of a zoonotic pathogen in human populations, for instance during the Nipah virus epidemic (Luby et al.). With regard to the Ebola virus, some twenty outbreaks have been recorded since the discovery of the virus (de la Vega et al.). This number of outbreaks undoubtedly underestimates the total number of emergences, because not all emergences necessarily lead to the spread of the infection from an animal reservoir into the host population (Jezek et al.).

Table . Definitions of a reservoir from the literature. The term reservoir is mostly used as defined by the Centers for Disease Control and Prevention (CDC). Two other definitions have been proposed to clarify and complete the notion for zoonotic pathogens: Haydon et al. define the reservoir from a practical point of view, so as to take into account all hosts epidemiologically connected to the host of interest (the target host) and to implement better control strategies, whereas Ashford establishes a more generalizable definition in which, for a given pathogen, there is a single reservoir.

- CDC: "any animal, person, plant, soil, substance or combination of any of these in which the infectious agent normally lives"
- Haydon et al.: "all hosts, incidental or not, that contribute to the transmission to the target host (i.e. the population of interest), in which the pathogen can be permanently maintained"
- Ashford: "an ecologic system in which an infectious agent survives indefinitely"

While the reservoir plays an important role in causing the emergence of outbreaks, the role of spillover transmission in the epidemiological dynamics of the incidental host is rarely discussed. Empirically, it is generally difficult to distinguish direct transmission from transmission from the reservoir. Only for non-communicable diseases is it easy to measure the importance of recurrent transmission from the reservoir, since all infected individuals originate from contact with the reservoir: for instance, for the H N virus, for which most human cases are due to contact with infected poultry, many spillover events were listed during the epidemic (Zhou et al.). For pathogens able to propagate from one individual to another, the origin of the infection can be established from patterns of contact during the incubation period (Chowell et al.; Luby et al.): most often, if an infected individual has been in contact with another infected individual in the recent past, direct transmission is considered the likeliest origin of the infection. However, both individuals might have shared the same environment and been independently infected by the reservoir, which leads to overestimating the proportion of cases resulting from person-to-person transmission. Moreover, when the pathogen infects an individual who produces no secondary cases, the emergence is unlikely to be detected. Pathogen spillover is thus often neglected in epidemiological theoretical models.
It is generally assumed that the epidemiological dynamics of outbreaks is driven by the ability of the pathogen to propagate within hosts. For instance, Wolfe et al. proposed a classification scheme for pathogens with five evolutionary stages, ranging from an exclusively animal infection (stage I) to an exclusively human infection (stage V) (Fig.). The intermediate stages (II-IV) are those of zoonotic pathogens. Lloyd-Smith et al. proposed to enhance the scheme by distinguishing stages II-IV by the ability of the pathogen to propagate between individuals of the incidental host, i.e. as a function of the basic reproduction ratio R0: non-contagious pathogens (R0 = 0, stage II), barely contagious pathogens inducing stuttering chains of transmission (0 < R0 < 1, stage III), and contagious pathogens inducing large outbreaks (R0 > 1, stage IV). However, the role of the reservoir is not clearly defined in this scheme, and spillover effects on the epidemiological dynamics are not discussed.

Only a few models have investigated the dynamics of EIDs while explicitly taking into account transmission from the reservoir to the incidental host. Lloyd-Smith et al. analysed modelling studies of zoonotic pathogens and concluded that models incorporating spillover transmission are dismayingly rare. More recent models have investigated the dynamics of EIDs using multi-host processes while disregarding the persistence of the pathogen in the reservoir (Singh et al.), or have focused on the dynamics and conditions of persistence of the pathogen between two populations (Fenton and Pedersen). Models that do consider an endemic reservoir are disease-specific and do not generate generalizable dynamics (Chowell et al.; Nieddu et al.).
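For a stage III pathogen (0 < R0 < 1), standard branching-process theory predicts that one spillover seeds a finite "stuttering chain" whose expected total size is 1/(1 − R0). The Monte Carlo sketch below is our own illustration of that result, assuming Poisson-distributed secondary cases; it is not the paper's model.

```python
import math
import random

def chain_size(r0, rng, max_size=100_000):
    """Total cases in one transmission chain seeded by a single spillover,
    under a Galton-Watson branching process with Poisson(r0) secondary
    cases per infection (an illustrative caricature)."""
    size, active = 1, 1
    while active > 0 and size < max_size:
        offspring = 0
        for _ in range(active):
            # Poisson draw via Knuth's multiplication method (fine for small r0).
            limit, k, p = math.exp(-r0), 0, 1.0
            while True:
                p *= rng.random()
                if p <= limit:
                    break
                k += 1
            offspring += k
        size += offspring
        active = offspring
    return size

rng = random.Random(42)
sizes = [chain_size(0.5, rng) for _ in range(20_000)]
mean_size = sum(sizes) / len(sizes)  # theory: 1 / (1 - r0) = 2 for r0 = 0.5
```

Because each chain dies out on its own when R0 < 1, sustained epidemics in this regime can only come from repeated spillovers, which is precisely the gap in reservoir-free models discussed above.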
more recently, singh and myers ( ) developed a susceptible-infected-recovered (sir) stochastic model coupled with a constant force of infection. the authors are mostly interested in the effect of population size on the size of an outbreak. however, this approach does not allow teasing apart the contribution of the incidental host transmission from that of the transmission from the reservoir in modulating the dynamics of zoonotic pathogens. in this paper, we aim to provide general insights into the dynamics of a zoonotic pathogen (i.e. pathogens classified in stages ii-iv) emerging from a reservoir and its ability to propagate in an incidental host. to do so, we developed a continuous time stochastic model that can dissociate the effect of between-host (i.e. direct) transmission from the effect of spillover (i.e. reservoir) transmission. 

fig. . representation of the classification scheme of pathogens proposed by wolfe et al. ( ). a pathogen may evolve from infecting only animals (stage i) to infecting only humans (stage v). each stage corresponds to a specific epidemiological dynamics in the incidental host. stage ii corresponds to few spillovers from animals (e.g. bats) to humans with no possible transmission between humans. stage iii corresponds to few stuttering chains of transmission between humans that go extinct (no outbreaks). stage iv corresponds to large outbreaks in the human population, but the pathogen cannot be maintained without the reservoir. 

a multi-host process with a reservoir and an incidental host is considered. the epidemiological processes are stochastic, which is particularly relevant in the case of transmission from the reservoir and more realistic because only a small number of individuals are expected to be infected at the beginning of an outbreak. the model makes a number of assumptions. first, the epidemiological processes are much faster than the demographic processes. 
second, the pathogen in the reservoir is considered endemic and might recurrently contaminate the incidental host. third, an individual cannot become susceptible again after having been infected. as a consequence, the total number of susceptible individuals in the incidental host decreases during the epidemic. this is what is expected for an epidemic spreading locally during a short period of time (at the scale of a few thousand individuals during weeks or months, depending on the disease and populations considered). we then harness the model to predict the effects of both spillover transmission and direct transmission on the number and the size of outbreaks. outbreaks occur when the number of cases of disease increases above the epidemiological threshold. in the case of non-emerging infectious diseases, an epidemiological threshold is used to gauge the start of outbreaks. for instance, for seasonal influenza the epidemiological threshold is calculated from the incidence of the disease during the previous years ( tay et al., ). in the case of emerging infectious diseases, no incidence is normally expected in the population, so the outbreak can be considered to spread from a small number of infected individuals onwards. we show that, regarding the epidemiological dynamics, the recurrent emergence of the pathogen from the reservoir in the incidental host is as important as the transmission between individuals of the incidental host. we conclude by discussing the implications of these results for the classification of pathogens proposed by lloyd-smith et al. ( ). a continuous time stochastic susceptible-infected-recovered (sir) compartmental transmission model ( kermack and mckendrick, ) with recurrent introduction of the infection into an incidental host by a reservoir is considered ( fig. ). our goal here is not to study a particular disease but to provide general insights into the effect of the reservoir on the epidemiological dynamics in the incidental host. 
the infection is assumed to propagate quickly relative to other processes such as pathogen evolution and demographic processes. the reservoir is defined as a compartment where the pathogen is persistently maintained; the pathogen is thus considered endemic. the population is fully mixed. an individual can be infected through two routes of transmission: spillover transmission from the reservoir and direct contact between individuals. we neglect the possibility of reverse infection from the incidental host to the reservoir. the incidental host is composed of n individuals. the infection can spill over by contact between the reservoir and the incidental host at rate τ s, where s is the number of susceptible individuals and τ is the rate at which an individual becomes infected from the reservoir. in the incidental host, the infection can propagate by direct contact at rate βsi, where i is the number of infected individuals and β is the individual rate of infection transmission. an infected individual recovers at rate γ. the propensity of the pathogen to be transmitted between individuals within the host population is expressed in terms of the basic reproductive ratio of the pathogen, r, which is widely used in epidemiology. r corresponds to the average number of secondary infections produced by an infected individual in an otherwise susceptible population. in a deterministic model without a reservoir, for a pathogen to invade the population, r must be larger than 1. in a stochastic model, the higher the r, the higher the probability for the pathogen to invade the population. in a sir model, the basic reproductive ratio r equals βn / γ. individuals in the recovered compartment do not contribute anymore to the transmission process. 
since we assume that demographic processes are slower than epidemic processes, the number of susceptible individuals decreases during the epidemic due to the consumption of susceptibles by the infection, until the extinction of the susceptible population. in other words, in our model, r will decrease because of the successive spillovers from the reservoir. we expect this to occur especially at short space and time scales (a local population during the course of weeks or months). to analyse the dynamics in the incidental host, three statistics will be studied: (i) the mean number of outbreaks, (ii) the mean size of the recurrent outbreaks during an epidemic and (iii) the mean size of the largest outbreak occurring during an epidemic. we consider that an outbreak appears when the incidence of the infection exceeds the threshold c, and define the maximum size of an outbreak as the largest number of infected individuals during the largest outbreak. stochastic simulations. the epidemiological dynamics described previously can be simulated with the following algorithm (simulations were run in c++). the population state is assumed to be known at time t. a total event rate λ, depending only on the state of the population at time t, is calculated for each iteration. a) the total event rate of the continuous time stochastic sir model is given by λ = βsi + τ s + γ i. b) the next event time is t = t + δ, where δ is exponentially distributed with parameter λ. c) the next event to occur is randomly chosen: direct transmission, spillover transmission or recovery with respective probabilities βsi / λ, τ s / λ and γ i / λ. we performed stochastic individual-based simulations of the epidemics with spillover transmission, using rates as presented in fig. . the incidental host is initially ( t = 0 ) composed of susceptible individuals only ( n = s = ). the infection is considered endemic in the reservoir. simulations are stopped when there are no susceptible individuals anymore. 
an outbreak begins when the number of infected individuals reaches the epidemiological threshold c ( c = infected individuals in the simulations) and ends when there are no infected individuals anymore ( i = 0 ). stochastic simulations were run for values of the basic reproductive ratio ( r ) ranging from to and of the spillover transmission ( τ ) ranging from − to − ; , simulations were performed for each parameter set. all other parameter values are detailed in table . approximation by a branching process. at the beginning of the infectious process, the epidemiological model with recurrent introduction of the infection into an incidental host by a reservoir can be approximated by a branching process with immigration from the reservoir to the incidental host (thus assuming that individual "birth" and "death" rates of infected individuals are constant during the starting phase of an outbreak). the individual birth and death rates are respectively βn, the transmission rate, and γ, the recovery rate. the immigration rate corresponds to the spillover rate, τ n, at the beginning of the infection. in other words, we assume that the number of susceptibles is n to study the beginning of the infection, which is a good approximation as long as few individuals have been infected. we distinguish between two regimes in the incidental host: the subcritical regime when r < 1 and the supercritical regime when r > 1. we suppose that at time t = 0 a single individual is infected by the spillover transmission. as illustrated in fig. , three patterns are observed, among them (i) a stuttering chain of transmission that goes extinct, i.e. the infection spreads inefficiently (corresponding to stage ii in wolfe's classification). the regions observed are wider for a higher threshold value (compare fig. a and b), whereas the stage iii region is narrower. when the direct and the spillover transmissions are low, it is more difficult for the infection to reach a higher threshold. thus, there are more stuttering chains of transmission. 
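the simulation algorithm described above (total event rate, exponential waiting time, random choice of the next event) can be sketched as follows. the paper's simulations were run in c++; this is a minimal python sketch, and the default parameter values are illustrative placeholders, not those of the paper.

```python
import random

def simulate_outbreaks(n=500, beta=0.001, gamma=1.0, tau=1e-3, c=5, seed=42):
    """gillespie-style simulation of the stochastic sir model with spillover.
    events and rates: direct transmission (beta*s*i), spillover (tau*s),
    recovery (gamma*i).  counts the outbreaks, i.e. excursions of the
    infected count i that reach the epidemiological threshold c before
    returning to i = 0.  stops when no susceptibles remain and the last
    infected individuals have recovered."""
    rng = random.Random(seed)
    s, i = n, 0
    t = 0.0
    outbreaks, above_threshold = 0, False
    while s > 0 or i > 0:
        rate_direct, rate_spill, rate_recover = beta * s * i, tau * s, gamma * i
        total = rate_direct + rate_spill + rate_recover
        if total == 0.0:                     # no susceptibles, no infected: over
            break
        t += rng.expovariate(total)          # b) exponential waiting time
        u = rng.uniform(0.0, total)          # c) choose the next event
        if u < rate_direct + rate_spill:     # infection (direct or spillover)
            s -= 1
            i += 1
        else:                                # recovery
            i -= 1
        if i >= c and not above_threshold:   # a new excursion reaches c
            outbreaks += 1
            above_threshold = True
        elif i == 0:                         # the excursion ends
            above_threshold = False
    return outbreaks
```

note that with τ = 0 and no initially infected individual no event can ever occur, so the function returns zero outbreaks; with τ > 0 every susceptible is eventually infected, either directly or from the reservoir, matching the stopping rule of the simulations.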
in the same way, when the direct or the spillover transmission is high, a large outbreak is observed, then some stuttering chains of transmission occur but do not reach the epidemiological threshold. for both values of the epidemiological threshold ( fig. (a) and (b)), a "bulb" is observed in stage iii where the direct transmission is high. after the occurrence of a large outbreak, the susceptible population becomes small. hence the next excursion is very unlikely to reach the epidemiological threshold. however, a high enough spillover transmission rate is able to counterbalance the small effective r and to produce other outbreaks after the large one. the "bulb" is less pronounced in the case of a higher threshold ( fig. (b) ) because the susceptible population consumed during the large outbreak is large, preventing the next excursions from reaching a high epidemiological threshold. we aim at approximating the mean number of outbreaks in the case where the spillover transmission rate τ and the reproductive number r are small (subcritical case, corresponding to r < 1). the method of approximation is the following: let us denote by s i the number of susceptible individuals at the beginning of the i th excursion. during the i th excursion, we set this number of susceptibles to its initial value s i, and consider that the rate of new infections is βis i. we thus obtain a branching process with individual birth (infection) rate βs i and individual death (recovery) rate γ. 
when there are no more infected individuals, we compute the mean number of recovered individuals produced by this branching process excursion, denoted by e [ k(s i , β, γ )], and make the approximation that s i+1 = s i − e [ k(s i , β, γ )], where e [ k(s i , β, γ )] can be computed explicitly (see ( )). in other words, the initial number of susceptible individuals for the ( i + 1)th excursion is the initial number of susceptible individuals for the i th excursion minus the mean number of recovered individuals produced during the i th excursion under our branching process approximation. we repeat the procedure for the ( i + 1)th excursion, and so on, until k satisfies s k > 0 and s k + 1 ≤ 0 (no susceptibles anymore). in order to be considered an outbreak, an excursion has to exceed c individuals, where we recall that c is the epidemiological threshold. under our branching process approximation, the probability for the i th excursion to reach the epidemiological threshold can be computed (see appendix a ). as a consequence, we obtain our approximation of the mean number of outbreaks, where s 0 = n, and the s i 's are computed as described in ( ). the mean number of outbreaks computed with the branching process is a good approximation compared to numerical simulations for a small spillover transmission ( − ≤ τ ≤ − ) ( fig. ). the spillover transmission added in our model introduces the infection recurrently and allows the infection to spread even for a barely contagious pathogen ( r < 1). according to fig. , when r < 1 the number of outbreaks increases when the direct transmission between individuals increases. indeed, the higher the direct transmission, the higher the probability for the excursions to reach the epidemiological threshold ( c ). the number of outbreaks can be high because when r < 1 the infection spreads inefficiently and does not consume a large number of susceptibles, allowing the next excursions to exceed the epidemiological threshold. fig. 
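the explicit formulas referenced above are elided in this version of the text, so the following numerical sketch substitutes the standard branching-process results they correspond to: for a subcritical linear birth-death process with birth rate βs_i and death rate γ started from one infected individual, the mean total progeny is 1/(1 − βs_i/γ) and the probability of reaching the threshold c before extinction is the gambler's-ruin probability (1 − θ)/(1 − θ^c) with θ = γ/(βs_i). both closed forms are assumptions of this sketch, not quotations of the paper's appendix.

```python
def approx_mean_outbreaks(n=1000, beta=0.0005, gamma=1.0, c=10):
    """iterate the recursion s_{i+1} = s_i - e[k(s_i, beta, gamma)] and sum
    the per-excursion probabilities of reaching the threshold c, giving an
    approximation of the mean number of outbreaks in the subcritical case."""
    s = float(n)
    mean_outbreaks = 0.0
    while s > 0.0:
        r_eff = beta * s / gamma        # effective reproductive ratio
        if r_eff >= 1.0:
            raise ValueError("subcritical approximation needs beta*s/gamma < 1")
        theta = 1.0 / r_eff             # death/birth ratio, > 1 in this regime
        if theta < 1e12:
            # gambler's-ruin probability of hitting c before 0, from 1 infected
            mean_outbreaks += (1.0 - theta) / (1.0 - theta ** c)
        # for larger theta the hitting probability is numerically zero
        s -= 1.0 / (1.0 - r_eff)        # assumed mean excursion size e[k]
    return mean_outbreaks
```

consistent with the text, this approximation increases with the direct transmission as long as r stays below 1.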
shows that the average number of outbreaks is a non-monotonic function of the direct transmission ( r ) and the spillover transmission ( τ ). more precisely, fig. (b) shows that for intermediate and low values of spillover transmission ( − ≤ τ ≤ − ), the average number of outbreaks increases until r ∼ and then decreases until it reaches one outbreak when the direct transmission is high ( r > . ). moreover, we observed an increasing number of outbreaks with τ when the pathogen is barely contagious. by contrast, in fig. (a), we show that the average number of outbreaks decreases when τ becomes large ( τ . − ). the supercritical case ( r > 1 ) is now considered, and the spillover transmission rate ( τ ) is still supposed small. 

fig. . average number of outbreaks computed by numerical simulations and by the branching process approximation, respectively. the approximated average number of outbreaks is evaluated when the spillover transmission τ is small. for the numerical simulations, τ = − has been chosen. there is a break in the dotted curve (branching process) because our approximation is not valid in the critical regime (when r is close to 1). 

in this case, two different types of excursions occur in the incidental host: (i) a large outbreak which consumes, with a probability close to one, a large proportion of susceptible individuals and (ii) multiple excursions before and after the large outbreak, each of which consumes few susceptible individuals. we let o before ( n, β, γ ) and o after ( n, β, γ ) denote the number of outbreaks occurring respectively before and after the large outbreak. because r > 1, the probability to have one large outbreak is close to one. hence we make the approximation that exactly one large outbreak occurs during the epidemic, so that the total number of outbreaks is o total = o before + 1 + o after. to be part of the outbreaks occurring before the large one, an excursion has to satisfy two conditions: (i) to have a size higher than the epidemiological threshold c, and (ii) to be of a size not too large, otherwise it would correspond to the large outbreak. 
to be more precise, this condition corresponds to the fact that the supercritical branching process used to approximate this excursion does not go to infinity. as a consequence, o before ( n, β, γ ) can be approximated explicitly (see appendix b ). to approximate the number of outbreaks after the large outbreak ( o after ( n, β, γ ) ), we need to know how many susceptible individuals remain in the population. the number of susceptibles consumed before the large outbreak is negligible with respect to the number of susceptibles consumed during the large outbreak. hence we can consider that the initial state of the large outbreak is n susceptibles, one infected individual and no recovered individual. the final size of the large outbreak then satisfies an implicit equation which has one trivial solution ( s = n) and a non-trivial solution with no explicit expression, denoted n after ( n, β, γ ). after the large outbreak, the reproductive ratio for the next excursions, denoted r after, is subcritical ( r after < 1) (see appendix b ), and the number of outbreaks after the large one, denoted o after, can be approximated using eqs. ( ) - ( ). the branching process approximations of the mean number of outbreaks in the supercritical regime, depicted in fig. , are close to the mean number of outbreaks computed by numerical simulations when the recurrent infection from the reservoir is small. the number of outbreaks decreases when the pathogen becomes highly contagious, reaching one outbreak when r > . . when the infection is introduced in the incidental host by the spillover transmission, the probability to reach the epidemiological threshold depends on the direct transmission between individuals. when the direct transmission increases, the infection spreads more efficiently, consuming a large number of susceptible individuals, allowing few or no other excursions to reach the epidemiological threshold and producing only one outbreak when r > . . 
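the implicit final-size equation is not reproduced in this version of the text. assuming the classical sir final-size relation s = n·exp(−r(n − s)/n), which indeed has the trivial root s = n and a non-trivial root playing the role of n_after, the non-trivial root can be found by bisection; the specific form of the relation is an assumption of this sketch.

```python
import math

def n_after(n=1000, beta=0.002, gamma=1.0, tol=1e-9):
    """bisection for the non-trivial root of the assumed final-size relation
    f(s) = s - n*exp(-r*(n - s)/n) with r = beta*n/gamma > 1.
    f(0) < 0 and f(s) > 0 just below the trivial root s = n, and f changes
    sign exactly once on (0, n), at the non-trivial root."""
    r = beta * n / gamma
    if r <= 1.0:
        raise ValueError("a large outbreak requires r > 1")
    f = lambda s: s - n * math.exp(-r * (n - s) / n)
    lo, hi = 0.0, n * (1.0 - 1e-9)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

with the illustrative values above (r = 2), roughly 80% of the population is consumed by the large outbreak, so n_after is close to 0.2·n and the post-outbreak ratio β·n_after/γ ≈ 0.4 is indeed subcritical.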
we now focus on the effect of the spillover transmission on the number of outbreaks for a barely contagious pathogen ( r < 1). for the sake of simplicity, we exclude the cases very close to the critical case, that is to say, 1 − r is not too close to 0. because we consider the subcritical case ( r < 1), the excursions are small, and at the beginning of the epidemiological dynamics we make the approximation that the spillover transmission rate is constant, equal to τ n, and that the direct transmission rate is equal to βni. we thus consider a birth and death process with constant immigration rate τ n, individual birth rate βn and individual death rate γ. we are interested in the effect of the parameter τ on the mean number of outbreaks. in particular, we aim at estimating the value of τ maximising the mean number of outbreaks, denoted τ opt. a first quantity which will help give us an idea of the order of magnitude of the values of τ to be considered is the mean number of infected individuals at large times. this quantity, denoted m i, equals τ n / (γ (1 − r )) (see for instance eq. ( . ) in bailey, ). two cases are distinguished: either the mean number of infected individuals is much larger than c, or, on the contrary, the mean number of infected individuals is negligible with respect to c. let us first consider the first case ( eq. ( ) ), and choose α > 0 satisfying a suitable condition. then we can show (see appendix c. ) that the probability p c that a first infection by the reservoir gives rise to an outbreak (that is to say, that the number of infected individuals reaches c before 0) is bounded from below. moreover, we can show that if an excursion reaches the level c, it has a probability close to one to lead to a large outbreak consuming a large number of susceptible individuals. thus only a few stuttering chains of transmission will emerge. in fig. (a), when τ is large ( τ > − , thus τ n/ (cγ (1 − r )) ≥ 1), only one outbreak is observed because the large number of spillovers prevents the outbreak from dying out. 
let us now consider the second case ( eq. ( ) ). recall that in the case of emerging infectious diseases, the threshold c can be considered small. hence we may consider, without loss of generality, that ( ) holds. in this case, we can prove an explicit bound (see appendix c. ) showing that the probability for the number of infected individuals to reach the epidemiological threshold c is small under condition ( ). as a consequence, few outbreaks will occur. indeed, the successive spillovers by the reservoir will produce outbreaks with a small probability, but will nevertheless consume susceptible individuals, until there are no more susceptibles in the population. according to fig. (a), when a small spillover transmission ( τ < − ) and a small reproductive ratio ( r ≤ . ) are considered ( τ n/ (cγ (1 − r )) < . − ), the number of outbreaks is small. in the case of a slightly higher direct transmission rate ( r = . ), each spillover has a non-negligible probability to become an outbreak (more precisely . when c = ) and the number of outbreaks is higher. we thus predict that the number of outbreaks will tend to be large when the average size of an excursion is close to the epidemiological threshold ( m i ≈ c ). these observations allow us to give a first rough upper bound on the optimal value τ opt. indeed, if the mean number of infected individuals ( e [ i ]) is equal to c, the ratio var[ i ] / e [ i ] (where var[ i ] represents the variance of the number of infected individuals, see appendix c. ) belongs to [ . , ] when c = for the values of r considered, which is large. moreover, in this case the distribution of i is skewed to the right (see fig. c . ). this implies that the number of infected individuals will be larger than c a large fraction of the time, producing outbreaks which do not go extinct before a new infection by the reservoir, and thus producing few outbreaks. 
hence we may conclude that τ opt is smaller than the τ leading to a mean number of infected individuals equal to c; for instance, for the parameters considered in fig. (a), this gives an explicit upper bound on τ opt. let us now be more precise about the estimation of τ opt. to this aim, we will apply two results of the theory of branching processes with immigration. the first one, which can be found in bailey ( , eq. ( . )), describes the mean total number of infectious lines over the course of the infection, denoted by m. notice that m is necessarily larger than 1, as an infection from the reservoir is needed to generate the first infectious line. the second result is the mean number of infectious lines expected to be present at any time, that is to say, in the theory of branching processes, the number of distinct immigrants which have descendants alive at a given moment. for large times, this mean number ( m i ) equals τ n e [ t ] ( pardoux, ), where e [ t ] denotes the mean lifetime of a branching process without immigration, with individual birth rate βn, individual death rate γ, and initial state 1. this expression can be computed explicitly (see appendix c. . ). we will divide an excursion into m / m i blocks of m i simultaneous infectious lines (thus without immigration). the idea for such an estimation is the following: it is known that if a poisson process has k jumps during a time interval, the jumps are uniformly distributed over this time interval. as the infections by the reservoir follow approximately a poisson process with parameter τ n, and we know that in expectation m infections by the reservoir occur before all infected individuals are removed, we divide the epidemic into homogeneous blocks. we choose these blocks to have an initial number of m i infected individuals to allow the use of results on branching processes with immigration. 
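the order-of-magnitude reasoning above can be made concrete. the closed form m_i = τn/(γ(1 − r)) matches the ratio τn/(cγ(1 − r)) used in the text as m_i/c, and setting m_i = c yields the rough upper bound on τ_opt stated above; this small sketch only encodes that relation, with illustrative parameter values.

```python
def mean_infected(n, beta, gamma, tau):
    """mean number of infected individuals at large times, m_i, for the
    linear birth-death process with immigration (birth beta*n, death gamma,
    immigration tau*n), valid in the subcritical case r = beta*n/gamma < 1."""
    r = beta * n / gamma
    if r >= 1.0:
        raise ValueError("formula valid only for r < 1")
    return tau * n / (gamma * (1.0 - r))

def tau_upper_bound(n, beta, gamma, c):
    """the tau at which m_i equals the threshold c; the text argues that
    tau_opt is smaller than this value."""
    r = beta * n / gamma
    return c * gamma * (1.0 - r) / n
```

by construction, evaluating mean_infected at tau_upper_bound returns exactly c, so the bound is self-consistent.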
the initial number of infected individuals in each block is thus m i, and as a consequence the probability for the infection to reach the threshold c within a block can be computed (see appendix c. ). hence the probability for the whole excursion to reach the threshold c can be approximated accordingly. we want this probability to be not too close to 0, otherwise most susceptible individuals would be consumed without giving rise to an outbreak. we also want this probability to be not too close to 1. indeed, as we have shown at the beginning of this section, this would correspond to a case where τ n / γ is much larger than c(1 − r ), and once the number of infected individuals has reached the value c it would be very likely to reach a large value and consume a large number of susceptible individuals. as a consequence, we would have in the limit only one large outbreak. we thus choose to equalize this probability to one half to get an estimation of τ opt. notice that this choice is arbitrary but has only a small effect on the final results. for instance, a choice of . or . would give very close results. the most important is to stay away from 0 and 1. as a conclusion, τ opt is estimated as the unique solution of the resulting equation; the unicity of the solution is proved in appendix c. . fig. (b) presents the values of τ maximising the number of outbreaks and their estimations (dots) obtained by the branching process approximations. the estimates derived under the branching process approximation give good results, with errors ranging from to % regardless of the value of the epidemiological threshold ( table ). 

table . values of the optimal spillover transmission from numerical simulations and estimations. the optimal spillover transmission is calculated for r equal to . , . , . , and . . we present the values of the optimal spillover for two values of the epidemiological threshold, c = (bold) and c = (not bold). the errors range from to % with a mean error of %. 

to get the estimation of τ opt we have made several approximations. 
first, we have considered that the spillover rate from the reservoir is constant, equal to τ n, whereas it is decreasing and equals τ s, and that the rate of direct transmission due to infected individuals in the population is βni and not βsi. we believe that these approximations are reasonable because the probability for an excursion to reach the threshold decreases with the consumption of susceptible individuals, and as a consequence most of the outbreaks will occur at the beginning of the process. however, the real τ opt should be a little higher than the one we estimate, to counterbalance the fact that the real infection rates (from the reservoir and from the infected individuals) are smaller than the ones we use in our calculations. this may explain why in most of the cases we underestimate the real τ opt (see table ). the next approximation we made is the decomposition of the excursions into blocks with an initial number of individuals m i. in the real process there are no simultaneous infections by the reservoir; however, this approximation allows us to take into account previous infections by the reservoir whose infectious lines are still present. when the ratio var[ i ] / e [ i ] = τ n/γ is small (see appendix c. ), the value of i stays close to its expectation and few outbreaks occur, as the number of infected individuals rarely reaches c. for instance, for r = . and τ = . − , τ n / γ ∼ . . as decreasing τ increases this ratio, this could explain why we overestimate τ opt for small values of r (because smaller values of r necessitate higher values of τ to get the same probability to reach the threshold c). during the epidemic, a large outbreak can occur depending on the value of the direct transmission ( r ) and the spillover transmission ( τ ), and corresponds to the largest number of infected individuals. 
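this largest outbreak is analysed with a deterministic sir model whose equations are elided in this version of the text. assuming the natural deterministic counterpart of the stochastic rates, s' = −βsi − τs and i' = βsi + τs − γi (this system is an assumption of the sketch, as is the parameter set), the peak number of infected individuals can be obtained by direct numerical integration with a simple fixed-step euler scheme.

```python
def peak_infected(n=1000, beta=0.002, gamma=1.0, tau=1e-4, dt=1e-3, t_max=60.0):
    """euler integration of the deterministic sir model with spillover:
        ds/dt = -beta*s*i - tau*s
        di/dt =  beta*s*i + tau*s - gamma*i
    returns the maximal number of infected individuals reached."""
    s, i = float(n), 0.0
    peak = 0.0
    for _ in range(int(t_max / dt)):
        ds = -beta * s * i - tau * s
        di = beta * s * i + tau * s - gamma * i
        s += dt * ds
        i += dt * di
        if i > peak:
            peak = i
    return peak
```

consistent with the behaviour described in the text, the computed peak grows with both the direct transmission (through β) and the spillover transmission τ.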
to analyse the effect of the recurrent emergence of the pathogen on the size of the largest outbreak, we model the largest outbreak by a deterministic sir model with a spillover transmission ( ). since no explicit expression of the size of the outbreak can be obtained with the deterministic model, we estimated it using numerical analyses. fig. shows that the maximal number of infected individuals during the largest outbreak increases with the direct transmission ( r ) and the spillover transmission ( τ ). when the direct transmission ( r ) is small, the size of the largest outbreak can differ by orders of magnitude with varying spillover transmission ( τ ). furthermore, a large outbreak can be observed for a barely contagious pathogen ( r < 1) when the recurrent emergence of the pathogen is high ( τ − ). 

zoonotic pathogens constitute one of the most pressing concerns with regard to future emerging diseases, but studies investigating the importance of the role of animal reservoirs in the epidemiological dynamics of infectious diseases are lacking. indeed, most theoretical works only consider pathogen transmission between conspecifics when predicting disease epidemiology. here, we build a continuous time stochastic sir model to consider the statistical process underlying a spillover transmission, i.e. transmission from an animal reservoir to a host. we analyse the model to predict the number and the size of outbreaks as a function of both the spillover transmission and the direct transmission within the incidental host. the model shows that spillover transmission influences the epidemiological dynamics as much as the transmission by direct contact between individuals. three different dynamics are observed, ranging from the absence of outbreaks to a single large outbreak. the findings have implications for ( ) modelling the dynamics of eids, ( ) understanding the occurrence of outbreaks in the case of pathogens that are barely contagious and ( ) control strategies. 
in our results, the appearance of outbreaks depends on both the transmission from the reservoir and the direct transmission between individuals. generally, the occurrence of epidemics in humans is attributed to the ability of the pathogen to propagate between individuals. in the case of a single-host process, the notion of the basic reproductive ratio r seems sufficient to evaluate the spread of the pathogen in a population entirely composed of susceptible individuals. in eids, r is also used to gauge the risk of pandemics. in this way, lloyd-smith et al. ( ) delineate the three stages identified for a zoonotic pathogen ( wolfe et al., ) by using the ability of the pathogen to spread between individuals. each stage corresponds to a specific epidemiological dynamics, ranging from a non-contagious pathogen making an outbreak impossible (stage ii, r = 0) to a barely contagious pathogen with few outbreaks and stuttering chains of transmission (stage iii, r < 1) to a contagious pathogen making a large outbreak possible (stage iv, r > 1). the aim of wolfe's classification is to establish the stages through which a zoonotic pathogen may evolve to become adapted to human transmission only, in order to identify pathogens at potential risk of pandemics. however, in our model, by taking into account the recurrent emergence of the pathogen from the reservoir, the three dynamics that define the three stages depend on both the spillover transmission and the direct transmission of the pathogen between individuals. the results suggest that in the case of a pathogen spilling over recurrently into an incidental host, the direct transmission should not be the only parameter to consider. the presence of a reservoir and its associated recurrent spillovers dramatically impact the epidemiological dynamics of infectious diseases in the incidental host. 
without transmission from the reservoir, the probability to have an outbreak when the pathogen is barely contagious depends only on the direct transmission between individuals, and the outbreak rapidly goes extinct. by contrast, the results show that the recurrent emergence of the pathogen from a reservoir increases the probability to observe an outbreak. spillover transmission enhances the probability both to observe longer chains of transmission and to reach the epidemiological threshold (i.e. the threshold from which an outbreak is considered), even for a barely contagious pathogen. moreover, this coupled model (reservoir-human transmission) allows the appearance of multiple outbreaks, depending on both the ability of the pathogen to propagate in the population and the transmission from the reservoir. zoonotic pathogens such as mers, ebola or nipah are poorly transmitted between individuals ( r estimated to be less than 1) ( althaus, ; chowell et al., ; luby et al., ; zumla et al., ), yet outbreaks of dozens, hundreds or thousands of infected individuals are observed. we argue that, as suggested by our model, the human epidemics caused by eids could be due to recurrent spillover from an animal reservoir. in the case of zoonotic pathogens, it is of primary importance to distinguish between primary cases (i.e. individuals infected from the reservoir) and secondary cases (i.e. individuals infected from another infected individual) to specify which control strategies to implement and how to optimize public health resources. according to the stochastic sir model coupled with a reservoir analysed here, the same dynamics can be observed depending on the relative contribution of the transmission from the reservoir and the direct transmission by contact with an infected individual (see fig. ). for example, a large outbreak is observed either for a high spillover transmission or for a high direct transmission. 
the proposed stochastic model makes it possible to understand the effects of infection from the reservoir versus direct transmission on the epidemiological dynamics in an incidental host, when empirically this distinction is difficult to make. empirically, the origin of the infection is established by determining the contact patterns of infected individuals during the incubation period. the role of the control programs implemented could then be evaluated in order to determine better strategies. we have considered that the reservoir is a unique population in which the pathogen can persist, which is a simplifying assumption. the pathogen is then endemic in the reservoir, and the reservoir exerts a constant force of infection on the incidental host. the reservoir can also be seen as an ecological system comprising several species or populations that together maintain the pathogen indefinitely ( haydon et al., ). for example, bat and dromedary camel ( camelus dromedarius ) populations are involved in the persistence of mers-cov and in its transmission to human populations ( sabir et al., ). in these cases, the assumption of a constant force of infection can be valid because the pathogen is endemic. however, zoonotic pathogens can spill over into multiple incidental hosts, which can in turn infect each other. in the case of the ebola virus, which infects multiple incidental hosts such as apes, gorillas and monkeys ( ghazanfar et al., ), the principal mode of contamination of the human population is transmission from non-human primate populations. moreover, the contact patterns between animals and humans are among the most important risk factors in the emergence of avian influenza outbreaks ( meyer et al., ). 
these different epidemiological dynamics, with transmission either from the reservoir or from other incidental hosts, can largely impact the dynamics of infection observed in the human population, and investigating those effects can enhance our understanding of zoonotic pathogen dynamics. in our model, we make a second simplifying assumption by considering that the infection propagates quickly relative to other processes such as pathogen evolution and demographic processes. this assumption may not be valid when emergence of the pathogen from the reservoir is rare. indeed, the time between two spillovers can then be long, which makes evolution of the pathogen inside the reservoir possible. moreover, during the time between two spillovers, the demography of the incidental host can vary and impact the propagation of the pathogen. in the case of low spillover transmission in the incidental host, the effects of both pathogen evolution and demographic processes can be a topic for future research on the epidemiological dynamics of emerging infectious diseases. in this paper, we have argued that the conventional way of modelling the epidemiological dynamics of endemic pathogens in an incidental host should be enhanced to account for spillover transmission in addition to conspecific transmission. we have shown that our continuous-time stochastic sir model with a reservoir produces dynamics similar to those found empirically (see the classification scheme for pathogens from wolfe et al., ). this model can be used to better understand the ways in which eids transmission routes impact disease dynamics. in this appendix, we derive results on the branching process approximation stated in section . the main idea of this approximation is the following: when the epidemiological process is subcritical ( r < ), an excursion will modify the state of only a small number of individuals with respect to the total population size. 
during the i-th excursion, the direct transmission rate βsi will stay close to βs_i i, where s_i denotes the number of susceptibles at the beginning of the i-th excursion. hence, if we are interested in the infected population, the rate βs_i i can be seen as arising from a constant individual birth rate βs_i . similarly, γ i, which is the rate at which some individual in the population recovers, can be interpreted as arising from a constant individual death rate γ in the population of infected individuals. in this section, we focus on the number of outbreaks when r < and when the rate of introduction of the infection by the reservoir is small ( τ ). that is to say, we consider that each introduction of the infection by the reservoir occurs after the end of the excursion created by the previous introduction. according to eq. ( ), this approximation is valid as long as the ratio τ / β is small. we first approximate the mean number of susceptible individuals consumed by an excursion. let us consider a subcritical branching process with individual birth rate βn and individual death rate γ . as this process is subcritical, we know that the excursion will die out in finite time and produce a finite number of individuals. then, from britton and pardoux ( ) or van der hofstad ( ), if we denote by k [ n, β, γ ] the total number of individuals born during the lifetime of this branching process (counting the initial individual), we know that , where p denotes a probability, and hence , where e is the expectation. by definition, an excursion is considered an outbreak only if the maximal number of individuals infected at the same time during the excursion is larger than an epidemiological threshold that we have denoted by c . hence, in order to approximate the number of outbreaks, we still have to compute the probability for an excursion to be an outbreak. 
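the excursion approximation can be checked numerically: the sketch below simulates the embedded jump chain of a linear birth-death process started from one infected individual (rates and seed are illustrative assumptions) and compares the mean total progeny with the known value 1/(1 − r) for a subcritical process:

```python
import random

def excursion(b, d, rng):
    """one excursion of a linear birth-death process started from a single
    infected individual; since both rates are linear in the current size,
    the embedded jump chain moves up with probability b/(b+d) at each event.
    returns (total progeny k, peak number simultaneously infected)."""
    i, k, peak = 1, 1, 1
    while i > 0:
        if rng.random() < b / (b + d):   # birth = new infection
            i += 1
            k += 1
            peak = max(peak, i)
        else:                            # death = recovery
            i -= 1
    return k, peak

rng = random.Random(0)
b, d = 0.5, 1.0                          # r = b/d = 0.5 < 1, subcritical
samples = [excursion(b, d, rng) for _ in range(2000)]
mean_k = sum(k for k, _ in samples) / len(samples)
# for a subcritical process the expected total progeny is 1/(1 - r) = 2
```

the peak of each excursion never exceeds its total progeny, which is the quantity compared with the threshold c above.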
this is a classical result in branching process theory, which can be found in athreya and ney ( ) for instance: p (n, β, γ ) := p ( more than c individuals infected at a given time ) = (1 − γ /(βn)) / (1 − (γ /(βn))^c ). (a. )
(fig. a. : probability for an excursion to be of size k , for a reproductive ratio ( r ) varying from . to . .)
with these results in hand, the method to approximate the mean number of outbreaks is the following: the probability that the first excursion is an outbreak is p (n, β, γ ). the number of susceptibles at the beginning of the second excursion is approximated by , and the second excursion has probability (1 − γ /(βs )) / (1 − (γ /(βs ))^c ) to be an outbreak. the number of susceptibles at the beginning of the third excursion is approximated by , and the third excursion has probability (1 − γ /(βs )) / (1 − (γ /(βs ))^c ) to be an outbreak. the procedure is iterated as long as there is still a positive number of susceptible individuals. this gives the desired approximation. we now focus on the case r = βn/γ > . in this case the approximating branching process is supercritical and goes to infinity with positive probability. when the epidemic process describes small excursions, the branching process approximation is still valid, but when it describes a large excursion, a large fraction of susceptible individuals is consumed and the branching approximation is no longer valid. however, as all the quantities (susceptible, infected and recovered individuals) are large, a mean field approximation is a good approximation of the process. here the mean field approximation is the deterministic sir process, whose dynamics are given by (b. ). let us first focus on the small excursions occurring before the large one. as they are small, they can be approximated by a branching process. here, unlike in the previous section, the approximating branching process z is supercritical, as βn > γ . 
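the iterated procedure just described can be sketched as follows; the hitting-probability expression used here is the classical formula for a linear birth-death chain, reconstructed because the exact expression in the text is garbled, and all numerical values are illustrative:

```python
def p_outbreak(beta, gamma, s, c):
    """probability that an excursion started from one infected individual
    reaches the threshold c before extinction, for a birth-death chain
    with per-individual birth rate beta*s and death rate gamma
    (classical hitting-probability formula, reconstructed)."""
    q = gamma / (beta * s)               # down/up ratio of the jump chain
    if q == 1.0:
        return 1.0 / c
    return (1.0 - q) / (1.0 - q ** c)

def expected_outbreaks(beta, gamma, n, c, n_intro):
    """iterate over successive reservoir introductions, depleting the
    susceptible pool by the mean excursion size 1/(1 - r) each time."""
    s, total = float(n), 0.0
    for _ in range(n_intro):
        r = beta * s / gamma
        if s <= 1.0 or r >= 1.0:
            break                        # approximation valid only while subcritical
        total += p_outbreak(beta, gamma, s, c)
        s -= 1.0 / (1.0 - r)             # mean susceptibles consumed per excursion
    return total
```

as expected, the per-excursion outbreak probability grows with the susceptible pool, and the accumulated expectation over many introductions stays small when the direct transmission is weak.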
we compute its probability of drifting to infinity. as we will see, a supercritical branching process with individual birth rate βn and individual death rate γ , conditioned to go extinct, has the same law as a subcritical branching process with individual birth rate γ and individual death rate βn . indeed, if we denote by z_n the successive values of this branching process, we get for every pair of natural numbers ( n, k ): , where p (a | b ) denotes the probability of the event a when b is realised. in this series of equalities we again used classical results on branching processes that can be found in athreya and ney ( ). as a consequence, if we denote by g [ n, β, γ ] the number of susceptible individuals consumed by the excursion of a supercritical branching process with individual birth rate βn and individual death rate γ conditioned to go extinct, we get , and the probability for this excursion to have a size larger than the epidemiological threshold c is . as the number of susceptible individuals stays large until the large excursion occurs, we may keep n as the initial number of susceptibles at the beginning of the excursions, instead of replacing it by its mean value as was done in the previous section. the quantities we have just computed allow us to approximate the expected number of small excursions before the large excursion: . as k and l ( k ) are small with respect to n , this can be approximated by . finally, notice that in (b. ), s is a decreasing quantity and i is a non-negative quantity which varies continuously. hence ˙i = i(βs − γ ) has to be negative before i hits . as a consequence, this ensures that the epidemic is subcritical after the large outbreak. in this section, we focus on the effect of the reservoir transmission rate (parameter τ ) on the number of outbreaks when the infection is subcritical ( r < ). 
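the number of susceptibles remaining after the large excursion can be approximated from the mean-field final-size relation; since the equation labelled (b. ) is not shown in the text, the sketch below assumes the standard sir form s = exp(−r(1 − s)) for the final fraction of susceptibles (with s(0) ≈ 1) and solves it by bisection:

```python
import math

def final_size_fraction(r0, tol=1e-12):
    """smallest root s in (0, 1) of s = exp(-r0 * (1 - s)), the classical
    sir final-size relation (a standard mean-field sketch, assumed here).
    for r0 > 1 the function s - exp(-r0*(1-s)) is negative at 0 and
    positive just below 1, so bisection brackets the smallest root."""
    f = lambda s: s - math.exp(-r0 * (1.0 - s))
    lo, hi = 0.0, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

for example, with a reproductive ratio of 2 roughly a fifth of the population escapes infection, and the remaining fraction of susceptibles grows as the reproductive ratio decreases.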
the idea is the following: first, as excursions of subcritical branching processes are small, we can make the approximation that, at the beginning, the infection rate from the reservoir is constant, equal to τ n , and that the direct transmission rate is equal to βni . making this approximation allows us to handle the two processes of infection (by contact and from the reservoir) independently, and to use known results on branching processes with immigration. recall that i denotes the number of infected individuals. we first assume ( ) and prove inequality ( ). let us choose α > 0 such that . then for any 0 ≤ k ≤ c − 1, the jump rates of the process i are . this implies that once one individual is infected, the probability for the number of simultaneously infected individuals to reach c before the recovery of all infected individuals is larger than the probability that a birth and death process with initial state and birth ( b ) and death ( d ) rates satisfying b d reaches the state c . applying (a. ), we deduce that this probability p_c satisfies , which proves ( ). moreover, as α has been taken large, the infectious process stays supercritical (in the sense that the next event is more likely to be an infection than a recovery) until a size k satisfying , and thus if the number of infected individuals reaches c it is likely to reach a large value and consume a large number of susceptible individuals. as we approximate the infection process by a branching process with constant immigration, the law of i under this approximation converges to a well-known law, provided by eq. ( . ) in bailey ( ). from this law, we can deduce the probability for i to be equal to any integer k , and thus the probability for i to be larger than c is . recall that we assumed that . we thus get the following inequality: . now let us notice that for any i ≥ , − r . we thus get , where to obtain the last inequality we computed the maximum of the function x → x a^(x−1) for a = (1 + r )/2. 
as a conclusion, for a fixed r < and a small enough τ , the probability for the number of infected individuals at a given time to be larger than c is bounded by a function of r times τ n / (cγ (1 − r )). this probability is thus small when the last factor is small. recall eq. (c. ); it allows us to compute the variance of i as follows: . in particular, e [ i] = τ n γ . in this section we provide an expression for the term e [ t ], which appears in the definition of m_i (see eq. ( )). this expression derives from the following equality, which can be found in athreya and ney ( ). let t ≥ 0 and let t denote the extinction time of the excursion of a branching process with one individual at time 0. then we have . this allows one to compute the expectation of t as follows: , where we made an integration by parts. thus, recall that if we consider a branching process with individual birth rate βn , individual death rate γ , and initial state k ≤ c , the probability for this process to reach the size c is (see athreya and ney, for instance). we use this result to approximate the probability for a block of the excursion to reach the threshold c by , where we recall that m_i is the mean number of simultaneous excursions generated by different infections from the reservoir. notice that this is an approximation, as m_i is not necessarily an integer. we end this appendix with the proof of the uniqueness of the solution to ( ). to simplify the notation, we introduce the function f which to x associates: . first we notice that f is only defined for x ≤ c ; otherwise the term in brackets would be negative. second, notice that if x ≤ , then for any c ≥ and r < , f ( x ) > 1/2. we now determine the sign of f ( x ) for x belonging to the interval [ , c ]. a direct computation gives: . as the two logarithms are negative for x belonging to ( , c ), we deduce that f ( x ) < for x belonging to ( , c ). 
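the immigration approximation can be illustrated numerically: the sketch below time-averages a linear birth-death process with immigration (illustrative rates and seed, not values from the text); a known result for such processes is that the stationary mean is imm/(death − birth), which reduces to the pure immigration-death mean imm/death when direct transmission is switched off:

```python
import random

def mean_infected(imm, birth, death, t_max=20000.0, seed=7):
    """time-averaged number of infected for a linear birth-death process
    with immigration: up-rate imm + birth*i, down-rate death*i, with
    birth < death so a stationary regime exists. the stationary mean is
    imm / (death - birth) (classical result for this process)."""
    rng = random.Random(seed)
    i, t, area = 0, 0.0, 0.0
    while t < t_max:
        up = imm + birth * i
        down = death * i
        dt = rng.expovariate(up + down)
        area += i * min(dt, t_max - t)   # clip the last holding time
        t += dt
        if rng.random() < up / (up + down):
            i += 1                       # immigration or internal birth
        else:
            i -= 1                       # recovery
    return area / t_max
```

with immigration rate 1 and death rate 1, the time average sits near 2 when the per-individual birth rate is 0.5, and near 1 when direct transmission is absent.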
as f ( ) > 1/2 and f (c) = , we conclude that f (x ) = 1/2 has a unique solution on the interval ( , c ). this is equivalent to the fact that (c. ) has a unique solution τ_opt, which belongs to this interval. this ends the proof.

references:
estimating the reproduction number of ebola virus (ebov) during the outbreak in west africa
what it takes to be a reservoir
when is a reservoir not a reservoir?
branching processes
the elements of stochastic processes with applications to the natural sciences
centre for disease control and prevention
synthesizing data and models for the spread of mers-cov: key role of index cases and hospital transmission
ebolavirus evolution: past and present
community epidemiology framework for classifying disease threats
ebola, the killer virus
identifying reservoirs of infection: a conceptual and practical challenge
ebola between outbreaks: intensified ebola hemorrhagic fever surveillance in the democratic republic of the congo
global trends in emerging infectious diseases
impacts of biodiversity on the emergence and transmission of infectious diseases
a contribution to the mathematical theory of epidemics
epidemic dynamics at the human-animal interface
recurrent zoonotic transmission of nipah virus into humans
movement and contact patterns of long-distance free-grazing ducks and avian influenza persistence in vietnam
emerging infectious diseases: threats to human health and global stability
human ecology in pathogenic landscapes: two hypotheses on how land use change drives viral emergence
extinction pathways and outbreak vulnerability in a stochastic ebola model
processus de markov et applications: algorithmes, réseaux, génome et finance: cours et exercices corrigés
co-circulation of three camel coronavirus species and recombination of mers-covs in saudi arabia
outbreak statistics and scaling laws for externally driven epidemics
the structure of infectious disease outbreaks across the animal-human interface
exploring a proposed who method to determine thresholds for seasonal influenza surveillance
risk factors for human disease emergence
random graphs and complex networks
origins of major human infectious diseases
biological features of novel avian influenza a (h n ) virus
middle east respiratory syndrome

the authors have been supported by the "chair modélisation mathématique et biodiversité" of veolia environnement-ecole polytechnique-museum national d'histoire naturelle-fondation x, france.

now we focus on the large excursion. we use eq. (b. ) to approximate its dynamics. this equation is well-known, and it is easy to obtain from (b. ) the equation satisfied by the final number of susceptible individuals. in particular, if we are interested in the time t_f at which there are no more infected individuals, and we suppose that at time there is only one infected individual, we get that s ( t_f ) and s ( ) are related by the equation . rigorously, the value of s ( ) depends on the number of susceptible individuals consumed by the small excursions before the large excursion, but we have seen that this number is small compared to the population size, n . hence the number of susceptible individuals remaining after the large excursion can be approximated by the smallest solution of . notice that it is easy to get an idea of the error caused by a small variation of the initial state: if we denote by s_f the smallest solution of (b. ) and by s_f − l(k ) the solution when

key: cord- - j r dd title: estimates of the proportion of sars-cov- infected individuals in sweden date: - - journal: nan doi: nan sha: doc_id: cord_uid: j r dd

in this paper a bayesian seir model is studied to estimate the proportion of the population infected with sars-cov- , the virus responsible for covid- . a poisson point process is used to model the occurrence of deaths due to covid- , and the model is calibrated using data of daily death counts in combination with a snapshot of the proportion of individuals with an active infection, performed in stockholm in late march. the methodology is applied to regions in sweden. the results show that the estimated proportion of the population that has been infected is around . % in stockholm, by - - , and ranges between . % and . % in the other investigated regions. in stockholm, where the peak of daily death counts is likely behind us, parameter uncertainty does not heavily influence the expected daily number of deaths, nor the expected cumulative number of deaths. it does, however, impact the estimated cumulative number of infected individuals. in the other regions, where random sampling of the number of active infections is not available, parameter sharing is used to improve estimates, but the parameter uncertainty remains substantial. to understand the spread of the novel coronavirus, sars-cov- , at an aggregate level it is possible to model the dynamic evolution of the epidemic using standard epidemic models. such models include the (stochastic) reed-frost model and more general markov chain models, or the corresponding (deterministic) law of large numbers limits such as the general epidemic model, see [ ] . there is an extensive literature on extensions of the standard epidemic models incorporating various degrees of heterogeneity in the population, e.g. age groups, demographic information, spatial dependence, etc. these additional characteristics make the models more realistic. for instance, it is possible to evaluate the effect of various intervention strategies. more complex models also involve additional parameters that need to be estimated, contributing to a higher degree of parameter uncertainty. 
a poisson point process is used to model the occurrence of deaths due to covid- and the model is calibrated using data of daily death counts in combination with a snapshot of the the proportion of individuals with an active infection, performed in stockholm in late march. the methodology is applied to regions in sweden. the results show that the estimated proportion of the population who has been infected is around . % in stockholm, by - - , and ranges between . % - . % in the other investigated regions. in stockholm where the peak of daily death counts is likely behind us, parameter uncertainty does not heavily influence the expected daily number of deaths, nor the expected cumulative number of deaths. it does, however, impact the estimated cumulative number of infected individuals. in the other regions, where random sampling of the number of active infections is not available, parameter sharing is used to improve estimates, but the parameter uncertainty remains substantial. to understand the spread of the novel coronavirus, sars-cov- , at an aggregate level it is possible to model the dynamic evolution of the epidemic using standard epidemic models. such models include the (stochastic) reed-frost model and more general markov chain models, or the corresponding (deterministic) law of large numbers limits such as the general epidemic model, see [ ] . there is an extensive literature on extensions of the standard epidemic models incorporating various degrees of heterogeneity in the population, e.g. age groups, demographic information, spatial dependence, etc. these additional characteristics make the models more realistic. for instance, it is possible to evaluate the effect of various intervention strategies. more complex models also involve additional parameters that need to be estimated, contributing to a higher degree of parameter uncertainty. 
a problem when calibrating, even the standard epidemic models, to covid- data is that there are few reliable sources on the number of infected individuals. publicly available sources provide data on the number of positive tests, the number of hospitalizations, the number of icu admission and the number of deaths due to covid- . in some cases, small random samples of an active infection may be available. for example, the swedish folkhälsomyndigheten performed such a test in stockholm with about subjects in early april . moreover, there is still no consensus in the literature on the value of important parameters such as the basic reproduction number r and the infection fatality rate. a useful approach to incorporate the parameter uncertainty in the models is to consider a bayesian framework. in the bayesian approach parameter uncertainty is quantified by prior distributions over the unknown parameters. the impact of observed data, in the form of a likelihood, yields, via bayes' theorem, the posterior distribution, which quantifies the effects of parameter uncertainty. the posterior can be used to construct estimates on the number of infected individuals, predictions on the future occurrence of infections and deaths, as well as uncertainties in such estimates. in this paper an seir epidemic model with time-varying contact rate will be used to model the evolution of the number of susceptible (s), exposed (e), infected (i), and recovered (r) individuals. a time varying contact rate is used to capture heterogeneity in the population, which causes the rate of the spread of the epidemic to vary as the virus spreads through the population. moreover, the time varying contact rate allows modeling the effect of interventions aimed at reducing the rate of epidemic spread. a poisson point process is introduced to model the occurrence and time of deaths. 
random samples of tests for active infections are treated as binomial trials, where the success probability is the proportion of the population in the infectious state. the methods are illustrated on regional data of daily covid- deaths in sweden. it is demonstrated that, by combining the information in the observed number of deaths and random samples of active infections, fairly precise estimates of the number of infected individuals can be given. by assuming that some parameters are identical in several regions, estimates for regions outside stockholm can also be provided, albeit with greater uncertainty. our approach is inspired by [ ], where the authors consider a bayesian approach to model an influenza outbreak. the main extensions include the introduction of the poisson point process to model the occurrence of deaths, the addition of random sampling to test for infection, and an extension to multiple regions. to evaluate the posterior distribution we employ markov chain monte carlo (mcmc) sampling. samples from the posterior are obtained using the hamiltonian monte carlo algorithm, nuts, by [ ], implemented in stan, an open-source software for mcmc. to model the spread of the epidemic we consider the deterministic seir model [ , ], a simple model describing the evolution of the number of susceptible, exposed, infected, and recovered individuals in a large homogeneous population with n individuals. the epidemic is modeled by {(s t , e t , i t ), t ≥ }, where s t , e t and i t represent the number of susceptible, exposed and infected individuals at time t, respectively. the total number of recovered and deceased individuals at time t ≥ is always given by n − s t − e t − i t . the epidemic starts from a state s , e , i with s + e + i = n , and proceeds by updating . the parameters are the contact rate β > , the rate ν of transition from the exposed to the infected state, and the recovery rate γ > . 
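the seir dynamics described above can be sketched with a simple euler discretization; the latency, infectious period, intervention day and contact rates below are illustrative assumptions, not the estimates of the paper:

```python
def seir_simulate(n, beta_fn, nu, gamma, e0=1.0, days=200, dt=0.1):
    """euler discretization of the seir dynamics with a time-varying
    contact rate beta_fn(t). returns the final (s, e, i, r) state."""
    s, e, i, r = n - e0, e0, 0.0, 0.0
    for step in range(int(days / dt)):
        t = step * dt
        new_e = beta_fn(t) * s * i / n * dt   # susceptible -> exposed
        new_i = nu * e * dt                   # exposed -> infected
        new_r = gamma * i * dt                # infected -> recovered
        s, e, i, r = s - new_e, e + new_e - new_i, i + new_i - new_r, r + new_r
    return s, e, i, r

# illustrative parameters: 1/nu = 5 day latency, 1/gamma = 10 day infectious
# period, and a contact rate that drops after an intervention at day 30
beta = lambda t: 0.4 if t < 30 else 0.1
s, e, i, r = seir_simulate(n=1e6, beta_fn=beta, nu=0.2, gamma=0.1)
```

the piecewise-constant contact rate here only illustrates the mechanism; in the paper the log contact rate carries a gaussian process prior and is estimated from data.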
note that i t represents the number of individuals with an active infection at time t, whereas n − s t is the cumulative number of individuals who have been exposed, and possibly infected, recovered or deceased, up until time t. in the context of the covid- epidemic the contact rate cannot be assumed to be constant, primarily due to interventions implemented in the early stage of the epidemic. moreover, as the seir model describes the evolution at an aggregate level, a time-varying contact rate may be used to capture inhomogeneities in the population. if, for example, the epidemic is initiated in a rural area the contact rate may be rather low, but as the epidemic reaches major cities the contact rate will be higher. the resulting seir model with time-varying contact rate is given by ( ). clearly, one needs to put some restriction on the amount of variation of the contact rate. in this paper a gaussian process prior will be used on the log contact rate, which restricts the amount of variation in time, but is sufficiently flexible to capture the reduction in contact rate after the interventions. when observations on the number of infected and recovered individuals are available, the model ( ) can be fitted to these observations. in the context of covid- , observations on the number of infected and recovered individuals are unavailable: there are many symptomatic individuals who are not tested and potentially a large pool of asymptomatic individuals. in this paper we will rely on the number of registered deaths due to covid- to calibrate the model. in addition we will incorporate the test results from a random sample that provides a snapshot of the number of individuals with an active infection. to model the occurrence of deaths due to covid- we consider the following poisson point process representation. we refer to [ ] for details on poisson point processes. 
let f denote the infection fatality rate, that is, the probability that an infected individual eventually dies from the infection. consider the number of individuals that enter the infected state on day t, that is, νe t . each such infected individual has probability f of eventually dying from the infection. conditional on death due to the infection, the time from infection until death is assumed independent of everything else and follows a probability distribution with probability mass function p s d . each individual that dies may be represented as a point (t, τ ) in e := {(t, τ ) ∈ n² : τ ≥ t}, where t denotes the time of entry to the infected state and τ the time of death of the individual. the number of deaths at time τ can then be computed by counting the number of points on the line ∪_{t≤τ} (t, τ ). the number of deaths, together with the corresponding times of infection and death, is conveniently modelled by a poisson point process on e. let ξ be a poisson point process on e with intensity . we may interpret a point at (t, τ ) of the poisson point process as the time of infection, t, and the time of death, τ , of an individual who dies from the infection. the number of deaths d τ that occur at time τ is then given by summing up all the points of the point process on the row corresponding to τ , ξ(∪_{t≤τ} (t, τ )). since the rows are disjoint, this implies that d , d , . . . are independent, with each d τ having a poisson distribution with parameter . throughout this paper, p s d is the probability mass function of a negative binomial distribution with mean s d . more precisely, a parametrization of the negative binomial distribution with parameters r and s d will be used, with the value r = fixed throughout, as this fits well with the distribution of observed durations from symptoms to death in the study by [ ] . in this section we provide the assumptions on the prior distributions and derive the expression of the likelihood of the model. 
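the death model just described amounts to a convolution of daily new infections with the infection-to-death delay distribution; the sketch below implements λ_τ = f Σ_t νe_t p_d(τ − t) with a negative binomial delay, where the shape r and the mean delay used are illustrative assumptions, since the paper's values are not shown here:

```python
import math

def nb_pmf(k, r, mean):
    """negative binomial pmf with given mean and integer shape r,
    parametrized via p = r/(r + mean) so that the mean is r(1-p)/p.
    the shape value is an assumption of this sketch."""
    p = r / (r + mean)
    return math.comb(k + r - 1, k) * p ** r * (1 - p) ** k

def expected_deaths(new_infections, f, s_d, r=5):
    """lambda_tau = f * sum_{t<=tau} new_infections[t] * p_d(tau - t),
    where new_infections[t] plays the role of nu*E_t."""
    lam = []
    for tau in range(len(new_infections)):
        lam.append(f * sum(new_infections[t] * nb_pmf(tau - t, r, s_d)
                           for t in range(tau + 1)))
    return lam
```

for a single cohort of infections, the expected deaths sum to f times the cohort size (up to delay-distribution tail mass beyond the observation window) and peak near the mode of the delay distribution.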
note that λ τ is a function of all the parameters of the model, θ = ({β t }, ν, γ, s , f, s d ). the parameters, their interpretation and prior distributions are summarized in table .
(table : specification of the parameters and prior distributions.)
since the contact rate is positive, a gaussian process (gp) prior will be used for the natural logarithm of the contact rate, denoted log-gp in the sequel. the gaussian process has a constant mean µ and a squared-exponential covariance kernel k with parameters α, ρ, δ such that . to compute the likelihood, the observed number of daily deaths, d , d , . . . , d t , will be used, in combination with a random sample of n tests for active infection, performed at a time t , when such a test result is available. the number of individuals z with a positive test result has a bin(n , i t /n ) distribution. the full likelihood is given by . the joint prior is the product of the marginal priors and leads, by bayes' theorem, to the posterior . the expected number of daily deaths λ τ , the cumulative number of deaths and the cumulative number of infected individuals n − s t are all functions of θ, and their distributions can therefore be inferred from the posterior p Θ|d,z . by sampling from p Θ|d and iterating the dynamics ( ), estimates of these quantities may be obtained along with the effects of parameter uncertainty. moreover, predictions on the future development of the above-mentioned quantities can be obtained by extrapolating the contact rate into the future. as the posterior distribution is unavailable in explicit form, it is necessary to employ monte carlo methods; in the next section markov chain monte carlo methods for sampling from the posterior are briefly described. multiple regions. the seir model ( ) and the derivation of the likelihood ( ) consider a single region. in the context of multiple regions it may be reasonable to assume that some parameters are identical. 
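the single-region likelihood described above, poisson terms for the daily death counts times one binomial term for the random test of active infections, can be sketched as a log-likelihood (all numbers in the usage below are illustrative, not the paper's data):

```python
import math

def log_likelihood(deaths, lambdas, z, n_test, i_frac):
    """poisson log-likelihood of daily death counts plus a binomial term
    for a random sample of n_test tests with z positives, where i_frac
    is the model's proportion of the population with an active infection.
    constants independent of the parameters are kept for completeness."""
    ll = sum(-lam + d * math.log(lam) - math.lgamma(d + 1)
             for d, lam in zip(deaths, lambdas))
    ll += (math.lgamma(n_test + 1) - math.lgamma(z + 1)
           - math.lgamma(n_test - z + 1)
           + z * math.log(i_frac) + (n_test - z) * math.log(1 - i_frac))
    return ll
```

as expected, with the death terms held fixed the likelihood is maximized when the modelled infection proportion matches the observed positive fraction z/n_test, which is how the snapshot survey anchors the scale of the epidemic.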
for example, when considering multiple regions of sweden below, it will be assumed that the rate, ν, from exposed to infected, the recovery rate, γ, the infection fatality rate, f , and the duration, s d , are identical in all regions. it is tempting to include interaction terms between the regions, as infected individuals from one region may travel to another region and cause new infections. in this paper, however, it will be assumed that each region has its own time-varying contact rate that incorporates fluctuations in new infections due to imported cases from other regions. the likelihood from multiple regions is simply the product of the marginal likelihoods for the individual regions, and the prior is the product of the marginal priors for each parameter. thus, for two regions the prior will be the product of two gaussian process priors for the respective log contact rates of the two regions and the product of the marginal priors for the remaining parameters. markov chain monte carlo (mcmc) methods in bayesian analysis aim at sampling from the posterior distribution. this is non-trivial because the marginal distribution of the data, which acts as the normalizing constant of the posterior, is practically impossible to compute. in mcmc algorithms the posterior is represented as a target distribution. the algorithms rely on the construction of a markov chain whose invariant distribution is the target distribution. standard mcmc methods are based on acceptance-rejection steps, where random proposals are accepted or rejected with a probability that does not require knowledge of the normalizing constant, e.g., metropolis-hastings and gibbs sampling [ , , ] . when the target distribution is complex and multi-modal, standard methods may lead to poor mixing of the markov chain and slow convergence to the target distribution. to overcome slow mixing of the markov chain, gradient-based samplers can be applied, which adapt the proposal distribution based on gradients of the target, see e.g. 
[ ] . in this paper we will employ a hamiltonian monte carlo sampler, the no-u-turn sampler (nuts) by [ ], in combination with automatic differentiation to numerically approximate the gradients [ ], as implemented in the open source software stan. in this section the estimates of the number of infected individuals and predictions on the evolution of the number of deaths and number of infected individuals are provided for ten regions of sweden. the epidemic is considered to start on - - and interventions in sweden began on - - . the joint prior distribution is the product of the marginal priors, and the hyper-parameters are specified in table . the choices of hyper-parameter values are made in line with the existing literature on the covid- epidemic. as a general principle we have used informative priors on the parameters ν, γ, and s d , whereas the priors on the time-varying contact rate {β t } and the fatality rate f are uninformative. folkhälsomyndigheten reports that the incubation period is usually around days, which corresponds to /ν ≈ . similarly, the expected time to recovery is around days, /γ ≈ . the overall infection fatality rate f is estimated to be in the range . − . , see [ , ] . however, since the infection fatality rate is a very important parameter, we have used an uninformative uniform prior, beta( , ). the expected duration from symptoms to death is around days, see [ ] . samples from the posterior are obtained using the nuts-sampler with a burn-in period of samples and samples after burn-in.
(table columns: mean, prior %-c.i.)
a random sample of individuals, tested between - - and - - , showed that individuals carried the sars-cov- virus. these results are included in the analysis as a binomial sample of size n = and success probability i t /n , where the test date, t , is assumed to be - - . a summary of the marginal posterior distributions is provided in table . the posterior distribution of the time-varying contact rate is illustrated in figure . 
note that although there is great uncertainty about the initial contact rate, the model clearly picks up the reduction in contact rate after the interventions began on - - . the contact rate is gradually reduced around the time of intervention and then remains at a low level. this slow reduction of the contact rate is, however, not due to stiffness of the gaussian process kernel. we have experimented with a sharp break-point in the contact rate at the time of intervention, but it did not provide more accurate results. on the contrary, the data suggest that the reduction of the contact rate is slow. the contact rate is estimated until - - . after this date the posterior is unreliable, because many of the deaths of individuals who are infected after - - have not yet been observed. for this reason, the contact rate is only estimated until - - . to perform estimates and predictions of the future number of daily and cumulative infections and deaths, the contact rate has been extrapolated from its value on - - . the posterior distribution suggests that the contact rate has been constant, at a low level, since roughly - - , which motivates extrapolation into the future, assuming that the interventions remain at the present level. after - - the contact rate is therefore extrapolated by assuming it remains constant. figure (top left) shows the observed daily number of deaths (black dots) along with the posterior median (dark red) and % credibility interval (red) for the expected number of daily deaths. figure (top right) shows the observed cumulative number of deaths (black dots) along with the posterior median (dark red) and % credibility interval (red) for the expected cumulative number of deaths. we observe that the parameter uncertainty does not substantially impact the expected number of daily deaths, and the peak of the daily number of deaths appears to have occurred by mid-april.
similarly, the expected cumulative number of deaths in stockholm is likely to terminate slightly above . we emphasize that this is the expected number of deaths, λ τ . since we are considering a poisson distribution for the number of daily deaths, an approximate %-prediction interval would be λ τ ± √λ τ , where λ τ is the poisson parameter on day τ . note from the observed number of daily deaths that the empirical distribution of daily deaths appears to be overdispersed: the variance is substantially larger than the mean. this is likely due to the reporting of the data. the data presented at https://c .se/ does not correct the reporting of death dates in hindsight. a comparison at the national level with data provided by folkhälsomyndigheten shows that the official records of the daily number of deaths for sweden do not appear to be overdispersed. nevertheless, even after smoothing the data from https://c .se/ by a moving average over a few days, the results of the simulations remain essentially the same. figure (bottom left/right) shows the posterior median (dark red) and % credibility interval (red) for the daily/cumulative number of infected individuals. although the parameter uncertainty has a significant impact on the cumulative number of infected individuals, some conclusions are still possible. as of mid-may, the cumulative number of infected individuals has almost reached its terminal value and the spread of the epidemic has slowed down significantly. the estimated cumulative number of infected individuals is . % of the population in stockholm. the estimated number of infected individuals by - - is . %, showing that these results are well in line with the reports of the antibody test performed at kth, which indicated that % of the population in stockholm had developed antibodies against the sars-cov- virus by the first weeks of april.
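the λ τ ± √λ τ prediction interval and the variance-to-mean overdispersion check described above can be sketched as follows; the daily-death mean in the example is an arbitrary placeholder, not a value from the paper.

```python
import math

def poisson_pred_interval(lam, z=1.0):
    """approximate prediction interval lam +/- z*sqrt(lam) for a poisson
    count with mean lam, floored at zero; z = 1 matches the interval
    quoted in the text, and z can be raised for a wider interval."""
    half = z * math.sqrt(lam)
    return max(0.0, lam - half), lam + half

def dispersion_index(counts):
    """variance-to-mean ratio of a count series; it is ~1 for poisson
    data, while values well above 1 indicate overdispersion such as
    that seen in the reported daily-death series."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

lo, hi = poisson_pred_interval(64.0)  # placeholder daily-death mean
```

a series that alternates between zero deaths and a large batch of back-dated reports, as produced by delayed reporting, yields a dispersion index far above one even when the underlying process is poisson.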
we emphasize that the estimate of the cumulative number of infected individuals in stockholm relies heavily on the inclusion of results from the random sampling performed by folkhälsomyndigheten in late march and early april. without this crucial piece of information, models similar to the one analyzed here may provide a significantly higher estimate of the cumulative number of infected. table : marginal posterior median and credibility intervals for region stockholm.

summary of the results for ten regions of sweden. in this section estimates of the cumulative number of infected individuals are provided for the following regions of sweden:
( ) stockholm (population: . · )
( ) västra götaland (population: . · )
( ) Östergötland (population: . · )
( ) Örebro (population: . · )
( ) skåne (population: . · )
( ) jönköping (population: . · )
( ) sörmland (population: . · )
( ) västmanland (population: . · )
( ) uppsala (population: . · )
( ) dalarna (population: . · )
the daily death counts for the regions of sweden until - - are obtained from the webpage https://c .se/. there is no random testing providing information on the proportion of infected individuals outside region stockholm. to estimate the contact rate and the cumulative number of infected individuals in regions outside stockholm, we have implemented the multi-region model pairwise, with two regions in each mcmc simulation, where one region is stockholm and the other region is from the list above. it is assumed that the parameters ν, γ, f, and s d are identical in both regions, but the time-varying contact rate and the initial proportion of susceptible individuals differ between the regions. the posteriors of the contact rates for the different regions are provided in figure and table , along with % credibility intervals. overall the proportions are low, far from herd immunity.
top left: observed daily number of deaths (black dots), the posterior median (dark red), and % credibility interval for the expected daily number of deaths. top right: observed cumulative number of deaths (black dots), the posterior median (dark red), and % credibility interval for the expected cumulative number of deaths. bottom left: the posterior median (dark red) and % credibility interval for the daily number of infected individuals. bottom right: the posterior median (dark red) and % credibility interval for the cumulative number of infected individuals.

references:
infectious diseases of humans: dynamics and control
the geometric foundations of hamiltonian monte carlo
basic estimation-prediction techniques for covid- , and a prediction for stockholm
contemporary statistical inference for infectious disease models using stan
mathematical tools for understanding infectious disease dynamics
stochastic relaxation, gibbs distributions, and the bayesian restoration of images
evaluating derivatives: principles and techniques of algorithmic differentiation
monte carlo sampling methods using markov chains and their applications
the no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo
random measures
early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia
substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov )
equation of state calculations by fast computing machines
estimates of the severity of coronavirus disease : a model-based analysis (the lancet infectious diseases)
estimating clinical severity of covid- from the transmission dynamics in wuhan, china

key: cord- -gaemgm t authors: white, laura forsberg; pagano, marcello title: transmissibility of the influenza virus in the pandemic date: - - journal: plos one doi: . /journal.pone.
sha: doc_id: cord_uid: gaemgm t background: with a heightened increase in concern for an influenza pandemic, we sought to better understand the influenza pandemic, the most devastating epidemic of the previous century. methodology/principal findings: we use data from several communities in maryland, usa, as well as two ships that experienced well-documented outbreaks of influenza in . using a likelihood-based method and a nonparametric method, we estimate the serial interval and reproductive number throughout the course of each outbreak. this analysis shows the basic reproductive number to be slightly lower in the maryland communities (between . and . ) than for the enclosed populations on the ships (r( ) = . , se = . ). additionally, the effective reproductive number declined to sub-epidemic levels more quickly on the ships (within around days) than in the communities (within – days). the mean serial interval for the ships was consistent ( . , se = . and . , se = . ), while the serial intervals in the communities varied substantially (between . , se = . and . , se = . ). conclusions/significance: these results illustrate the importance of considering the population dynamics when making statements about the epidemiological parameters of influenza. the methods that we employ for estimation of the reproductive numbers and the serial interval can be easily replicated in other populations and with other diseases. the emergence of the highly pathogenic avian influenza strain h n has raised concerns of an imminent influenza pandemic. public health workers, government officials, and disaster planners have an increasing interest in better understanding the potential impact of an influenza pandemic and possible strategies for containment. crucial in this planning is an understanding of the basic epidemiology of the disease in various settings.
this has led to a growing interest in the analysis and understanding of past epidemics, particularly that of , the most virulent and deadly influenza epidemic of the th century. mortality has been estimated at - million people worldwide as a result of influenza in the pandemic [ ] . it is reasonable to suppose that by better understanding the transmission dynamics of the highly pathogenic virus in , we can gain greater insight into the dynamics, and thus potential methods of control, of a future pandemic [ ] . important parameters for understanding disease transmission are the reproductive number and the serial interval [ ] . the basic reproductive number is defined as the average number of secondary infections created from a primary infection in an entirely susceptible population [ , see also ] . a more complex, but perhaps more meaningful, parameter is the effective reproductive number, which defines the average number of secondary infections an infected individual will create at a given point in the epidemic. this parameter takes into account that not all contacts of an infected individual are with susceptible persons, as well as the impact of public health control measures. control strategies are typically targeted to drive this number below one and maintain it there, as this will lead to eventual extinction of the epidemic. an example of this is herd immunity, that is, protection conferred when a sufficiently large proportion of the population is immune to a disease. modeling techniques are often used to determine the proportion of the population that should be vaccinated in order to keep the reproductive number low enough to avoid outbreaks of disease [ ] . the serial interval can be defined as the time interval between a primary case presenting with symptoms and its infectee developing symptoms [ , ] . thus this quantity is completely observable.
this is a mixture of the incubation period and the infectious period, both of which are useful to understand but difficult to measure. the sars outbreak of had a relatively long serial interval, estimated to be between and days on average and to follow a weibull distribution [ ] , making case isolation extremely effective in containing the epidemic. methods for the estimation of basic epidemiological parameters are still in the development phase. [ ] provides a thoughtful summary of methods for estimating the reproductive number. one particularly interesting and useful method has been previously described by [ ] for estimating the daily reproductive number, r t , or the average number of cases an infected individual on day t would cause. one interesting feature of this method is that for days where no cases are observed, the estimated effective reproductive number is zero. another observation is that this method essentially estimates a curve for the effective reproductive number that traces the epidemic curve, lagged by the average serial interval length. this nonparametric method presupposes information on the serial interval distribution. this is typical, as most methods for estimating the reproductive number rely on knowledge of the serial interval. few have described analytical methods for estimating the serial interval, making most methodologies dependent on contact tracing data, which is often difficult and expensive to obtain. [ ] describe a method to estimate the reproductive number that relies on limited contact tracing information but not on a full estimate of the serial interval. [ ] have recently described a method to estimate the serial interval and then used this estimate with the estimator proposed in [ ] of the daily reproductive number, applying their method to data from outbreaks of avian influenza in poultry farms in europe. several researchers have studied the pandemic and estimated some of these key epidemiological parameters.
estimates have ranged from - for the basic reproductive number, r , when using an seir model with a mean latent period of . days and infectious period of . days [ , ] . using an exponential model and assuming the serial interval to be four days (somewhat based on the assumptions of [ ] ), [ ] estimated r to be . - . for confined settings (such as prisons and ships) and . - . for community settings. the estimates for the mean latent and infectious periods come from [ ] and were used again by [ ] and [ ] . it appears that the original estimates were derived from epidemic data, although their source is not well documented. in what follows, we introduce new methodology for the estimation of both the daily reproductive number and the serial interval. we apply this method to data from two outbreaks on military ships during the influenza outbreak, as well as well-documented outbreaks in five maryland communities. the results from this method are compared with those of [ ] . the results illustrate the differences in infectious disease dynamics between outbreaks in a closed population and in a dynamic community. we analyze data from several well-documented influenza outbreaks in . first we consider data from two troop ships that embarked in the late fall of [ ] . the medic reported two initial cases on november . out of passengers ( crew members, soldiers, civilians) became sick with influenza over a day period (attack rate, ar, = . ), though most of the cases occurred within the first fourteen days. the boonah left durban and in five days, on november , reported the first three definitive cases of influenza. those who collected the data note that there were likely some initial cases that were not identified. out of on board ( crew members and troops), cases were reported (ar = . ) in the days of the epidemic. the united states public health service conducted special surveys of localities during the pandemic [ ] .
reported results from six communities in maryland are derived from house-to-house surveys requesting the date of onset of influenza for all infected, and the sex and age of each case of pneumonia and influenza. a summary of these populations is provided in table . we describe a likelihood-based methodology for estimating the reproductive number at each day in the epidemic as well as the serial interval. the method builds on that described by [ ] . we assume that the population is closed, that all cases are observed, and use daily case counts only (i.e., the number of new cases each day). let n = {n , n , n , …, n t } represent the daily case counts of influenza for the t days of the epidemic and x ij represent the number of cases that appear on day j that are infected by individuals who appeared sick on day i. the disease transmission model in the population is then as follows. we assume that the total number of cases produced by those on day i, x i· , is poisson distributed with parameter n i r i , where r i is the reproductive number for cases on day i. we further assume that x i = {x i,i+ , x i,i+ , …, x i,i+k } follows a multinomial distribution with parameters x i· , p, k, where p = {p , p , …, p k } represents the distribution of the serial interval. using these assumptions we can construct a likelihood function (see details in the supplemental information), which, when simplified, yields a convenient form in which the poisson mean for day i is m i = r i ( Σ j= k p j n i−j ) [ ] . maximization of this likelihood with respect to r i and p yields estimates of these parameters. to further simplify this process and create a more parsimonious model, we parameterize p by allowing it to follow a traditional parametric form for a serial interval (for instance a weibull, gamma, log normal, or exponential distribution).
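as an illustration, a minimal python sketch of this poisson likelihood, assuming the day-i mean m_i = r[i] · Σ_j p[j−1] · counts[i−j], a known discretized serial-interval distribution p (p[j-1] is the probability of a serial interval of j days), and a vector r of daily reproductive numbers; the example counts are hypothetical, not the maryland or ship data.

```python
import math

def expected_cases(counts, r, p, i):
    """poisson mean for day i: m_i = r[i] * sum_j p[j-1] * counts[i-j]."""
    k = len(p)
    return r[i] * sum(p[j - 1] * counts[i - j]
                      for j in range(1, min(k, i) + 1))

def log_likelihood(counts, r, p):
    """log-likelihood of the daily case counts: the sum of the poisson
    log-pmf of each day's count given the counts on preceding days."""
    ll = 0.0
    for i in range(1, len(counts)):
        m = expected_cases(counts, r, p, i)
        if m <= 0.0:
            if counts[i] > 0:
                return float("-inf")  # observed cases are impossible here
            continue
        ll += counts[i] * math.log(m) - m - math.lgamma(counts[i] + 1)
    return ll

# hypothetical outbreak doubling daily; with a one-day serial interval,
# a constant r = 2 reproduces the counts exactly, so the likelihood
# peaks at r = 2 on this grid
counts, p = [1, 2, 4, 8, 16], [1.0]
best = max((log_likelihood(counts, [x] * len(counts), p), x)
           for x in [1.0, 1.5, 2.0, 2.5, 3.0])[1]
```

with p parameterized by, say, a discretized gamma distribution, the same function can be maximized jointly over the reproductive numbers and the serial-interval parameters, as in the text.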
then the p j are functions of the parameters of the density (for instance, in the case of the gamma distribution, the p j depend only on the shape and rate parameters of the gamma). similarly, r i can be modeled parametrically as a function of time. one example of a reasonable model for this is the four parameter logistic curve [ ] [ ] [ ] , given by r i = a + b/(1 + exp(c(i − d))). the parameters of this curve describe the initial height of the curve (approximately a+b), the point of inflection (d), the curvature over the inflection (c), and the final height of the curve (a). these parameters have biological meaning in this setting, where the initial height corresponds to the values of r i prior to intervention and significant depletion of the susceptible population. the inflection point and its steepness describe the timing of intervention and the rapidity with which it impacts transmission. the final height describes the ultimate value of r i , which typically is less than one, indicating that disease transmission is in a sub-epidemic state. in our analysis, we also implement the method described by [ ] (hereafter referred to as the garske et al. method) and compare the results of the two methodologies. this method first estimates the generation time distribution using a likelihood-based method. then the effective reproductive number is estimated using the method described by [ ] (hereafter referred to as the wt method). we fit the likelihood for both methods using a nelder-mead maximization procedure and use starting values chosen to ensure that we reach the global maximum. all analyses were done using r . . . both methods assume homogeneous mixing in the population, no missing data (clearly violated with the data from the maryland communities), that a primary case experiences symptom onset prior to any cases that it infects, and a completely closed system where all cases are infected by a case that has been observed.
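the four parameter logistic curve described above can be sketched as follows; the form used here is a standard 4pl parameterization chosen to be consistent with the stated roles of a, b, c, and d, since the equation itself did not survive extraction.

```python
import math

def logistic4(i, a, b, c, d):
    """four-parameter logistic model for the daily reproductive number:
    r_i is approximately a + b for i << d (before intervention),
    inflects at i = d with curvature c, and tends to the final
    height a as i grows large."""
    return a + b / (1.0 + math.exp(c * (i - d)))
```

for example, a = 0.8, b = 2.0, c = 0.5, d = 20 (illustrative values only) gives an initial reproductive number near 2.8 that decays through 1.8 at the inflection day to a sub-epidemic 0.8.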
in the case of the maryland data, where only a sample of the total number of cases was surveyed, we can observe the efficacy and robustness of these methods with sample data. certainly results should be interpreted with caution; however, as we will show, the results that are obtained are consistent with previous estimates for influenza. standard errors were calculated for the mle method using a parametric bootstrap. one thousand epidemics were simulated using the parameter estimates, and estimates were obtained from each of these simulated epidemics. the standard deviation of these estimates was used as the standard error estimate. we used the method described in [ ] to estimate the standard error for their estimates; however, our simulations based on their assumption of asymptotic normality yielded a large number of negative estimates for the parameters. it is possible that this is due to the non-independence in the data and the lack of theoretical underpinnings for the method that they propose. these results make their standard errors infeasible to estimate in this case. therefore we do not present standard error estimates for the results obtained using their methodology. in order to determine the accuracy and relative merit of the estimates obtained from each methodology, we compute one-step-ahead residuals and implement a cross-validation approach to analyze the generalizability of the estimates obtained. the one-step-ahead residuals were calculated by first using the estimates from a particular location along with the data to predict the next day's number of cases, Ñ i ; each Ñ i is calculated using n , n , …, n i . the one-step-ahead residuals are then calculated as the difference between the observed and predicted counts, and we present these residuals averaged over the t days observed. generalizability of the results was studied using an ad hoc cross-validation (cv) technique. this is done by using the estimates obtained from one location to calculate the one-step-ahead residuals for another location.
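the one-step-ahead residuals can be sketched as follows. here each day's count is predicted from the preceding days via the mean m_i = r[i] · Σ_j p[j−1] · counts[i−j], and absolute residuals are averaged; the averaging convention is an assumption, and the data are hypothetical.

```python
def one_step_ahead_residuals(counts, r, p):
    """mean absolute one-step-ahead residual: predict day i's count
    from days < i, then average |counts[i] - m_i| over the days
    predicted; lower values indicate a better-fitting (r, p)."""
    k = len(p)
    resid = []
    for i in range(1, len(counts)):
        m = r[i] * sum(p[j - 1] * counts[i - j]
                       for j in range(1, min(k, i) + 1))
        resid.append(abs(counts[i] - m))
    return sum(resid) / len(resid)
```

the cross-validation variant simply evaluates this function with the (r, p) estimates fitted on one location and the observed counts from another.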
specifically, we use the boonah ship estimates to calculate residuals with the medic data and then use the medic estimates to calculate the residuals for the boonah data. for the maryland communities, we report the average of the residuals obtained using the estimates from one community to predict the epidemics in each of the other four communities, creating five cv estimates (one for each community). table gives the results for the serial interval distribution estimates. notable in these results is the striking consistency in the estimates of the first moment, with the exception of cumberland. the second moments vary much more, however. in general they tend to be much larger for the ships when using the garske et al. method than when using the mle method. for the communities, we observe that they are consistently around for the garske et al. method and vary much more for the mle method. also of interest in these results are the large error estimates, particularly for cumberland, but also to a smaller extent for frederick. this is perhaps indicative of the model not fitting the data as well; for instance, the logistic model may not be the best fit in this scenario, or the lack of census data on cases might be more problematic here. in table and figure , we present the results for the estimation of the effective reproductive number. evident in these results is the large initial reproductive number for the boonah ship. this is likely due to some of the missing data at the beginning of the epidemic, with the model thus attributing the large number of cases that rapidly develop to the few individuals who were initially reported. the logistic model fits this as accurately as possible, but perhaps the important message is the qualitative result, indicating that initial transmission in this susceptible, non-quarantined population was very high and rapidly decreased as many became infected. the result is similar for the medic, though the initial value is not as high.
we also note that the reproductive number dropped to sub-epidemic levels rapidly (around days for both ships). in the maryland communities the initial reproductive number tended to be slightly lower (ranging from . in salisbury to . ). in table , we present the results of the residual analysis. we notice here that the garske et al. method often does better than the mle method. it is important to point out that the wt method of fitting the effective reproductive number overfits the model and suffers in terms of generalizability. this method essentially traces the epidemic curve, lagged by the mean of the generation time distribution. thus, according to the residuals, it appears that the wt method outperforms the mle. however, considering the importance of external validation and reproducibility, the model suffers somewhat, as evidenced by the cv measures. the exceptions to this are in the case of the boonah, where the cv measure is impacted by the large initial mle estimate of the reproductive number, and in cumberland, where it appears that either the parametric model chosen may not represent the best fit to the data or there were sensitivities to the survey data. we have presented results that are informative with regard to the dynamics of the influenza pandemic in different populations and provide insight into two methodologies for estimating basic epidemiological parameters. both methods assume that the population is closed, there are no missing cases, and there is no migration to or from the population. the second of these assumptions is clearly violated with the data from maryland; however, the results appear to be reasonably robust to this discrepancy, except in the case of cumberland. the purpose of the exercise determines to some extent which methodological approach we might favor. if the intent is simply to estimate the parameters for a specific epidemic and better understand what exactly was occurring in that setting, then the method presented by [ ] (garske et al.)
appears to provide a good fit. the caveat that we see in this method is that by estimating the effective reproductive number with the methodology of [ ] (wt), this parameter is overfitted and essentially traces the epidemic curve, lagged by the mean of the serial interval. it is not clear whether this is a desirable or informative property. the mle method has greater promise for generalizability. while it can be argued that adhering to a parametric definition of the shape of the effective reproductive number leads to a greater chance of lack of fit, it can also lead to a result that is interpretable for other settings similar to that being studied. one can choose any reasonable parametric form for modeling the effective reproductive number. here we have only shown the four parameter logistic model, and feel that it is suitable in most cases where the epidemic curve has a single peak. it is feasible that this model may not apply well in all situations. another approach might be to analyze the data using the garske et al. method, then smooth the plot of the effective reproductive number and from this determine a parametric form that closely approximates the smoothed curve. if multiple models are implemented, the residual analysis that we have shown provides a valuable tool for model assessment and comparison. the results of these models can be sensitive to underreporting early in the epidemic. we see this clearly in the boonah, where it was acknowledged that there was underreporting early on, and this led to very high estimates for the initial reproductive number. similarly, in cumberland, if we remove the first five days of data (three cases on the first day, six cases on the second, and then no cases the following three days) we get much more reasonable estimates (m̂ = . , ŝ = . ) with smaller residuals ( . ).
therefore, it is important to note that unusual observations in the first few days can impact the estimates, and one should pay careful attention to this possibility. overall, both methodologies presented are valuable tools that can be used in tandem for understanding the dynamics of infectious disease epidemics. these methods are easy to implement and interpret. the results that we have presented suggest that the average serial interval for pandemic influenza in was consistently between three and four days, regardless of the setting. the standard deviation of the serial interval distribution varied much more for the mle method, depending on the location. the garske et al. estimates indicate that the value was consistently smaller in the communities than in the ships. it is not clear exactly how to interpret this result. further, we consistently see a large initial value for the reproductive number. in the ships, this value is higher and rapidly drops off, perhaps due to the close quarters and extremely rapid transmission that could take place in these very vulnerable populations. in the communities, the reproductive number tended to drop off later, typically around day thirty. this could be due to a larger initial susceptible population and more complicated dynamics for the disease to spread, leaving large pockets of susceptible individuals unexposed for a longer period of time than in the ships.
references:
the influenza pandemic: insights for the st century
pandemic influenza: past, present and future
the interval between successive cases of an infectious disease
infectious diseases of humans
definition and estimation of an actual reproduction number describing past infectious disease transmission: application to hiv epidemics among homosexual men in denmark
herd immunity and herd effect: new insights and definitions
different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
a note on generation times in epidemic models
transmission dynamics and control of severe acute respiratory syndrome
estimating individual and household reproduction numbers in an emerging epidemic
estimating in real time the efficacy of measures to control emerging communicable diseases
the transmissibility of highly pathogenic avian influenza in commercial poultry in industrialized countries
transmissibility of pandemic influenza
transmission dynamics of the great influenza pandemic of
estimates of the reproduction numbers of spanish influenza using morbidity data
an influenza simulation model for immunization studies
community trials of vaccination and the epidemic prevention potential
containing pandemic influenza with antiviral agents
influenza in maryland: preliminary statistics of certain locations
a likelihood based method for real time estimation of the serial interval and reproductive number of an epidemic
a flexible growth function for empirical use
radioligand assay
statistical analysis of radioimmunoassay data

these results confirm the high pathogenicity of influenza and its ability to spread rapidly through populations. it also appears that the hallmark of the spread of influenza in a closed population, without the ability to implement control measures, is a large initial reproductive number that declines rapidly.
in more diffuse communities with complicated dynamics, it is likely that the reproductive number will not decline as rapidly.

author contributions: conceived and designed the experiments: lw mp. performed the experiments: lw. analyzed the data: lw. wrote the paper: lw.

key: cord- -bqdvx authors: rice, ken; wynne, ben; martin, victoria; ackland, graeme j title: effect of school closures on mortality from coronavirus disease : old and new predictions date: - - journal: bmj doi: . /bmj.m sha: doc_id: cord_uid: bqdvx

objective: to replicate and analyse the information available to uk policymakers when the lockdown decision was taken in march in the united kingdom. design: independent calculations using the covidsim code, which implements imperial college london's individual based model, with data available in march applied to the coronavirus disease (covid- ) epidemic. setting: simulations considering the spread of covid- in great britain and northern ireland. population: about million simulated people, matched as closely as possible to actual uk demographics, geography, and social behaviours. main outcome measures: replication of summary data on the covid- epidemic reported to the uk government scientific advisory group for emergencies (sage), and a detailed study of unpublished results, especially the effect of school closures. results: the covidsim model would have produced a good forecast of the subsequent data if initialised with a reproduction number of about . for covid- . the model predicted that school closures and isolation of younger people would increase the total number of deaths, albeit postponed to a second and subsequent waves. the findings of this study suggest that prompt interventions were highly effective at reducing peak demand for intensive care unit (icu) beds but also prolong the epidemic, in some cases resulting in more deaths long term. this happens because covid- related mortality is highly skewed towards older age groups.
in the absence of an effective vaccination programme, none of the proposed mitigation strategies in the uk would reduce the predicted total number of deaths below . conclusions: it was predicted in march that, in response to covid- , a broad lockdown, as opposed to a focus on shielding the most vulnerable members of society, would reduce immediate demand for icu beds at the cost of more deaths long term. the optimal strategy for saving lives in a covid- epidemic is different from that anticipated for an influenza epidemic, which has a different mortality age profile. the united kingdom's national response to the coronavirus disease (covid- ) pandemic has been widely reported as being primarily led by modelling based on work using an individual based model (ibmic) from imperial college london, although other models have been considered. in this paper, we maintain the distinction between the epidemiological "model" (ibmic) and its software implementation, the "code" (covidsim). the key paper (report : impact of non-pharmaceutical interventions (npis) to reduce covid- mortality and healthcare demand) investigated several scenarios using ibmic with the best parameterisation available at the time. contrary to popular perception, the lockdown, which was then implemented, was not specifically modelled in this work. as the pandemic has progressed, the parameterisation has been continually improved as new data become available. the main conclusions of report were not especially surprising. mortality from covid- is around %, so an epidemic in a susceptible population of million people would cause many hundreds of thousands of deaths. in early march , the case doubling time in the uk might have been around three days, meaning that within a week cases of covid- could go from accounting for a minority of available intensive care unit (icu) beds to exceeding capacity.
furthermore, with a disease onset delay of more than a week and limited or delayed testing and reporting in place, there would be little measurable warning of the surge in icu bed demand. one table in report , however, shows that closing schools reduces the reproduction number of covid- but with the unexpected effect of increasing the total number of deaths. in this paper, we reproduce the main results from report and explain why, in the framework of the ibmic model, these counterintuitive results were obtained. we chose not to re-parameterise the model as we wanted to replicate the information available to policymakers at the time, specifically highlighting policies for which suppressing the outbreak and saving lives were conflicting choices. ibmic was developed from an influenza pandemic model. the original code used for report has not been released. however, the team at imperial college london, headed by epidemiologist neil ferguson, collaborated with microsoft, github, and the royal society rapid assistance in modelling the pandemic (ramp) initiative to recreate the model in the covidsim code: this version has been stringently externally validated. we used github tagged version . . plus additional patches dated before june , the full technical details of which are published elsewhere. ferguson et al supplied the input files relevant to report that were included in the github release. covidsim performs simulations of the uk at a detailed level without requiring personal data. the model includes millions of individual "people" going about their daily business-for example, within communities and at home, schools, universities, places of work, and hospitals. the geographical representation of the uk is taken from census data, so the distribution of age, health, wealth, and household size for simulated people in each area is appropriate. 
the model also includes appropriate numbers, age distribution, and commuting distances of people in the simulated schools and workplaces, each in line with national averages. the network of interactions is age dependent: people interact mainly with their own age group and with family, teachers, and carers. the virus (severe acute respiratory syndrome coronavirus ) initially infects random members of this network of interacting coworkers, strangers, friends, and family. whenever an infected person interacts with a noninfected person, there is a probability that the virus will spread. this probability depends on the time and proximity of the interaction and the infectiousness of the person according to the stage of the disease. infected people might be admitted to hospital and might die, with the probability dependent on age, pre-existing conditions, and stage of the disease. this extremely detailed model is then parameterised using the best available expert clinical and behavioural evidence, with coronavirus specific features being updated as more coronavirus specific data become available from the worldwide pandemic. therefore, the model has the required complexity to consider non-pharmaceutical interventions, which would reduce the number of interactions between simulated people in the model (table ) . to predict policymaking, it is assumed that these interventions are implemented when demand for icu beds reaches a particular "trigger" level. as the model contains far more realistic detail than the data available, the results are averages over many runs with different starting conditions, all of which are consistent with known data. the real epidemic is just one of these possibilities, so the code determines the range of scenarios for which plans should be made. 
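the trigger mechanism described above can be illustrated with a toy deterministic sir model, not covidsim itself: the contact rate is reduced once icu demand crosses a trigger threshold and restored when the intervention window ends. all parameter values below are illustrative assumptions, not values from the report.

```python
def sir_with_trigger(beta, gamma, icu_frac, trigger, reduced_beta,
                     duration, n=1_000_000, i0=100, days=200):
    """Toy deterministic SIR with daily time steps (not CovidSim).

    The contact rate switches from `beta` to `reduced_beta` for
    `duration` days once ICU demand (icu_frac * infectious) first
    reaches `trigger`.  Returns (peak ICU demand, total ever infected).
    """
    s, i, r = n - i0, i0, 0
    started, t_start, peak = False, 0, 0.0
    for t in range(days):
        icu = icu_frac * i
        peak = max(peak, icu)
        if not started and icu >= trigger:
            started, t_start = True, t
        b = reduced_beta if started and t - t_start < duration else beta
        new_inf = b * s * i / n   # new infections this day
        new_rec = gamma * i       # recoveries this day
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return peak, i + r
```

with r0 = beta/gamma = 2 and a strong intervention (reduced_beta below gamma), the peak icu demand during the simulated window stays close to the trigger level, whereas the unmitigated run overshoots it by more than an order of magnitude; running the same model for longer after the intervention ends reproduces the second-wave behaviour discussed later in the text.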
this is particularly important when the numbers of localised outbreaks are low: the prediction that local spikes will occur somewhere is reliable, and the most likely places can be identified, but predicting exactly when and where is not possible with the level of data available. all interventions reduce the reproduction number and slow the spread of the disease. however, a counterintuitive result presented in report (table and table a in that report) is the prediction that, once all other considered interventions are in place, the additional closure of schools and universities would increase the total number of deaths. similarly, adding general social distancing to a scenario involving household isolation of suspected cases (case isolation) and household quarantine of family members, with appropriate estimates for compliance, was also projected to increase the total number of deaths. patients or the public were not involved in the design, conduct, reporting, or dissemination plans of our research. all data used were retrieved from existing public sources, as referenced. we plan to share this on social media, twitter, and blogs. to reproduce the result tables for the scenarios presented in report , we averaged over simulation runs with the same random number seeds as used in the original report. the simulations are run for days, with day being january . the simulated intervention period lasts for three months ( days), with some interventions extended for an additional days. in reality, interventions were in place for rather longer, which delayed the second wave but had little effect on deaths. the mitigation scenarios in report considered reproduction numbers of r = . and r = . . 
as highlighted by ferguson et al, the results we obtain here are not precisely identical to those in report because they are an average of stochastic realisations, the population dataset has changed to an open source one, and the algorithm used to assign individuals from households to other places such as schools, universities, and workplaces has been modified to be deterministic. we also count deaths in all waves, not just the first. the stochasticity gives a variance of around % in total number of deaths and icu bed demand between different realisations using different random numbers. more important is the uncertainty of the timing of the peak of the infections between realisations, which is around five days. we compared these predictions to the death rates from the actual trajectory of covid- . nhs england stopped publishing data on critical bed occupancy in march , so it was not possible to compare icu data from the model with real world data. table shows the demand for icu beds and table shows the total number of deaths; in both, the same mitigation scenarios as presented in report were used. as in report , for each mitigation scenario we considered a range of icu triggers. in table we report the peak icu bed demand across the full simulation for each trigger, as was presented in report , but we also include the peak demand for icu beds during the period of the intervention (first wave). the latter we define as the period during which general social distancing was in place when implemented. table reports the total number of deaths across the entire simulation as well as the number of deaths at the end of the first wave, again defined as the time at which general social distancing was lifted. table and table present the full simulation numbers, which are essentially the same as those presented in table a in report . 
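the run-to-run variability described above can be mimicked with a crude generation-based stochastic epidemic, a stand-in for covidsim's stochastic realisations; all parameter values are illustrative.

```python
import random

def peak_day(seed, r0=1.5, contacts=10, gen_days=5, n=20_000, i0=20):
    """Crude generation-based stochastic SIR.

    Each infectious person makes `contacts` contacts per generation;
    each contact infects with probability (r0/contacts) * s/n, so the
    mean early reproduction number is r0.  Returns the day (generation
    index times generation time) on which new infections peaked.
    """
    rng = random.Random(seed)
    s, i = n - i0, i0
    best_new, best_gen, gen = i0, 0, 0
    while i > 0 and gen < 100:
        p = (r0 / contacts) * (s / n)
        new = sum(1 for _ in range(i * contacts) if rng.random() < p)
        new = min(new, s)
        s -= new
        gen += 1
        if new > best_new:
            best_new, best_gen = new, gen
        i = new
    return best_gen * gen_days
```

running several seeds shows a spread of days to weeks in the timing of the peak even though every run uses identical parameters, which is why results are reported as averages over many realisations rather than a single run.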
table also illustrates the counterintuitive result that adding school closures to a scenario with case isolation, household quarantine, and social distancing in people older than years would increase the total number of deaths across the full simulation. moreover, it shows that social distancing in those over would be more effective than general social distancing. table and table show that in some mitigation scenarios the peak demand for icu beds and most deaths occur during the period when the intervention is in place. there are, however, other scenarios when the opposite is true. the reason for this is illustrated in figure . the mitigation scenarios of "do nothing," place closures, case isolation, case isolation with household quarantine, and case isolation, household quarantine, and social distancing of over s are as presented in figure of report . we also show some additional scenarios (case isolation and social distancing; case isolation, household quarantine, and general social distancing; and place closures, case isolation, household quarantine, and social distancing of over s) that are not shown in figure of report but are included in table and table and in the tables in report . in the simulations presented here, the main interventions are in place for three months and end on about day (some interventions are extended for an additional days). figure shows that weaker intervention scenarios lead to a single wave that occurs during the period in which the interventions are in place. hence the peak demand for icu beds occurs during this period, as do most deaths. stronger interventions, however, are associated with suppression of the infection such that a second wave is observed once the interventions are lifted. for example, adding place closures to case isolation, household quarantine, and social distancing of over s substantially suppresses the infection during the intervention period compared with the same scenario without place closures.
however, this suppression then leads to a second wave with a higher peak demand for icu beds than during the intervention period, and total numbers of deaths that exceed those of the same scenario without place closures. we therefore conclude that the somewhat counterintuitive result that school closures lead to more deaths is a consequence of the addition of interventions that suppress the first wave, combined with a failure to prioritise protection of the most vulnerable people. when the interventions are lifted, there is still a large population who are susceptible and a substantial number of people who are infected. this then leads to a second wave of infections that can result in more deaths, but later. further lockdowns would lead to a repeating series of waves of infection unless herd immunity is achieved by vaccination, which is not considered in the model. a similar result is obtained in some of the scenarios involving general social distancing. for example, adding general social distancing to case isolation and household quarantine was also strongly associated with suppression of the infection during the intervention period, but then a second wave occurs that actually has a higher peak demand for icu beds than the equivalent scenario without general social distancing. figure provides an explanation for how place closure interventions affect the second wave and why an extra intervention might result in more deaths than the equivalent scenario without this intervention. in the scenario of case isolation, household quarantine, and social distancing of over s but without place closures, a single peak of cases is seen. the data are broken down into age groups, showing that younger people contribute most to the total number of cases, but that deaths are primarily in older age groups. adding the place closure intervention (and keeping all other things constant) gives the behaviour shown in the second row of plots.
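the mechanism, suppression followed by a rebound, can be reproduced with a minimal deterministic sir model in which the contact rate is reduced for a fixed window and then restored; this is a sketch with illustrative parameters, not covidsim.

```python
def infectious_series(beta_fn, gamma=0.2, n=1_000_000, i0=100, days=500):
    """Daily infectious counts for a deterministic SIR model whose
    contact rate beta_fn(t) may change over time."""
    s, i, r = n - i0, i0, 0
    series = []
    for t in range(days):
        series.append(i)
        new = beta_fn(t) * s * i / n   # new infections this day
        rec = gamma * i                # recoveries this day
        s, i, r = s - new, i + new - rec, r + rec
    return series

# "lockdown" from day 20 to day 110, then full contact rate again
locked = infectious_series(lambda t: 0.1 if 20 <= t < 110 else 0.4)
```

the infectious count is suppressed during the window, but because the susceptible pool is barely depleted, lifting the measures produces a second wave whose peak far exceeds anything seen while the intervention was in place.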
the initial peak is greatly suppressed, but the end of place closures while other social distancing is in place prompts a second peak of cases among younger people. this then leads to a third, more deadly, peak of cases affecting elderly people when social distancing of over s is removed. postponing the spread of covid- means that more people are still infectious and are available to infect older age groups, of whom a much larger fraction then die. one criticism of school closure is that reduced contact at school leads to increased contact at home, meaning that children infect high risk adults rather than low risk children. we investigated this by increasing the infection rate at home to an extremely high level. figure shows that this makes an insignificant difference compared with the overall effect of adding school closures to the other interventions (despite the description of place closure interventions in table of report , university closures are not included in the scenario parameter file representing place closure, case isolation, household quarantine, and social distancing of over s).
description of a second wave in covidsim
although report discusses the possibility that relaxing the interventions could lead to a second peak later in the year, we wanted to explore this in more detail using the newer covidsim code and the latest set of parameter files (summarised in table and table ). the scenario of place closures, case isolation, household quarantine, and social distancing of over s would minimise peak demand for intensive care but prolong the epidemic, resulting in more people needing intensive care and more deaths. these findings illustrate why adding place closures to a scenario with case isolation, household quarantine, and social distancing of over s can lead to more deaths than the equivalent scenario without place closures. doing so suppresses the infection when the interventions are present but leads to a second wave when interventions are lifted.
in the model this happened in july , after a day lockdown: in practice the first lockdown was extended into august, so the second wave was postponed to september. the total number of deaths in the scenario of case isolation, household quarantine, and social distancing of over s is , whereas when place closures are included the total number is . similarly, comparing general social distancing with equivalent scenarios without social distancing, the second wave peak in the case isolation, household quarantine, and general social distancing scenario is higher than the first wave peak in the case isolation and household quarantine scenario.
[figure legend: icu=intensive care unit; pc=place closures; ci=case isolation; hq=household quarantine; sdol=social distancing of over s; sd=general social distancing]
the interventions we consider are place closures, case isolation, household quarantine, and general social distancing, which are implemented using the pc_ci_hq_sd parameter file. specifically, we use the parameter file available in the data/param_files subdirectory of the github repository. the only modification was to change the duration of the interventions to days. these interventions start in late march (day ) and last for three months ( days). these simulations are also initialised so that about deaths occur by day ( april) in all scenarios, mostly in people infected before the interventions were implemented. initialisation is done by modifying the "number of deaths accumulated before alert" parameter in the preuk_ . .txt parameter file. this compares with how the report simulations were initialised, which used the reported deaths to march. the results are presented in figure . the top panel shows the cumulative number of deaths, using data from national records of scotland and connors and fordham, whereas the bottom panel shows icu bed demand per people. although our simulations include northern ireland, the available reported data do not.
therefore, the simulation results and data presented in figure are only for england, wales, and scotland. we also consider a range of reproduction numbers and find that values higher than those considered in report best reproduce the data, with a value between . and . probably providing the best fit. this is consistent with the analysis presented in flaxman et al, but we acknowledge that the data could also be fitted by changes to the other scenario parameters. in both panels we also show the "do nothing" scenario for a reproduction number of . . the scenarios presented in figure are predicted to substantially reduce the demand for icu beds. the best fit to the code suggests about % infection rate in the first wave. random antibody testing at the time of writing (june ) suggests that about % of the population test positive for antibodies to coronavirus, although the large number of deaths in care homes suggest the post-lockdown first wave was concentrated in the over s age group. interventions are triggered by reaching cumulative intensive care unit cases. after the trigger, all the interventions are in place for days: the general social distancing runs to day and the enhanced social distancing for over s runs for an extra days. results are broken down into age categories, with social distancing of over s interventions affecting the three oldest groups. in the case isolation, household quarantine, and social distancing of over s scenario, a single peak of cases is seen, with greatest infection in the younger age groups but most deaths in the older age groups. in the place closures, case isolation, household quarantine, and social distancing of over s scenario, three peaks occur in the plot of daily cases, with the first peak appearing at a similar time to the other scenario, but with reduced severity. 
the second peak seems to be a response to the ending of place closure and mostly affects the younger age groups, and therefore has little impact on the total number of deaths. the third peak, triggered by relaxing social distancing of over s, affects the older age groups, leading to a substantial increase in the total number of deaths. (an editorial published in the bmj in september suggests this % could be an underestimate because iga antibodies and t cell immunity were overlooked.) with only - % immunity after the lockdown, the epidemiological situation at the outset of the second wave is similar to that of march. consequently, the number of second wave infections is predicted to be similar to that of the first wave, with a somewhat lower death rate. in practice, it seems that mandatory and voluntary interventions short of a full lockdown will continue and maintain the reproduction number closer to . this will mean slower exponential growth of the second wave and keep the peak demand for icu beds manageable, although, since the epidemic is prolonged, the effect on total deaths is smaller. it is worth noting that a reproduction number of is also the value that prolongs the need for interventions for the longest. at this level, the inhomogeneity of transmissions, particularly the unpredictability of superspreading events, becomes critical. despite the level of detail in the model, the data are insufficient to model real people: we observed that for a major national epidemic, insufficient data introduce an uncertainty of about five days in the predictions. at a local level, and with a lower reproduction number, this uncertainty in the timing of the epidemic is greatly increased: it is impossible to predict when a particular town will experience an outbreak (specifically, different towns experience outbreaks in different runs of the code).
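the observation that a reproduction number just above the epidemic threshold prolongs the epidemic can be checked with a one-line generation recursion; the parameter values are illustrative.

```python
def epidemic_duration(re0, n=1_000_000, i0=1_000, gen_days=5):
    """Generations (converted to days) until new infections per
    generation fall below one, when each case causes re0 * s/n new
    cases in the next generation (depletion-adjusted reproduction)."""
    s, i, gens = n - i0, i0, 0
    while i >= 1 and gens < 2000:
        new = min(re0 * i * s / n, s)
        s -= new
        i = new
        gens += 1
    return gens * gen_days
```

a reproduction number of 2 burns through the susceptible population and is over within months, whereas a value just above 1 produces a much longer epidemic at a far lower peak, which is the trade-off described in the text.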
in this paper we used the recently released covidsim code to reinvestigate the mitigation scenarios for covid- from ibmic presented in mid-march in report . the motivation behind this was that some of the results presented in the report suggested that the addition of interventions restricting younger people might actually increase the total number of deaths from covid- . we find that the covidsim code reliably reproduces the results from report and that the ibmic can accurately track the data on death rates in the uk. reproducing the real data does require an adjustment to the parameters and a slightly higher reproduction number than considered in report and implies an earlier start to the epidemic than suggested by the report. we emphasise that the unavailability of these parameters in early march is not a failure of the ibmic model. we confirm that adding school and university closures to case isolation, household quarantine, and social distancing of over s would lead to more deaths compared with the equivalent scenario without the closures of schools and universities. similarly, general social distancing was also projected to reduce the number of cases but increase the total number of deaths compared with social distancing of over s only. we note that in assessing the impact of school closures, uk policy advice has concentrated on reducing total number of cases and not the number of deaths. the qualitative explanation is that, within all mitigation scenarios in the model, the epidemic ends with widespread immunity, with a large fraction of the population infected. strategies that minimise deaths involve the infected fraction primarily being in the low risk younger age groups-for example, focusing stricter social distancing measures on care homes where people are likely to die rather than schools where they are not. 
optimal death reduction strategies are different from those aimed at reducing the burden on icus, and different again from those that lower the overall case rate. it is therefore impossible to optimise a strategy for dealing with covid- unless these three desirable outcomes are prioritised. we find that scenarios that are very effective when the interventions are in place, can then lead to subsequent waves during which most of the infections, and deaths, occur. our comparison of updated model results with the published death data suggests that a similar second wave will occur later this year if interventions are fully lifted. more realistically, if the case isolation, household quarantine, and social distancing of over s strategy is followed, alongside other nonpharmaceutical intervention measures such as nonmandatory social distancing and improved medical outcomes, the second wave will grow more slowly than the first, with more cases but lower mortality. since this paper was first written (june ), uk policy has moved to more local interventions. covidsim models the geography of all towns, but only the simulated people are representative of the true population. this uncertainty means that the model cannot reliably predict which town will experience an outbreak. specifically, whereas the timing of the national outbreak is uncertain by days, the timing of an outbreak in a town is uncertain by months. ibmic is the most precise model available, but substantially more personal data would be needed to obtain reliable local predictions. finally, we re-emphasise that the results in this work are not intended to be detailed predictions for the second wave of covid- . rather, we re-examined the evidence available at the start of the epidemic. more accurate information is now available about the compliance with lockdown rules and age dependent mortality. 
the difficulty in shielding care home residents is a particularly important set of health data that was not available to modellers at the outset. nevertheless, in all mitigation scenarios, epidemics modelled using covidsim eventually finish with widespread infection and immunity, and the final death toll depends primarily on the age distribution of those infected and not the total number. we thank kenji takeda and peter clarke for help with the covidsim code and neil ferguson for advice and sharing data. contributors: kr and bw ported and validated the code across several computer architectures, performed the calculations, and produced the figures. vm supervised the testing and preopensourcing test of the covidsim code. gja designed and supervised the project. all authors contributed to writing the paper. all authors act as guarantors. the corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. funding: this paper was supported by a uk research and innovation grant st/v x/ under the covid- initiative. this work was undertaken, in part, as a contribution to the rapid assistance in modelling the pandemic (ramp) initiative, coordinated by the royal society. the funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication. competing interests: all authors have completed the icmje uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: support from uk research and innovation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
[figure caption fragment: and connors and fordham. bottom panel shows demand for intensive care unit (icu) beds per people, including an unmitigated second wave.]
[figure caption, continued: a range of reproduction numbers were considered and values higher than that considered in report were found to best reproduce the data. a good fit also requires an assumption that the epidemic started in january , earlier than was previously assumed in report . covidsim is seen to provide a good fit to the data with a reproduction number between . and . and predicts that the demand for icu beds would probably be limited to around per people.]
references:
imperial college london
estimating the global infection fatality rate of covid-
challenges in control of covid- : short doubling time and long delay to effect of interventions
strategies for mitigating an influenza pandemic
modeling targeted layered containment of an influenza pandemic in the united states
codecheck certificate - covid- covidsim model
(covid- ) infection survey pilot: england and wales
estimating the effects of non-pharmaceutical interventions on covid- in europe
antibody prevalence for sars-cov- following the peak of the pandemic in england: react study in adults
covid- : do many people have pre-existing immunity?
interdisciplinary task and finish group on the role of children in transmission. modelling and behavioural science responses to scenarios for relaxing school closures
ethical approval: not required. data sharing: the full simulation and datasets can be accessed and run from github using the sha hash code d c a ab d f f fdd a . code examples and raw data sufficient to reproduce all results in this research are available at https://doi.org/ . /ds/ .
the lead author (gja) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained. dissemination to participants and related patient and public communities: since this research uses public demographic data for the whole of the uk, there are no plans for dissemination of this research to specific participants, beyond publishing it. provenance and peer review: not commissioned; externally peer reviewed. this is an open access article distributed in accordance with the creative commons attribution non commercial (cc by-nc . ) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. see: http://creativecommons.org/licenses/by-nc/ . /. key: cord- -kn njkqg authors: botha, andré e.; dednam, wynand title: a simple iterative map forecast of the covid- pandemic date: - - journal: nan doi: nan sha: doc_id: cord_uid: kn njkqg we develop a simple -dimensional iterative map model to forecast the global spread of the coronavirus disease. our model contains at most two fitting parameters, which we determine from the data supplied by the world health organisation for the total number of cases and new cases each day. we find that our model provides a surprisingly good fit to the currently-available data, which exhibits a cross-over from exponential to power-law growth as lock-down measures begin to take effect. before these measures, our model predicts exponential growth from day to , starting from the date on which the world health organisation provided the first `situation report' ( january – day ). based on this initial data the disease may be expected to infect approximately % of the global population, i.e. about .
billion people, taking approximately million lives. under this scenario, the global number of new cases is predicted to peak on day (about the middle of may ), with an estimated million new cases per day. if current lock-down measures can be maintained, our model predicts power law growth from day onward. such growth is comparatively slow and would have to continue for several decades before a sufficient number of people (at least % of the global population) have developed immunity to the disease through being infected. lock-down measures appear to be very effective in postponing the unimaginably large peak in the daily number of new cases that would occur in the absence of any interventions. however, should these measures be relaxed, the spread of the disease will most likely revert back to its original exponential growth pattern. as such, the duration and severity of the lock-down measures should be carefully timed against their potentially devastating impact on the world economy. on march , the world health organisation (who) characterised the outbreak of coronavirus disease (covid- ) as a pandemic, referring to its prevalence throughout the whole world. the outbreak started as a pneumonia of an unknown cause, which was first detected in the city of wuhan, china. it was reported as such to the who on the st december , and has since reached epidemic proportions within china, where it has infected more than citizens to date. during the first six weeks of , the disease spread to more than other countries, creating widespread political and economic turmoil, due to unprecedented levels of spread and severity. the rapid spread of covid- is fuelled by the fact that the majority of infected people do not experience severe symptoms, thus making it more likely for them to remain mobile, and hence to infect others. at the same time the disease can be lethal to some members of the population, having a globally averaged fatality ratio of . % so far.
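the cross-over between the two growth regimes mentioned above can be detected by comparing straight-line fits in semi-log and log-log coordinates; this is a standard diagnostic, not the authors' own fitting method.

```python
import math

def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def growth_regime(days, cases):
    """Return 'exponential' if log(cases) is more linear in day,
    'power law' if log(cases) is more linear in log(day)."""
    logs = [math.log(c) for c in cases]
    semilog = r_squared(list(days), logs)
    loglog = r_squared([math.log(d) for d in days], logs)
    return "exponential" if semilog >= loglog else "power law"
```

applied to a sliding window of the case counts, this comparison locates the day on which growth switches from exponential to power-law form.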
it is most likely this particular combination of traits that has made the covid- outbreak one of the largest in recorded history. in late and early a similar outbreak took place with the occurrence of severe acute respiratory syndrome (sars). although the etiological agent of sars is also a coronavirus, the virus was not able to spread as widely as in the current case. one possibility why the sars outbreak was less devastating than the current outbreak is, paradoxically, due to its much higher fatality ratio (almost % globally), making it too severe to spread easily. while there are a number of models available for the global spread of infectious diseases, some even containing very sophisticated traffic layers, relatively few researchers are making use of simpler models that can provide the big picture without many parameters that are difficult to interpret unambiguously. in the latter category of relatively simple models we could find only a discrete epidemic model for sars, and more recently, a comparison of the logistic growth and susceptible-infected-recovered (sir) models for covid- . in our present work we develop a simple discrete -dimensional iterative map model, which shares some similarities with the classic sir model. we show that our model fits the currently-available global data for covid- . the fact that the available data for the pandemic can be fitted well by a simple model such as ours suggests that past and current interventions to curb the spread of the disease, globally, may not be very effective. as a model for the global data we use a -dimensional iterative map, given by x_{i+1} = x_i + y_i, y_{i+1} = α y_i (z_i − x_i)/z_0 and z_{i+1} = z_i − c y_i, where x_i is the total number of confirmed cases, y_i is the number of new cases and z_i is the global population, on any given day i. we denote the only fitting parameter by α, while c is a fixed parameter equal to the fraction of people who have died from the disease. according to the latest available data from the who (see table in methods), c = . .
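the map can be iterated directly; the update rules below are a reconstruction from the verbal description in the text (the equation typography was lost in extraction), with the 1/z_0 factor placed in the new-case equation so that α stays of order unity, and all parameter values illustrative.

```python
def iterate_map(alpha, c, z0, days, x0=1.0, y0=1.0):
    """Iterate the 3-dimensional map reconstructed from the text:

        x_{i+1} = x_i + y_i                        (total cases)
        y_{i+1} = alpha * y_i * (z_i - x_i) / z0   (new cases)
        z_{i+1} = z_i - c * y_i                    (population)

    Returns the list of (x, y, z) states, one per day.
    """
    x, y, z = x0, y0, z0
    states = [(x, y, z)]
    for _ in range(days):
        # tuple assignment updates all three variables simultaneously
        x, y, z = x + y, alpha * y * (z - x) / z0, z - c * y
        states.append((x, y, z))
    return states
```

early on, x is much smaller than z0, so y grows geometrically by a factor of roughly α per day; as x approaches z the susceptible term shuts the growth off, as in the sir model.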
by using levenberg-marquardt (least-squares) optimisation we find α = . , for the initial conditions x = y = . and z = . × . we briefly describe the physical content of eqs. ( ). the first equation simply updates the total number of cases by setting it equal to the previous total number of cases plus the number of new cases. here the factor of /z_0 has been introduced for convenience, to ensure that the proportionality constant α remains close to unity. in the second equation the number of new cases is assumed to be proportional to the previous number of new cases multiplied by the previous number of susceptible people. the third equation keeps track of the global population by subtracting the estimated number of people who have died each day, based on the fraction c. figure shows a comparison of the data with the model, as well as a forecast made up to the th day (as we see in table ). the forecast made in figure (b) (corresponding to the last row of table ) predicts that approximately a quarter of the world's population, i.e. ≈ . / . = . , would have had covid- by the th day. the peak of the pandemic is expected to occur on day , when about million daily new cases can be expected. we also predict that by the beginning of august hardly any new cases should occur; however, the total number of lives lost by then could be as high as million. in table we see that the fitting parameter α, and hence the predictions made by the model, do change somewhat as more of the available data is used in the fitting procedure. to see the variation in α more clearly we have plotted the first and second columns of table in figure . it shows that, as more data is used, there seems to be a general upward trend in α, until day . at the same time the increase in α is not monotonic, since α appears to oscillate.
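the three update rules just described can be sketched as a short simulation. this is an illustrative sketch only: every numeric value below (α, c, the initial conditions, the horizon) is a hypothetical placeholder, not a fitted value from the paper.

```python
# Sketch of the 3-dimensional iterative map described above.
# x: total confirmed cases, y: new cases, z: global population.
# All numeric values here are illustrative placeholders, NOT fitted values.

def iterate_map(alpha, c, x0, y0, z0, days):
    """Run the map for `days` steps; return the (x, y, z) history."""
    x, y, z = x0, y0, z0
    history = []
    for _ in range(days):
        x_new = x + y                      # eq. 1: total cases accumulate the new cases
        y_new = alpha * y * (z - x) / z0   # eq. 2: new cases ~ previous new cases x susceptibles
        z_new = z - c * y                  # eq. 3: population loses the estimated deaths
        x, y, z = x_new, y_new, z_new
        history.append((x, y, z))
    return history

hist = iterate_map(alpha=1.1, c=0.034, x0=1.0, y0=1.0, z0=7.6e9, days=300)
peak_day = max(range(len(hist)), key=lambda i: hist[i][1])
```

scanning the history for the largest y gives the model's peak day of new cases; with a real data set, α would instead be obtained by least-squares fitting, as the original script does with scipy.optimize.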
while we are not sure, at this stage, whether α is converging, or whether it will continue to increase (or decrease) in the future, we note that the variations in α, and the predictions made by the model, are relatively small over the last two weeks. thus it seems that, as more data is used to calculate α, the variations will become smaller, assuming that there are no systematic errors in the current or future data (see discussion). (table columns: last day, α, x × , max{y} × , day of max{y}.) in figure we also plot (blue solid line) the mean value ᾱ = . over the last ten days. the oscillations of the calculated data points about this line give an indication of the uncertainty in ᾱ over the last ten days. (figure caption: variation of the fitting parameter α as more and more of the available data is used in the fitting procedure. we see that the value of α seems to have stabilised over the last ten days, as discussed in the main text.) as a rough estimate of the uncertainties in the predictions made by the model, we also calculate the means and standard deviations of the other quantities given in the last ten rows of table . this results in x = ( . ± . ) billion, max{y} = ( ± ) million, peak day = ± , 'deaths' = ( ± ) million. while we realise that this method may not result in rigorous estimates for the uncertainties involved, we provide it here merely as a rough estimate of the sensitivity of our simple model to the new data coming in, as the pandemic continues. from the trend that can be seen in table , it seems that our current model actually provides a best-case-scenario prediction since, as more data becomes available, the resulting predictions become less and less optimistic, i.e. in terms of the total number of lives lost, etc. furthermore, as the disease spreads there will probably be many more unreported cases, either due to asymptomatic responses, or simply because the numbers are now becoming too large to manage (see, for example, ref. ).
in developing countries such as south africa there is also a relatively large percentage of people with compromised immunity, due to the high prevalence of human immunodeficiency virus (hiv), and this too could result in the coronavirus having a much larger impact than our model of the current global data shows. another factor to consider is the reliability of the who data itself. at present, this data is probably the most accurate we will ever have. however, as things progress, there will be a much greater chance of unreported cases, since people are now being instructed to contact the hospitals only if they experience severe symptoms. this means that all other cases are unlikely to be tested/confirmed. our present model does not take such details into account. one can of course try to answer more specific questions with a more sophisticated model, like the discrete model we mentioned for sars; however, here we have been more interested in developing a very simple model that brushes over the details and only captures the essential, large-scale behaviour. as we have already alluded to, our model may not be suitable for individual countries, because it does not include many factors that may be necessary to predict the spread of the disease in specific situations. additionally, one must realise that the population of one country is much smaller than the world's, and the initial interventions taken could range from minimal to very severe, as in italy and china, for example. in contrast with this, on a global scale, the population is essentially limitless, and it is nearly impossible to impose restrictions on everybody. hence, it is our contention that the virus will spread more naturally on a global scale, almost as if it were left completely unchecked. the direct human cost of such an unchecked spread could be truly devastating.
on the one hand it could result in a catastrophic loss of tens of millions of lives, as our model predicts, but on the other hand, all the (possibly ineffective) measures being taken by individual countries to contain the virus could also have fatal consequences. so far these measures have included enforced quarantine, which has led to a severe slowdown in economic activity and manufacturing production, principally due to declining consumption and disrupted global supply chains. (as an example of the severity of the slowdown in production, several major car manufacturers are gradually halting production in major manufacturing hubs throughout the developed world.) this decline, coupled with the associated economic uncertainty, has had knock-on effects in the form of historically unprecedented stock market falls. although the stock market is more of an indicator of the future value of the profits of listed corporations, their collapsed share prices could trigger severe financial crises because of a spike in bankruptcies. (the debt of us corporations is the highest it has ever been.) the inevitable loss of jobs will also lead to an inability to pay bills and mortgages, increased levels of crime, etc. in principle, such a major decline in economic conditions could also result in a large-scale loss of life, which should be weighed carefully against the direct effects that the unimpeded global spread of covid- could have. we have fitted our model to the data shown in table . for the reader's convenience, the complete python script for the optimisation is provided on the following page. in this script, the function leastsq(), imported from the module scipy.optimize, uses levenberg-marquardt optimization to minimize the residual vector returned by the function ef(). the function leastsq() is called from within main(), which reads in the data and sets up the initial parameter and the other two quantities (the initial values x[ ] and y[ ]) for optimisation.
these three quantities are then passed to leastsq() via the vector v. for the data in table , the output from the script should be:

references:
- who director-general's opening remarks at the media briefing on covid-
- covert coronavirus infections could be seeding new outbreaks
- insights from early mathematical models of -ncov acute respiratory disease (covid- ) dynamics
- gleamviz: the global epidemic and mobility model
- a discrete epidemic model for sars transmission and control in china
- estimation of the final size of the covid- epidemic
- an algorithm for least-squares estimation of nonlinear parameters
- test backlog skews sa's corona stats (the mail and guardian)
- virus could have killed million without global response (nature news)
- a covid- supply chain shock born in china is going global
- coronavirus: car production halts at ford, vw and nissan
- coronavirus: ftse , dow, s&p in worst day since
- a modern jubilee as a cure to the financial ills of the coronavirus
- coronavirus disease (covid- ) situation reports
- python scripting for computational science

a. e. b. would like to acknowledge m. kolahchi and v. hajnová for helpful discussions about this work. both authors wish to thank a. thomas for uncovering some of the related literature. a. e. b. devised the research project and performed all the numerical simulations. both authors analysed the results and wrote the paper. the authors declare no competing interests.

key: cord- - v q authors: pérez-cameo, cristina; marín-lahoz, juan title: serosurveys and convalescent plasma in covid- date: - - journal: eclinicalmedicine doi: . /j.eclinm. . sha: doc_id: cord_uid: v q

the current pandemic is not only overwhelming the health systems of the affected countries but is also killing thousands of otherwise healthy adults. convalescent plasma has been proposed [ ] and approved to treat covid- based on the experience acquired treating other viral diseases such as influenza, ebola, and sars [ ].
it is considered a safe treatment (at least its side effects and contraindications are well known) and it has proven to be efficacious in several viral infections for more than a century. currently, several countries and health institutions are trying to gather convalescent sera for either empirical treatment or clinical trials. based on the who interim guidance developed for the ebola outbreak [ ], convalescent plasma has advantages over other proposed treatments: it requires low technology (and therefore it can be produced where required, independently of pharmaceutical companies), it is low cost, and its production is easily scalable as long as there are sufficient donors. in other words, relative donor scarcity can threaten any plan to massively produce treatments based on plasma. due to the exponential nature of the pandemic, the number of current patients is greater than the number of recovered patients at any given time until the peak is reached. the number of identified recovered patients equals approximately the number of identified active patients three weeks earlier minus the deaths. in a supposed population with a steady growth of identified contagions of % (something similar to what is currently going on in several cities around the world), the number of active patients is between and times greater than the number of identified recovered patients (potential donors). this means, in the best-case scenario, there would be convalescent plasma available for – % of the identified current patients. the actual number of donors will probably be much lower, as not every convalescent patient willing to donate would be suitable (a great proportion are elderly and have comorbidities) and not every suitable potential donor will be willing to donate. fortunately, infection by sars-cov is not as lethal as that caused by ebola virus and many patients do not require treatment to overcome covid- .
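the arithmetic behind this donor estimate can be sketched as follows. since the growth figure used in the text is not reproduced here, the daily growth rates g below are hypothetical placeholders.

```python
# Active-to-recovered ratio under steady exponential growth of identified cases.
# The g values are HYPOTHETICAL daily growth rates, not the figure used in the text.

def active_to_recovered_ratio(g, lag_days=21, fatality=0.0):
    """Recovered ~ cumulative cases `lag_days` ago minus deaths;
    active ~ cases identified within the last `lag_days`."""
    cumulative_now = (1.0 + g) ** lag_days   # normalized: cumulative lag_days ago = 1
    recovered = 1.0 - fatality               # survivors of the earlier cohort
    active = cumulative_now - 1.0            # everyone identified since then
    return active / recovered

ratios = {g: active_to_recovered_ratio(g) for g in (0.05, 0.10, 0.15)}
```

the faster the identified contagions grow, the scarcer the potential donors are relative to the patients who might need plasma, which is the point made above.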
furthermore, the real number of convalescent patients may be much greater than the number based on the recovery of previously identified patients, because of the existence of asymptomatic and mild infections. this might be especially true in countries where most of the incident infections have been acquired from previously unidentified cases. this is currently the case in several western countries. the estimates from the imperial college london covid- response team also support the hypothesis that most of the cases go unrecognized. according to their calculations, among european countries (comprising million citizens), as of march th there were about million cases [ ]. populations that have been fully or randomly tested confirm the existence of asymptomatic infections [ – ]. accordingly, we propose that two sources of donors, not frequently identified, should be explored: paucisymptomatic patients and fully asymptomatic patients. serosurveys might identify as many donors as required for the growing number of patients who could benefit from convalescent plasma. serosurveys have been used to evaluate the immunity of some populations to infections such as ebola and sars after controlling the outbreaks. they are very useful for evaluating the susceptibility of a population, and therefore for calculating the peak of a current or subsequent outbreak using the sir model (susceptible, infectious, and recovered compartments) by estimating the size of the recovered compartment. this in turn is used to decide health policies. to our knowledge, serosurveys have not been used to drive plasma donations. in these diseases, known to have greater mortality, a significant proportion of asymptomatic individuals has been found seropositive among exposed populations [ , ]. to boost the ability of serosurveys to find potential donors, the approach could be modified to enrich the sample (albeit not obtaining representative data).
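the sir bookkeeping mentioned above can be sketched with a minimal forward-euler integration; the rate constants beta and gamma below are illustrative placeholders, not estimates for covid-19.

```python
# Minimal SIR model (forward-Euler steps), tracking the recovered compartment.
# beta (contact rate) and gamma (recovery rate) are illustrative placeholders.

def sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate S' = -beta*S*I/N, I' = beta*S*I/N - gamma*I, R' = gamma*I."""
    s, i, r = s0, i0, r0
    n = s0 + i0 + r0
    for _ in range(int(days / dt)):
        new_inf = beta * s * i / n * dt   # S -> I flow over one step
        new_rec = gamma * i * dt          # I -> R flow over one step
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
    return s, i, r

s_end, i_end, r_end = sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, days=300)
```

reading off r_end gives the model's recovered compartment, which is exactly the quantity a serosurvey measures directly.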
targeting populations at high risk of exposure, such as contacts or health workers, and self-identification of potentially convalescent patients using questionnaires, could easily lead to as many plasma donors as required before the number of contagions peaks. we declare no competing interests. no grants or financial support have been received for this work.

references:
- convalescent plasma as a potential therapy for covid-
- hark back: passive immunotherapy for influenza and other serious infections
- use of convalescent whole blood or plasma collected from patients recovered from ebola virus disease for transfusion, as an empirical treatment during outbreaks. interim guidance for national health authorities and blood transfusion services. geneva, switzerland: who
- estimating the number of infections and the impact of non-pharmaceutical interventions on covid- in european countries
- estimation of the asymptomatic ratio of novel coronavirus infections (covid- )
- covid- : identifying and isolating asymptomatic people helped eliminate virus in italian village
- spread of sars-cov- in the icelandic population
- serologic markers for ebolavirus among healthcare workers in the democratic republic of the congo
- anti-sars-cov immunoglobulin g in healthcare workers

key: cord- -xmq gm authors: cherednik, i. title: a surprising formula for the spread of covid- under aggressive management date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: xmq gm

we propose an algebraic-type formula that describes with high accuracy the spread of the covid- pandemic under aggressive management for the periods of intensive growth of the total number of infections. the formula can be used as a powerful forecasting tool. the parameters of the theory are the transmission rate, reflecting the viral fitness and "normal" frequency of contacts in the infected areas, and the intensity of prevention measures.
the duration of the period of intensive growth is essentially inversely proportional to the square root of the intensity of hard measures. a more precise formula is based on bessel functions. the data for the usa, uk, sweden, and israel are provided. power law of epidemics. the simplest equation for the spread of communicable diseases results in exponential growth of the number of infections, which is mostly applicable to the initial stages of epidemics. see e.g. [ , , , ] here and below, and [ ] about some perspectives with covid- . we focus on the middle stages, where the growth is no greater than some power function in time, which requires a different approach and different equations. the equally classical logistic models of the spread, as well as the sir, sid models and generalizations, assume that the number of infections is comparable with the whole population, which we do not impose. major new epidemics were not really of this kind during the last years, which is obviously due to better disease control worldwide. the reality now is the power-type growth of the total number of infections u(t) after a possible short period of exponential growth, covid- included. our approach is based on this assumption. generally, the rate of change of the total number of infections du(t)/dt is mostly related to u(t) − u(t − p), where p is the period when the infected people spread the virus in the most intensive way. assuming that u(t) = t^α: du(t)/dt = α t^(α−1), and the leading term of u(t) − u(t−p) is pα t^(α−1) = pα u(t)/t, i.e. it is essentially proportional to du(t)/dt. however, if α > 1, there will be other terms in the expansion of u(t) − u(t−p), and the proportionality with u(t)/t can hold only if either the virus transmission strength diminishes over time or we reduce our contacts over time when the total number of cases grows faster than linearly.
with covid- , we attribute it mostly to the latter, i.e. to the reduction of the contacts over time of infected individuals with the rest of the population, that is, to behavioral and sociological factors. this is different from other power laws for infectious diseases; compare e.g. with [ ]. generally, if someone wants to "see" the trend of the epidemic using only the total number of infections to date, then u(t)/t is the best way of course. this is qualitative, but the corresponding differential equation du(t)/dt = c u(t)/t immediately gives u(t) = C t^c for some constant C, i.e. it results in power growth. this can be really seen for covid- and other epidemics before the active management begins. the coefficient c is approximately c ≈ for covid- initially; when the management begins, c drops, and then the bessel functions describe the increase of u(t) till the "saturation", which is a technical end of the epidemic, though not its real end of course. some details. the full theory is presented in [ ]. if the number of new infections is proportional to the current number of those infected, then the exponential growth of the spread is granted. by analogy with news impact over time from [ ], and similar sociological-type processes, it is quite likely that the number of such contacts is proportional to the current total number of infected individuals to date divided by the time to date.
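as a quick numerical sanity check (an illustrative sketch with arbitrary placeholder values for C and c, not part of the paper), one can verify that the power law u(t) = C t^c satisfies du(t)/dt = c u(t)/t:

```python
# Check that u(t) = C * t**c solves du/dt = c * u(t) / t (central differences).
# C and c are arbitrary illustrative values.

def u(t, C=3.0, c=2.2):
    return C * t ** c

def max_residual(C=3.0, c=2.2, h=1e-6):
    """Largest |du/dt - c*u/t| over a grid of t values."""
    worst = 0.0
    for k in range(1, 100):
        t = 0.1 * k + 0.05
        dudt = (u(t + h, C, c) - u(t - h, C, c)) / (2 * h)  # central difference
        worst = max(worst, abs(dudt - c * u(t, C, c) / t))
    return worst
```

the residual stays at numerical-noise level for any C and c, confirming that the power law is the general solution of this equation.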
starting with the equation du(t)/dt = c u(t)/t for the total number of infections u(t), the coefficient of proportionality, the basic transmission rate c, is therefore a combination of the transmission strength of the virus and the "normal" frequency of the contacts in the infected place. there can be some other mechanisms for the "power law", including biological ones. self-isolation of infected species is common not only for humans (rabies being a notable exception); it can grow over time and when the intensity of the spread increases. another possible mechanism can be due to the replication processes of viruses, but this is well beyond this article. anyway, a sociological approach to the spread, which "explains" under some assumptions the power growth of the number of total cases, is quite natural in our work, because the active management of epidemics is clearly of a sociological nature, applicable only to humans. the coefficient c is one of the two main mathematical parameters of our theory. it can be practically seen as follows. before the prevention measures are implemented, it approximately reveals itself in the growth ∼ t^c of the total number of detected infections, where t is time. for covid- , c is around , which results in the quadratic growth of the total number of infections after a short period of its exponential growth. upon the active management of covid- , the growth quickly becomes t^(c/2), i.e. essentially linear, which is part of our theory in [ ]. this can be observed in many countries. ending epidemics. the "power law" is only a starting point of our analysis. the main problem is of course to "add" here some mechanisms ending epidemics and then preventing their possible recurrence. these are major challenges, biologically, psychologically, sociologically and mathematically. one can expect this megaproblem to be well beyond the power law itself, but we demonstrate that mathematically there is a path based on bessel functions.
the power growth alone obviously cannot lead to any saturation. we are of course fully aware of the statistical nature of the problem, but the formula for the growth of the total number of infections we propose works almost with the accuracy of fundamental physics laws. this is very surprising for such stochastic processes as epidemics. (this preprint is made available under a cc-by-nc-nd . international license; the copyright holder for this preprint, which was not certified by peer review, is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. this version was posted may , ; https://doi.org/ . / . . .) an important outcome of our modeling is that the measures of "hard type", like detecting and isolating infected people and closing the places where the spread is almost inevitable, are the key for ending an epidemic. moreover, such measures must be employed strictly proportionally to the current number of infections, not to its derivative of any kind, which is the most aggressive "momentum" way to react to it, the "hardest" possible, as we will explain now. for instance, if the number of infections doubled during some period, then the operation formula from [ ] requires that testing must be increased -fold. assuming that we approach the saturation, where the number of new infections becomes almost zero, the "hard way" is to keep testing at the same level as before, i.e. performing the same number of tests every day. this is in spite of an almost zero number of new infections. practically, this is not always the case. if we react to the average number instead of the absolute number of infections, then we are supposed to stop testing and other measures when there are no new "cases". this is the "soft way". if this is coupled with "soft" measures (see below), then mathematically the epidemic will never reach the saturation point. the u-formula. the intensity of "hard" measures will be denoted by a in this article.
upon composing the corresponding differential equations and integrating them we obtain the following. the function u(t), expressed in terms of the bessel function j_α(x) of the first kind, models the growth of the total number of infections, where c is the scaling parameter necessary to adjust this formula to the actual numbers. here and below, time t is normalized, days/ , for the number of days from the beginning of the period of intense growth of the epidemic. this function matches the total number of infections with high accuracy. the parameter a is . for the usa and uk; it can be about . for countries with a somewhat more proactive approach, like austria and israel, and it becomes . and even . for "the world" and sweden. the basic transmission rates are: c = . for the usa, c = . for the uk and sweden, and it is currently . for "the world". recall that c approximately corresponds to the growth ∼ t^c of total infections in the beginning of the epidemic, when no protection measures are in place. limitations. importantly, the function u(t) is proportional to t^(c/2+1) for t near 0, which is very close to ∼ t^c, assuming that c ≈ . so the phase of "parabolic growth" is generally covered well by the u-formula too, though formally this does not follow from the theory in [ ]. of course there can be other reasons for u(t) to "serve epidemics"; it is not impossible that there are connections to the replication process of viruses, but this we do not touch upon. as always, there are limitations, which we will address now. first and foremost, the available infection numbers are for the detected cases, which are mostly symptomatic. however, this is not of much concern for us. we understand managing epidemics sociologically.
those who are detected mostly have symptoms, but the management is mostly focused on them too. so all our arguments are fully applicable within this group; we just restrict ourselves to symptomatic cases. no assumptions on asymptomatic cases are necessary for the u-formula. we of course understand that when the number of new reported infections drops to zero, there can be many non-detected asymptomatic cases, which can potentially lead to the recurrence of the epidemic. such "saturation" is only a technical end of the epidemic. the second reservation concerns newly emerged clusters of infections and the countries where the spread is on the rise. the u-formula can be used in spite of such fluctuations, but it is then a statistical tool; see figure . the predictions must be regularly updated. the third reservation is related to the management of the epidemic. not all countries employ the protection measures in similar ways, but this is not a problem for us. the problem is if the intensity of the measures and the criteria are changed in some unpredictable ways. diminishing the "hard" measures too early, or even dropping them completely at later stages, is quite possible. last but not least is the data quality. changing the ways the data are collected and the criteria frequently makes such data useless for us. though, if the number of detected infections is underreported in some regular way, whatever the reasons, such data can generally be used. thus this requirement is sufficiently relaxed, but the data for several countries, not too many, are not suitable for the usage of our "forecasting tool". forecasting the spread. with these reservations, the first point of maximum t_top of u(t) is a good estimate for the duration of the epidemic, and the corresponding value of u is an estimate for the top value of the total number of infections. this gives an important "forecasting tool", assuming that "hard measures" are the key in practical management of the epidemic.
we note that the approximate reflection symmetry of du(t)/dt for u(t) in the range from t = 0 to t_top can be interpreted as farr's law of epidemics under aggressive management. generally, the portions of the corresponding graph before and after the turning point are supposed to be essentially symmetric to each other. this is not exactly true for du(t)/dt, but close enough. see figure and the other ones; the turning point is at max{du(t)/dt}. as any model, ours is based on various simplifications. we assume that the number of people susceptible to the virus is unlimited, i.e. we do not consider epidemics with the number of infections comparable with the whole population, and we do not consider herd immunity. also, we disregard the average duration of the disease and that of the quarantine periods imposed. the total number of infections regardless of the outcome is what we are going to model, which is commonly used. in spite of all these assumptions, the u-formula works unexpectedly well. we mention here a strong connection with behavioral finance; see [ ]. for instance, practically the same u(t) in terms of bessel functions as above models "profit taking" in stock markets. the polynomial growth of u(t) is parallel to the "power law" for share prices. there is a long history and many aspects of mathematical modeling of epidemic spread; see e.g. [ ] for a review. we restrict ourselves only to the dynamics of momentum managing epidemics, naturally mostly focusing on the middle stages, when our actions must be as precise as possible. the two basic modes we consider are essentially as follows: (a) aggressive enforcement of the measures of immediate impact, reacting to the current absolute numbers of infections, and equally aggressive reduction of these measures when these numbers decrease; (b) a more balanced and more defensive approach, when mathematically we react to the average numbers of infections to date and the employed measures are of a more indirect and palliative nature. hard and soft measures. the main examples of (a)-type measures are: prompt detection and isolation of infected people and of those at high risk of being infected, and closing places where the spread is the most likely. actually the primary measure here is testing; the number of tests is what we can really implement and control. the detection of infected people is its main purpose, but the number of tests is obviously not directly related to the number of detections, i.e. to the number of positive tests. the efficiency of testing requires solid priorities, focus on the groups with the main risks, and solving quite a few problems. even simply mentioning the problems with testing, detection and isolation is well beyond our article. however, numerically we can use the following. during epidemics, essentially during the stages of linear growth, which are the key for us, the number of positive tests can mostly be assumed to be a stable fraction of the total number of tests. this is demonstrated in figure . such proportionality can be seen approximately from march . however, at the end of this chart, testing was reduced below the levels required for mode (a); it must remain at least constant until the saturation of the number of total cases. measures like wearing protective masks, social distancing, recommended self-isolation, and restricting the size of events are typical for (b). this distinction heavily impacts the differential equations we obtained in [ ]. however, the main difference between the modes, (a) vs. (b), is as follows.
international license it is made available under a is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. the copyright holder for this preprint this version posted may , . hard measures are the key. generally, the (a)-type approach provides the fastest possible and "hard" response to the changes with the number of infections, whereas we somewhat postpone with our actions until the averages reach proper levels under (b), and the measures we implement are "softer". mathematically, the latter way is better protected against stochastic fluctuations, but (b) is slower and cannot alone lead to the termination of the epidemic, which we justify mathematically within our approach. the main objective of any managing epidemics is to quickly end them. however the excessive usage of hard measures can lead to the recurrence of the epidemic, some kind of "cost" of our aggressive interference in a natural process. this can be avoided only if we continue to stick to the prevention measures as much as possible even when the number of new infections goes down significantly. reducing them too much on the first signs of improvement is a way to the recurrence of the epidemic, which we see mathematically within our approach. some biological aspects. the viral fitness is an obvious component of the transmission rate c. its diminishing over time can be expected, but this is involved. this can happen because of the virus replication errors. the rna viruses, covid- included, replicate with fidelity that is close to error catastrophe. see e.g. [ ] for some review and predictions. such matters are well beyond this paper, but one biological aspect must be mentioned, concerning the asymptomatic cases. the viruses mutate at very high rates. they can "soften" over time to better coexist with the hosts, though fast and efficient spread is of course the "prime objective" of any virus. 
Such softening can result in an increase of asymptomatic cases, which are difficult to detect. So this can contribute to the diminishing of the c we observe, though not because of a decrease in the spread of the disease. We model the available (posted) numbers of total infections, which mostly reflect the symptomatic cases. To summarize, it is not impossible that replication errors and the "softening" of the virus may result in a diminishing c at later stages of the epidemic, but we think that the reduction of the contacts of infected people with others dominates here, which is directly linked to behavioral science, sociology, and psychology.

The formula for the growth. We will need the definition of the Bessel functions of the first kind: J_nu(x) = sum over m >= 0 of (-1)^m / (m! Gamma(m + nu + 1)) (x/2)^(2m + nu). The key point is that measures of type (a) have "ramified" consequences, in some contrast to mode (b). Namely, an isolated infected individual will not transmit the virus to many people, and the number of those protected due to this isolation grows over time. Combining this with our understanding of the power law of epidemics, we arrive at a differential equation for the total number of infections u(t).

There is a surprisingly good match between the total number of infections for COVID-19 in the USA and UK till April and our solutions u(t) above, from the moment when these numbers began to grow "significantly", approximately around March for the USA and UK. Epidemics are very stochastic processes, so such precision is surprising. The site https://ourworldindata.org/coronavirus is mostly used for the data, updated at : London time. We take x = days/ .

The USA data. The scaling coefficient . in the figure is adjusted to match the real numbers.
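The growth curve described above, a power of t times a Bessel function of the first kind with its "saturation" t_top at the first maximum, can be sketched numerically. The numerals are stripped from this copy, so the prefactor C, power p, order nu, and a below are all illustrative assumptions, not the paper's fitted values; the Bessel series is the standard definition.

```python
import math

def bessel_j(nu, x, terms=40):
    """Bessel function of the first kind J_nu(x), via its power series."""
    s = 0.0
    for m in range(terms):
        s += (-1) ** m / (math.factorial(m) * math.gamma(m + nu + 1)) \
             * (x / 2) ** (2 * m + nu)
    return s

def u(t, C=1.0, p=1.5, nu=0.5, a=0.2):
    """Illustrative growth curve C * t^p * J_nu(t*sqrt(a)); parameters are assumed."""
    return C * t ** p * bessel_j(nu, t * math.sqrt(a))

# Locate the first local maximum of u(t) on a grid: the model's "saturation" t_top.
ts = [0.1 * k for k in range(1, 400)]
vals = [u(t) for t in ts]
i_top = next(i for i in range(1, len(vals) - 1)
             if vals[i] > vals[i - 1] and vals[i] > vals[i + 1])
t_top, u_top = ts[i_top], vals[i_top]
# For these assumed parameters the first maximum lands near t ~ 4.5.
```

Everything after t_top is outside the Bessel regime, matching the text's remark that the post-saturation linear growth is not connected with Bessel functions.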
For the USA, we set y = infections/ K, and take March as the beginning of the period of "significant growth". The parameters are c = . , a = . . The red dots show the corresponding actual total numbers of infections. They match u(t) = . t . J . (t √ . ) in the figure almost perfectly, which results in the following: the number of cases in the USA can be expected to reach its preliminary saturation at t_top = . ( . days from / , i.e., May ) with u_top = . , i.e., with infections (it was at / ). This is of course a projection: about M total cases at the saturation point t_top near May . The black dots show the test period, till April .

The predictions are based on the assumption that the intensity of hard measures continues to be proportional to the total number of detected infections to date, as was clearly the case for the red dots. Jumps like the one at about x = . can occur for various reasons impossible to forecast, as can some period of linear growth after them. However, the general trend of the black dots matches our u(t) well enough.

Obviously, t = t_top cannot be the end of the epidemic. The data from South Korea and from other countries that went through the "saturation" demonstrate that a linear growth of the total number of cases can be expected around and after t_top, with periodic fluctuations. The obvious reasons are: (a) reducing the "hard" and "soft" measures, (b) no country is isolated from infections from other places, (c) continued testing can result in finding more asymptomatic cases, (d) new clusters of the disease are always possible. In any case, fluctuations closer to the "saturation" and after it are quite likely.
Resuming the corresponding measures is possible, but this is not always done. Our formula is just a forecasting tool, which is mostly applicable to the period when the protection measures, especially "hard" ones, are applied in a regular manner. Then the match with the real data is good.

COVID-19 in the UK. One reason for the strong match between the red dots and our u(t) could be that the USA consists of states, so the total number of infections is quite an average. However, the match is no worse for the UK. The data are from / till / ; add to our "red dots" the initial number at / , to match the actual total numbers. The black dots constitute the control period: / - / . Now c = . , a = . work fine, and the scaling coefficient is . . The total number of cases is divided by K, not by K as for the USA. The expectation is now as follows: the "saturation moment" can be . , i.e., about days after March , somewhere around May . The estimate for the corresponding number of infections is about , with all the ifs. This assumes that the "hard" measures will be employed at the same pace as before April . The black dots confirm the trend.

Sweden, / - / . This is an example of a country that remains essentially "open". Actually, they actively do testing followed by the isolation of infected people, the key "hard" measure from our perspective. Also, the strength of the health care in this country must be taken into consideration, as well as the fact that it is surrounded by countries that fight COVID-19 aggressively. The growth of the total number of cases was essentially quadratic for a relatively long period, which is what the "power law" states for epidemics with minimal "intervention".
By now, the growth is linear, which means that the measures they use appear to be working. Interestingly, our u-formula is applicable, but for a record-low a = . ; it is a = . for "the world". Here y is the total number of cases divided by ; add , the initial value, to our "dots". The projected saturation is around May . As always, this comes with a reservation about their future policies, which seem sufficiently stable by now. In spite of their "soft" approach, our u-formula obviously can be used; see the figure.

Israel: "saturation". The last example we provide shows what can be expected when a country has gone through the "saturation". The Israeli population is diverse, which has a potential for significant fluctuations in the number of cases and various clusters of infection. However, its solid response to COVID-19 and good overall health system made the growth of the spread sufficiently predictable. Generally, for small countries the fluctuations can be expected to be higher than for countries like the USA and UK, though it depends on how "homogeneous" they are. We divide the total number of cases by , as for Sweden; see the figure. The red dots begin March , when the total number of detected infections was , and stop April ; the remaining period till April , shown by the black dots, was the "control" one. The saturation forecast went through almost perfectly, but there were significant fluctuations in the process. After April , the predicted moment of the saturation, the growth of the total number of (known) infections is supposed to be mildly linear. The parameters are a = . , i.e., the intensity of hard measures is better than in the USA and UK, and c = . . The latter means that the initial transmission coefficient was somewhat worse than c =
in the USA and UK, possibly due to the greater number of "normal" contacts. Recall that a and c are parameters of our theory, related to but not immediately connected with the real factors.

Some discussion. Recall that we consider only "total cases", the numbers of all detected infections, and begin at the moment when the "significant" growth begins, which is also essentially when the active measures start. According to our theory, the match with u(t) above can be expected in some interval around the turning point. However, in practice it holds better than this: an almost perfect match is found for about % of the whole period of intensive growth. Our restriction to the period till April , with the parameters a, c fixed (except for Sweden), is not accidental. It was sufficient for the practical confirmations of the theory presented in the figures and with other countries. The latest data, the black dots, provide "real-time checks"; they were obtained after the parameters were fixed. The pandemic is far from over, but within the scope of this paper, the data needed for the "forecasting tool" were essentially present by / .

When the epidemic approaches the saturation, which is the first maximum of u(t) in our model, its management can be expected to evolve toward reducing and abandoning "hard" measures. This can result in a growth like Ct^(c/ ) with some "mild" c and c/ ≈ , and significant fluctuations, which really happens. Such growth at the end is generally covered by our theory; it is not connected with Bessel functions. The superb match of our u(t) during a significant part of the period of intensive growth of the spread can be of real importance for the practical management of epidemics.
On the basis of what we see, the best ways to use the u-curves seem to be as follows: ( ) determine a, c using roughly the first - % of the data after the intensive growth begins; ( ) update them constantly till the turning point and somewhat beyond; ( ) try to adjust the measures at later stages "to stay close to the curve". With ( ), a constant response is needed to new clusters of the disease, jumps in new cases due to reductions of the measures, and so on. Generally, our forecasting tool can serve best if the data and the measures are as uniform and "stable" as possible. Then underreporting, the focus on symptomatic cases, and the inevitable fluctuations in the data may not influence the match with the u-curves too much; this holds of course statistically and with the usual reservations. Also, ( ) testing the population and some "soft" measures must be continued well after the "saturation" to prevent the recurrence of the epidemic.

References:
- The mathematics of infectious diseases
- Periodicity in epidemiological models
- Power-law models for infectious disease spread
- Epidemic psychology: a model
- Modeling infectious disease dynamics
- Momentum managing epidemic spread and Bessel functions
- Artificial intelligence approach to momentum risk-taking
- Are RNA viruses candidate agents for the next global pandemic? A review
- A treatise on the theory of Bessel functions

Acknowledgements. I'd like to thank ETH-ITS for outstanding hospitality. My special thanks go to Giovanni Felder and Rahul Pandharipande. I also thank David Kazhdan very much for his valuable comments and suggestions. Funding: partially supported by NSF grant DMS- and the Simons Foundation.
key: cord- -refvewcm authors: kache, tom; mrowka, ralf title: how simulations may help us to understand the dynamics of covid‐ spread. – visualizing non‐intuitive behaviors of a pandemic (pansim.uni‐jena.de) date: - - journal: acta physiol (oxf) doi: . /apha. sha: doc_id: cord_uid: refvewcm

The new coronavirus SARS-CoV-2 is currently impacting life around the globe. The rapid spread of this viral disease can be highly challenging for health care systems, as was seen in northern Italy and in New York City, for example. Governments reacted with different measures, from the shutdown of all schools and universities up to a general curfew. All of these measures have a huge impact on the economy. The United Nations Secretary-General has stated recently: "The COVID-19 pandemic is one of the most dangerous challenges this world has faced in our lifetime. It is above all a human crisis with severe health and socio-economic consequences." According to the European mortality monitoring network EuroMOMO: "Pooled mortality estimates from the EuroMOMO network continue to show a markedly increased level of excess all-cause mortality overall for the participating European countries, coinciding with the current COVID-19 pandemic" (see figure).

There are many aspects that need to be considered when weighing measures such as lockdowns. One of the main problems is that aspects of the dynamical behavior are hard to grasp with thinking rooted in our everyday experience. Our understanding of cause and effect is mostly "linear thinking", i.e., a linear relationship between a potential influence and a result. An example would be: "if you buy twice as many apples, you have to pay twice the amount of money." This concept, however, does not work for exponential growth, which is governed by rules that are highly sensitive to the conditions of the underlying process.
In the context of a viral disease, we can consider the growth rate as the factor given by the number of affected individuals divided by the number on the previous day. For example, if you increase the growth rate in an ideal exponential scenario from . by % to . , you will get after only days more than twice, and after days more than times, the amount (see figure). In the case of COVID-19, the reproductive number has been estimated to be as large as . in one particular study. Exponential growth, however, could only occur in a situation with an infinite number of individuals who can contract the viral disease, which is obviously not the case. Exponential-like growth can be observed only at the beginning of the spread, when the number of people who do not have the disease is much larger than the number of infectious individuals.

In order to understand the spread and aspects of the dynamics of the disease, we can use a modeling approach. Here we explore an agent-based model in which dots, as surrogates for individuals, move around in a given space. The simulation can be found at https://pansim.uni-jena.de. Each of the dots has one of the following states: susceptible, infectious without symptoms, infectious with symptoms, recovered, and immune. Once infected, the dots go through the disease cycle described by these states. The dots move around in the space with a given "mobility". Once the dots become aware of the disease, they change their mobility.

The ultimate goal would be to eradicate the SARS-CoV-2 virus worldwide. This is unlikely to be achieved on a short-term scale due to its pandemic characteristic. When the pandemic runs through the population, a key figure is the maximum number of active cases. Since a certain percentage of the active cases will require intensive care, the maximum number of active cases determines whether or not the health care system runs at its limit. This article is protected by copyright. All rights reserved.
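The sensitivity of compounding growth described above can be made concrete. The daily growth factors below (1.15 vs. 1.20) are illustrative assumptions, since the actual values are not legible in this copy; the point is only that a small change in the factor compounds into a large gap in case counts.

```python
# Compounding makes case counts highly sensitive to the daily growth factor.
# The factors 1.15 and 1.20 are illustrative assumptions, not values from the text.
def cases(growth_factor, days, initial=1.0):
    """Ideal exponential scenario: cases multiply by a fixed factor each day."""
    return initial * growth_factor ** days

ratio_30 = cases(1.20, 30) / cases(1.15, 30)
print(f"after 30 days: {ratio_30:.1f}x more cases")  # roughly a 3.6x gap
```

A difference of about four percent in the per-day factor thus more than triples the case count within a month, which is exactly the "non-linear" intuition the article is after.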
In order to model different theoretical scenarios, we have implemented predefined parameter settings. The different scenarios described in the following are related to a "default" parameter setting that is used to compare the modifications.

Scenarios III and IV assume that part of the population has been in contact with the virus previously. In scenario III, a third is immune to the disease, whereas in scenario IV two thirds are. As expected, the maximum number of active cases is drastically reduced.

Scenario V, sick people remain mobile: if sick people remain mobile, the maximum number of active cases will increase. This suggests that keeping the number of contacts of an infectious person low is beneficial for slowing the spread.

Scenario VI, infection detected earlier: this reflects a situation where testing for the disease is available on a large scale to identify infected individuals. In the model, infected agents reduce their mobility earlier, and therefore the transmission rate is reduced.

Scenario VII, super mobile individuals: here we simulate the case that a given number of individuals have a much higher mobility than the rest of the simulated population. This high mobility is kept at all times in the simulation. The simulation shows much higher values for the maximum number of active cases.

Scenarios VIII and IX, lockdown with a threshold: these scenarios describe a population-wide lockdown based on a threshold of active cases at a given time. The lockdown measure reduces the maximum number of active cases in the model. If the lockdown is imposed at a lower threshold, the effect is stronger. A second value allows setting to which fraction of the lockdown threshold the number of active cases needs to drop in order to lift the lockdown.

As pointed out earlier, the behaviour of the spread is highly sensitive to the initial parameters. That is why it is difficult to predict the spread and hard to estimate when to impose drastic measures such as lockdowns.
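The threshold-triggered lockdown of scenarios VIII and IX can be sketched with a toy compartmental simulation. This is not Pansim's agent-based implementation: the SIR-style update rule and every rate and threshold below are assumed values, chosen only to show the trigger/lift hysteresis.

```python
# Hypothetical sketch of scenarios VIII/IX: a population-wide lockdown is
# imposed once active cases exceed a threshold and lifted once they fall
# below a set fraction of that threshold. All parameters are assumed values.
def run(threshold, lift_fraction=0.5, beta_open=0.4, beta_lock=0.05,
        gamma=0.1, days=400):
    s, i = 0.999, 0.001                    # susceptible / active fractions
    locked, lockdowns, peak = False, 0, 0.0
    for _ in range(days):
        if not locked and i > threshold:
            locked, lockdowns = True, lockdowns + 1
        elif locked and i < lift_fraction * threshold:
            locked = False                 # lift, possibly re-trigger later
        beta = beta_lock if locked else beta_open
        new_inf = beta * s * i
        s, i = s - new_inf, i + new_inf - gamma * i
        peak = max(peak, i)
    return lockdowns, peak

lockdowns_low, peak_low = run(threshold=0.02)    # stricter trigger
lockdowns_high, peak_high = run(threshold=0.10)  # laxer trigger
```

As in the scenarios, a lower threshold caps the peak of active cases more effectively, at the price of repeated lockdowns; the on/off cycling mirrors the second lockdown visible in the model's screenshot.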
There are, however, features of the model that are important to note and that give consistent results over a broad range of parameter sets. For example, if one reduces the infection probability, the speed of the spread is slowed down. This means that any measure that reduces this probability would reduce the maximum number of active cases and hence help to reduce the risk of pushing the health care system to its limit. A complicating issue with COVID-19 is that a substantial number of people show only mild or even no symptoms, hence remaining undetected while spreading the virus. Another problem in controlling the disease is that there are delays in the underlying biology. It is known that systems with time delays in the feedback loop may exhibit oscillatory behaviour. This is what we see in the model (figure) for the active cases.

Understanding the dynamical aspects of the disease is crucial for proper control. It is therefore important to educate the public about the unintuitive behavior of the spread of a disease like COVID-19. The agent-based model might be a good tool to communicate the epidemiological characteristics. The model makes it possible to explore the effects of the different parameters on the behaviour of the spread and on key outcomes such as the peak number of active cases or the total number of affected individuals. The input parameters can be changed interactively on the website. According to the model, one effective way to slow down the disease would be to reduce the contacts of infectious individuals. In this regard, it is discussed whether it might be helpful to trace contacts and to test for the disease in a targeted manner. The World Health Organization (WHO) has defined priorities for research related to COVID-19. Among those top priorities is "the natural history of the virus, its transmission and diagnosis". From an epidemiological point of view, the mode of transmission has a prominent role.
For example, it would have a tremendous impact on the prevention strategy if the virus were transmitted via food or drinking water, as E. coli bacteria or sapoviruses are. The model also shows that super mobile individuals who have contact with many others contribute to a faster spread if they are infectious.

It appears certain that the topic of COVID-19, with currently confirmed cases and total deaths worldwide (as of May), will engage our interest for the coming months. However, one should not forget that other health topics are also on the table. According to the latest numbers, "tobacco kills more than million people globally every year. More than million of these deaths are from direct tobacco use and around . million are due to non-smokers being exposed to secondhand smoke". This is all due to a behaviour that could be stopped immediately, at least in theory.

Agent-based model: the agent-based model (ABM) presented here is based on the idea of the classical SIR model, expanded by additional states. The SIR model describes the spread of an infectious disease in a population using three population state quantities: susceptible (S), infectious (I), and recovered (R). The dynamics of the SIR model are governed by transition rates between the states, which can be explicitly modeled by a system of differential equations. Pansim, on the other hand, simulates a population of entities (agents) that behave individually according to simple rules in a simulation space. In contrast to a mathematically formulated model, such rules can be easily understood and yet can produce similar outcomes despite the stochastic nature of ABMs. A simple ABM can approximate the SIR model using three instructions that each agent obeys: i.) move through the simulated space; ii.) if a susceptible agent is near an infected agent, change state from susceptible to infected; iii.)
if an agent is infected and a certain time has passed, change the state to recovered (see figure). Behavior rules can be programmed to reflect the individual behavior of agents even more realistically. This makes it possible to produce complex interactions between agents, which may lead to unexpected, emergent behavior arising from those simple rules.

Technical implementation: Pansim was written in JavaScript using the p5.js and plotly.js libraries. p5.js is a graphical library that allows for the simple creation of visual elements in JavaScript. Built-in functions of p5.js were used to create graphical representations of the agents within the Pansim simulation. p5.js is built around the 'draw()' main loop function, which is called periodically and is used to progress the simulation and the graphical representation at a specified frame rate. plotly.js has been employed for generating live-updating plots of the data produced during the model run. The Pansim model uses object-oriented programming to produce individual agents with the 'person' object. 'person' describes the states agents can assume (susceptible, infected, recovered, immune), how they behave, and how agents are graphically represented. The 'town' object takes user input, builds a corresponding population of agents, and passes down user-defined behavior parameters to the agents. Thus, users of Pansim can change simulation parameters as desired. After the model parameters are set, the simulation starts and updates the agents' positions using a random walk by iteratively calling p5.js's 'draw()' function. After each update interval, the agents check whether they are in contact with an infected agent, i.e., whether the distance between any two agents is less than their radius, and change their status accordingly. The probability of infection upon contact can be specified, so that not every contact leads to an infection.
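The three agent rules i)-iii) described above can be condensed into a minimal, stdlib-only sketch. This is not Pansim's JavaScript code: the population size, contact radius, step size, infection probability, and recovery time below are all assumed values for illustration.

```python
import random

# Minimal agent-based approximation of the SIR model, following the three
# rules in the text: agents move randomly (i), susceptibles near an infected
# agent may become infected (ii), and infected agents recover after a fixed
# time (iii). All numeric parameters are illustrative assumptions.
random.seed(42)

N, RADIUS, STEP, RECOVERY_TIME, P_INFECT = 200, 0.03, 0.02, 30, 0.8

agents = [{"x": random.random(), "y": random.random(),
           "state": "S", "timer": 0} for _ in range(N)]
agents[0]["state"] = "I"  # index case

def step(agents):
    for a in agents:                                  # rule i: random walk
        a["x"] = min(1.0, max(0.0, a["x"] + random.uniform(-STEP, STEP)))
        a["y"] = min(1.0, max(0.0, a["y"] + random.uniform(-STEP, STEP)))
    infected = [a for a in agents if a["state"] == "I"]
    for a in agents:                                  # rule ii: contact infection
        if a["state"] == "S":
            for b in infected:
                if ((a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2
                        < RADIUS ** 2 and random.random() < P_INFECT):
                    a["state"] = "I"
                    break
    for a in agents:                                  # rule iii: timed recovery
        if a["state"] == "I":
            a["timer"] += 1
            if a["timer"] > RECOVERY_TIME:
                a["state"] = "R"

for _ in range(300):
    step(agents)

counts = {s: sum(a["state"] == s for a in agents) for s in "SIR"}
```

Repeated runs with different seeds reproduce the stochastic variability the text mentions: the same rules can give anything from a quick die-out to a near-complete outbreak.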
If an agent is infected, an internal timer is incremented after each update interval. An infected agent progresses from an infectious but asymptomatic time interval (time infectious without symptoms, TIWOS) to an infectious time interval with symptoms (TIWS). Agents can change their mobility during the symptomatic time interval. The user can set how many simulation updates the TIWOS and TIWS phases should last. In addition to the three classic states from the SIR model, a population of super mobile agents can be included. Those agents have three times the mobility of susceptible agents throughout the model run, regardless of their infection state. The user has the option to introduce a population-wide lockdown by specifying a threshold of symptomatic agents. In that case, all agents except super mobile agents reduce their mobility. The number of susceptible (grey), symptomatic (red), and total infected agents (black) is counted after each update interval and displayed in a live-updating plotly.js graph. The fraction of the population that got infected and the maximum fraction of active cases are displayed in the control panel of Pansim to allow users to compare different simulation outcomes.

Conflict of interest: none.

Figure: exponential growth is highly sensitive to its parameters. In this example, the growth rate of . (blue) has been increased by only % (red). Figure: behaviour of the model with a lockdown criterion based on a threshold (screenshot from https://pansim.uni-jena.de). In this setting a second lockdown was required. Notice that after imposing the lockdown the number is still increasing, in this case by a factor of two to three. This overshoot is much smaller in the second lockdown phase. This is due to the fact that a considerable fraction got infected and moved to the recovered/immune stage. If, however, this fraction is small, a steep increase would be possible again.

References:
- SARS-CoV-2: what do we know so far?
- Acta Physiologica team: coronavirus: the first three months as it happened
- Statement by the Secretary-General on COVID-19
- Preliminary prediction of the basic reproduction number of the Wuhan novel coronavirus
- Centre for Mathematical Modelling of Infectious Diseases COVID-19 working group: early dynamics of transmission and control of COVID-19: a mathematical modelling study
- Estimating the asymptomatic proportion of coronavirus disease (COVID-19) cases on board the Diamond Princess cruise ship
- Death by delay
- South Korea is reporting intimate details of COVID-19 cases: has it helped?
- WHO news release: world experts and funders set priorities for COVID-19 research
- Two drinking water outbreaks caused by wastewater intrusion including sapovirus in Finland
- WHO statement: tobacco use and COVID-19
- An open-data-driven agent-based model to simulate infectious disease outbreaks
- Brief introductory guide to agent-based modeling and an illustration from urban health research
- Agent-based modeling in public health: current applications and future directions
- Processing simplicity times JavaScript flexibility (www document)

key: cord- -qx kvn u authors: zhu, hongjun; li, yan; jin, xuelian; huang, jiangping; liu, xin; qian, ying; tan, jindong title: transmission dynamics and control methodology of covid- : a modeling study date: - - journal: appl math model doi: . /j.apm. . . sha: doc_id: cord_uid: qx kvn u

The coronavirus disease (COVID-19) has grown into a pandemic within a short span of time. To investigate transmission dynamics and then determine a control methodology, we took the epidemic in Wuhan as a study case. Unfortunately, to the best of our knowledge, the existing models are based on the common assumption that the total population follows a homogeneous spatial distribution, which is not the case for prevalence occurring both in the community and in hospital, due to the difference in the contact rate.
To solve this problem, we propose a novel epidemic model called SEIR-HC, a model with two different social circles (i.e., individuals in hospital and in the community). Using the model alongside an exclusive optimization algorithm, the spread process of the COVID-19 epidemic in Wuhan city is reproduced, and the propagation characteristics and unknown data are then estimated. The basic reproduction number of COVID-19 is estimated to be . , which is far higher than that of the severe acute respiratory syndrome (SARS). Furthermore, the control measures implemented in Wuhan are assessed, and the control methodology of COVID-19 is discussed to provide guidance for limiting the epidemic spread.

A total of , confirmed cases had been reported worldwide as of : CET on March . The statistical data show that the COVID-19 outbreak constitutes an enormous threat to humans across the world. Many factors, including human connectivity, urbanization, and international travel, pose difficulties for the prevention and control of COVID-19 [ ]. Fortunately, mathematical models offer valuable tools for understanding epidemiological patterns and for decision-making in global health. However, modeling the transmission dynamics of an epidemic is still a challenging task, since the usefulness of a mathematical model depends on the existence of its solution [ ]. A prime difficulty is obtaining reliable data, as the available data are often patchy, delayed, and even wrong [ ][ ][ ][ ]. The second cause is that the classical epidemic models describe epidemic transmission in the absence of interventions, which scarcely occurs in the real world. Daily activities and travel, which are tightly linked to the spread of infection, tend to make the case more complex. The last but not least reason is that there are many incidences, which cannot be ignored, that happened in hospital, where the contact rate is entirely different from that in the community.
In this case, the general epidemic models do not work. An additional problem encountered here is the estimation of the parameters of the epidemic model [ ]. In order to probe the propagation characteristics and transmission mechanism, several attempts have been made to model the transmission dynamics of COVID-19 using observations of the index case in Wuhan [ ][ ][ ][ ]. Unfortunately, these models fail to consider the difference in contact rate between communities and hospitals. This may limit the accuracy of the proposed models and hence the reliability of the results; indeed, the results are very different from one another and even contradictory. Most of them indicate that the basic reproduction number of COVID-19 is lower than that of SARS. On the contrary, the number of individuals infected by COVID-19 to date has been far higher than the number infected by SARS [ ]. Moreover, there is no comparison of model output with real-world observations, though such a comparison is necessary to establish the performance of the models.

To resolve this problem, this paper presents a novel epidemic model based on the standard assumption that the population is divided into susceptible (S), exposed (E), infectious (I), and recovered (R) groups [ ]. The model proposed here, referred to as SEIR-HC for simplicity, extends the susceptible-exposed-infectious-removed (SEIR) model to handle epidemic transmission in two different but coupled social circles. Furthermore, a two-step optimization is built for the parameter estimation. With the limited data, the transmission process of COVID-19 in Wuhan city is reproduced by the proposed model. Then, the propagation characteristics and unreported data are estimated, and the end time of COVID-19 is also predicted. In the end, the control measures implemented in Wuhan are assessed, and the control methodology of COVID-19 is discussed to provide guidance on stopping the COVID-19 outbreak.
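The core idea behind SEIR-HC, two coupled circles with different contact rates, can be sketched with a toy two-group SEIR. These equations and every parameter value below are illustrative assumptions, not the authors' actual SEIR-HC formulation or fitted Wuhan parameters; the only point carried over is that within-hospital transmission is taken much stronger than community transmission.

```python
# Toy two-group SEIR sketch of the SEIR-HC idea: community (index 0) and
# hospital (index 1) are coupled circles with different contact rates.
# All equations and parameter values are illustrative assumptions.
def simulate(days=300, dt=0.05):
    beta = [[0.4, 0.05],   # community <- community, community <- hospital
            [0.1, 0.9]]    # hospital  <- community, hospital  <- hospital
    sigma, gamma = 1 / 5.2, 1 / 10.0   # 1/incubation, 1/infectious period
    S = [0.99, 0.01]                   # hospital circle: 1% of the population
    E, I, R = [0.0, 0.0], [1e-4, 0.0], [0.0, 0.0]
    peak_I = [0.0, 0.0]
    for _ in range(int(days / dt)):
        force = [sum(beta[g][h] * I[h] for h in (0, 1)) for g in (0, 1)]
        for g in (0, 1):
            new_E = force[g] * S[g] * dt   # infections in circle g
            new_I = sigma * E[g] * dt      # end of latency
            new_R = gamma * I[g] * dt      # removal
            S[g] -= new_E
            E[g] += new_E - new_I
            I[g] += new_I - new_R
            R[g] += new_R
            peak_I[g] = max(peak_I[g], I[g])
    return S, E, I, R, peak_I

S, E, I, R, peak_I = simulate()
```

Even in this crude form, a community outbreak seeds the hospital circle through the cross terms of the contact matrix, which is the heterogeneity a single homogeneous SEIR model cannot represent.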
The remainder of the paper is structured as follows: Section introduces previous work; Section defines the related terminology; Section explains the SEIR-HC model in detail; Section describes the two-step optimization for parameter estimation; Section presents the analysis results; and finally, Section states the conclusions.

The transmission dynamics of epidemic diseases have long been a topic of active research. In , Farr [ ] mathematically described the cattle-plague epidemic by means of a curve-fitting method. In , P. D. En'ko presented a discrete-time model in which the population consists of infectious individuals and susceptible individuals [ ]. In , Hamer [ ] proposed that the incidence depends on the product of the densities of susceptibles and infectives. The susceptible-infectious-removed (SIR) model was formulated by Martini in [ , ]. In , Kermack and McKendrick [ ] investigated the SIR model in a homogeneous population using differential equations and discovered that the epidemic course would inevitably be terminated once the density of the population becomes smaller than a critical threshold. Stone et al. [ ] analyzed the rationale of the pulse vaccination strategy with the SIR model.

The SIR model is well suited to describing the process of virus spread. However, it is not always congruent with the epidemic course: some infections do not confer lasting immunity. In this case, one should resort to the SIS model, in which the infectious individual becomes susceptible again rather than recovered. The SIS model has many potential applications, such as modeling the spread of computer viruses [ ]. On the other hand, if the epidemic lasts for a very long time, births and deaths heavily affect the population size, so Hethcote [ ] introduced births and deaths into the deterministic SIS and SIR models. The SIS and SIR models, however, only hold for the case without a latent period, which is not true of many kinds of infectious diseases.
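The Kermack-McKendrick threshold result cited above, that an epidemic terminates once the susceptible density falls below a critical level, is easy to demonstrate numerically. The sketch below is a plain discrete-time SIR integration with assumed rates (not parameters from the paper); the critical density is gamma/beta.

```python
# Minimal discrete-time SIR integration illustrating the Kermack-McKendrick
# threshold: an epidemic only takes off while the susceptible density
# exceeds gamma/beta. Parameter values are illustrative assumptions.
def sir_final_size(s0, beta=0.3, gamma=0.1, i0=0.001, dt=0.1, days=1000):
    s, i, r = s0, i0, 0.0
    for _ in range(int(days / dt)):
        new_inf = beta * s * i * dt   # mass-action incidence
        new_rec = gamma * i * dt      # removal
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return r

threshold = 0.1 / 0.3                 # gamma / beta, about 0.33
big = sir_final_size(s0=0.9)          # well above threshold: large outbreak
small = sir_final_size(s0=0.2)        # below threshold: the epidemic dies out
```

Starting below the threshold, the infectious compartment only decays, so measures that push the effective susceptible density (or the contact rate) past this point end the outbreak; this is the mechanism behind herd immunity and pulse vaccination arguments.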
to alleviate this problem, cooke [ ] proposed an epidemic model in which infected susceptibles become infectious after a fixed time. in this model, the population is divided into four classes of individuals: susceptibles, exposed individuals, infectives and recovered individuals. such an epidemic model is known as the seir model. later, longini [ ] presented a general formulation of the discrete-time epidemic model with permanent immunity. lipsitch et al. [ ] estimated the infectiousness of sars and the likelihood of an outbreak with a modified seir model. krylova and earn [ ] found that the dynamics of seir models are less affected by the stage-duration distribution than those of sir models. indeed, a family of seir models is being used to support epidemic control, elimination, and eradication efforts. see [ ] and [ ] for an extensive review of classic epidemic models. during disease spread, a high degree of chance enters into the conditions under which fresh infections take place. therefore, statistical fluctuations should be taken into account for a more precise analysis [ ] . for this reason, bailey [ ] introduced probability distributions into the sis epidemic model. in the standard models, the incubation and infectious periods are typically assumed to be exponentially distributed, which makes the models sensitive to stochastic fluctuations [ , ] . however, being more robust, the weibull distribution was used by lipsitch et al. [ ] to investigate sars. for the stochastic case, even the simplest representations present difficulties in obtaining algebraic solutions [ ] . to address this issue, saunders [ ] constructed an approximate maximum likelihood estimator for the chain seir model by using the poisson approximation to the binomial distribution. recently, wu et al. [ ] provided an estimate of the size of the covid- epidemic based on the seir model.
assuming poisson-distributed daily time increments, read et al. [ ] fitted a deterministic seir model to the daily number of confirmed cases in chinese cities and cases reported in other countries/regions. zhao et al. [ ] estimated the number of unreported cases in mainland china based on the assumption that the initial growth phase followed an exponential growth pattern. to sum up, a number of models have been proposed to formulate the transmission process of epidemic diseases, which lays a good foundation for our work. unfortunately, none of these studies has addressed epidemic transmission under two different but coupled conditions. such a mathematical model is needed to explore, understand, predict and anticipate covid- , including changes caused by intervention. in many cases, a canonical representation is the starting point for obtaining a clear and concise exposition. for this reason, let us introduce the notation used throughout the paper before describing our work. fig. presents a pictorial display of the disease's natural history in order to make these terms easier to understand.
( ) latent period: time from exposure to onset of infectiousness, during which the infectious agent develops in the vector, and at the end of which the vector can infect susceptible individuals [ ] ;
( ) incubation period: time from exposure to first appearance of clinical symptoms of infection [ ] ;
( ) infectious period: period during which an infected person can transmit a pathogen to a susceptible [ ] ;
( ) length of stay: time from the day of admission to the hospital to the day of discharge, i.e., the number of days a patient stayed in a hospital for treatment [ ] ;
( ) serial interval: time from the onset of symptoms in an index case to the onset of symptoms in a subsequent case infected by the index patient [ ] ;
( ) susceptibles: individuals who may come into contact with infectious individuals but are still uninfected [ ] ;
( ) exposed individuals: individuals in the latent period, who are infected but not yet infectious [ ] ;
( ) infectives: individuals who are infectious in the sense that they are capable of transmitting the infection [ ] ;
( ) removed individuals: individuals removed by recovery or death; those who recover obtain permanent infection-acquired immunity [ , ] ;
( ) quarantined individuals: suspected or exposed individuals who are separated and monitored to see if they become sick [ ] ;
( ) isolated individuals: infectives who are separated and controlled to avoid disease transmission [ ] ;
( ) basic reproduction number: expected number of secondary infectious cases generated by an average infectious case in an entirely susceptible population [ ] ;
( ) contact rate: average number of individuals with which one infective has an adequate contact in unit time, where an adequate contact is an interaction that results in infection [ , ] ;
( ) incidence rate: rate of new infections [ ] , or, more precisely, the total number of exposed individuals who move into the infective class in unit time;
( ) death rate: the probability of death of a person per day on average [ ] .

based on the basic principle that if the model closely matches the real world, then the optimization algorithm will converge to the most reasonable solution, we build a new seir model with an intervention mechanism that takes two different social circles into consideration. for parameter estimation, a model of a complex physical situation inevitably involves a certain amount of simplification for real-world applications; however, a balance between simplicity and practicality is not straightforward to obtain. to resolve this dilemma, the generality of epidemics and the particularities of covid- are considered simultaneously in our work, and a set of assumptions is then determined carefully as follows.
( ) the population is homogeneous and uniformly mixed [ , , ] .
( ) recovered individuals are permanently immune, and newborn infants have temporary passive immunity to the infection [ ] .
( ) infectiousness remains constant during an infectious period [ ] .
( ) the natural disease-independent death rate is constant throughout the population [ ] .
( ) the disease-caused death rate is a time-independent constant.
( ) the latent period and the infectious period follow weibull distributions [ ] ; the weibull distribution is versatile and can take on the characteristics of other types of distribution.
( ) the contact rate is constant over the entire infectious period [ ] .
( ) the first index case was infected on dec , [ ] .
( ) travel behavior was not affected by the disease before the lockdown on jan , [ ] .

consider a time interval (t, t + h], where h represents the length between the time points at which measurements are taken; here h = day. for convenience, a variable x over the time interval (t, t + h] is represented as x(t).
then, the variables and parameters are denoted as follows:
size of the population at time t, that is, the total number of susceptibles, exposed individuals, infectives and removed individuals at time t;
number of inbound travellers per day in wuhan at time t;
number of outbound travellers per day in wuhan at time t;
contact rate in the community;
contact rate in hospitals;
incidence rate of the exposed individuals who were infected days ago, which follows the weibull distribution;
shape parameter of the weibull probability density function (pdf) of the incidence rate;
scale parameter of the weibull pdf of the incidence rate;
removal rate, by disease-caused death or recovery, of infectives who have been infectious for days, which follows the weibull distribution;
shape parameter of the weibull pdf of the removal rate;
scale parameter of the weibull pdf of the removal rate;
proportion of hospitalized infectives to the total number of infectives;
proportion of quarantined susceptibles to the total number of susceptibles;
maximum of the latent period;
maximum of the infectious period;
disease-independent death rate.

for notational convenience, the index c is used to denote community and h hospital in the following expressions. variables without the indices c and h are applicable in both cases. as in the classical seir model, the population is roughly classified into four classes: susceptible, exposed, infectious, and recovered individuals. in our work, exposed and infectious individuals additionally fall into a series of groups according to disease progression, so that the weibull distribution, which has high generalization capability [ ] , can accurately match the number of individuals and their duration times. in order to accommodate the quarantine and isolation measures, and to take the infectivity difference between hospital and community into consideration, the standard seir structure is modified as shown in fig. .
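The Weibull-distributed incidence and removal rates above are used in a daily-discretised form. The following sketch (function names and numerical values are illustrative assumptions, not the paper's estimates) shows one way to turn a Weibull shape/scale pair into per-day transition probabilities:

```python
import math

def weibull_cdf(t, shape, scale):
    """cumulative distribution function of the weibull distribution."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))

def daily_weights(shape, scale, max_days):
    """probability that a transition (becoming infectious, or being removed)
    happens on day d, for d = 1..max_days: the weibull cdf is discretised
    into one-day bins and renormalised over the truncated support."""
    w = [weibull_cdf(d, shape, scale) - weibull_cdf(d - 1, shape, scale)
         for d in range(1, max_days + 1)]
    total = sum(w)
    return [x / total for x in w]

# illustrative values only: shape 2.0, scale chosen so the mean is ~5 days,
# truncated at a 14-day maximum period
weights = daily_weights(2.0, 5.64, 14)
print(len(weights), round(sum(weights), 6))  # 14 1.0
```

The renormalisation over the truncated support ensures that every exposed (or infectious) individual leaves the compartment within the assumed maximum period.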
note that the size of the population varies with the control measures and is also affected by inbound and outbound travellers. for the individuals in hospital and in the community, respectively, disease transmission differs in infection pattern but shares the common nature of the virus. for this reason, the two populations are considered simultaneously and analyzed separately for the sake of accuracy. with the reservations mentioned in section . , we used the seir-hc model to simulate the epidemic process in wuhan. from the rules of node dissemination, the dynamic transfer equations of the seir-hc model are stated as follows. for the individuals in the community, writing s(t), i(t) and n(t) for the numbers of susceptibles and infectives and the population size, and given the community contact rate, s(t)/n(t) determines the average number of susceptibles with which each infective has adequate contact in unit time, and thus the product of the contact rate, i(t) and s(t)/n(t) is the total number of susceptibles contacted by the i(t) infectives in unit time (i.e., the total number of new infections in unit time). notice that the population of wuhan city was suddenly reduced from million to million before the lockdown on the morning of jan , . on that day, the preventive and control measures for category a infectious diseases were implemented to fight against covid- . therefore, we assume that the state variables decrease proportionately and that the number of susceptibles was reduced accordingly, owing to the restrictions on outdoor activities imposed the same day. we further assume that the disease-independent death rate is . × − , the same as that reported by the wuhan government in december [ ] . according to the data presented by wu et al. [ ] , the daily numbers of inbound and outbound travellers are set to , , before jan , to , , from jan to jan , and to , afterwards, respectively. for the individuals in hospital, according to the data provided by li et al. (in fig. ) , there were no more than new cases every day from dec to dec . at the same time, the basic reproduction number of covid- in wuhan must be more than , or else the outbreak would have been impossible [ , ] .
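The community infection term described above — contact rate times the number of infectives times the susceptible fraction — can be sketched in a few lines (symbol names are mine; the paper's own notation is not recoverable from the extracted text):

```python
def new_infections(contact_rate, susceptibles, infectives, population):
    """expected number of new community infections in one day: each of the
    infectives makes contact_rate adequate contacts, and a fraction
    susceptibles/population of those contacts is with a susceptible."""
    return contact_rate * infectives * susceptibles / population

# illustrative numbers: 1000 infectives in a city of 10 million,
# 9 million of whom are still susceptible
print(new_infections(0.5, 9_000_000, 1000, 10_000_000))  # 450.0
```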
in addition, a susceptible can be infected within seconds of standing next to an infective. for these reasons, it is highly probable that the parameter z, which reflects the force of infection of the huanan seafood market, is small; hence z is set to a small fixed value. as defined in equation ( ) , the number of susceptibles in hospital is equal to the number of infectives staying in hospital. in fact, the corresponding sum in equation ( ) is substituted by the number of hospitalized patients in wuhan city, as reported by the wuhan municipal health commission (wmhc) [ ] and the health commission of hubei province (hchp) [ ] . in addition, the evaluation of equations ( ) and ( ) is a time-consuming process, so they are reformulated. it is readily seen from equation ( ) that the exposed individuals can be classified as belonging to one of several groups, and the sizes of these groups change every day; the update formula follows accordingly.

in this section, we propose a solution based on constrained optimization to estimate the parameters of the seir-hc model described above. the epidemic model is devised to estimate the unobserved variables and to predict the transmission process, and its output should be close to reality; unfortunately, this is difficult to verify for lack of credible observations. to alleviate this problem, reasonable data are considered as an alternative. the data collection is completed by integrating multiple data sources:
( ) the number of new cases every day from dec to jan , provided by li et al. (in fig. ) ;
( ) the number of new cases every day from feb to mar , reported by wmhc [ ] and hchp [ ] ;
( ) the number of infectives among the nationals who returned to america [ ] , japan [ ] , south korea [ ] and singapore [ ] from jan to jan (as tabulated in table i) ;
( ) the numbers of discharged and dead patients reported by wmhc [ ] and hchp [ ] , the sum of which is theoretically equal to the number of individuals removed by both recovery and death.
note that several of these numbers were corrected for contradictions between the cumulative and new case counts. based on the above data, the objective function consists of five error sums of squares, each of the form ∑ ( community value + hospital value − recorded value )² , where the sign (^) denotes a recorded value. here, the relevant recorded sum indicates the total number of infectious hospital staff members by feb . statistically, the nationals evacuated from wuhan city can be regarded as a sample of the community population. the proportion of cases confirmed among them from jan to jan is approximately equal to the corresponding proportion in the community on jan ; in this sense, the number of infectives in wuhan city on jan is likely to be about , and the corresponding recorded value is set accordingly. parameter estimation is, in general, a complex task, partly because of the many unknown parameters. correct convergence is hard to reach unless a reliable initial guess is provided; therefore, an optimal initial estimate based on prior knowledge is crucial in this case. for this reason, all the parameters are dichotomized into two classes: the first class contains the parameters determined by intervention together with the contact rates, which reflect the combined effect of the virus and control, and the second class contains the parameters related to the characteristics of the virus. to estimate the parameters of the first class, we first investigated the implementation of control measures in wuhan city. from the report of hchp [ ] , we found that the absolute increment in the number of confirmed infectives was more than thousand on feb , compared to about one thousand the day before. this implies that many patients likely failed to be hospitalized before feb . taking into account the strict restrictions on outdoor activities imposed before, the two proportions are set at . and . . the permanent resident population of wuhan city is about million [ ] . moreover, wuhan is well known for being a transport hub of china.
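The objective function described above — a sum of per-stream error sums of squares between model output and records — can be sketched as follows (a minimal illustration; function names are mine, and days without a record are represented by None):

```python
def sse(model_series, recorded_series):
    """error sum of squares between one model output series and the
    corresponding records; days without a record (None) are skipped."""
    return sum((m - r) ** 2
               for m, r in zip(model_series, recorded_series)
               if r is not None)

def objective(model_outputs, records):
    """total objective: the sum of the per-stream error sums of squares
    (daily new cases, hospitalised patients, removed individuals, ...)."""
    return sum(sse(m, r) for m, r in zip(model_outputs, records))

print(sse([1.0, 2.0, 3.0], [1.0, None, 5.0]))  # 4.0
```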
the community contact rate was set accordingly for infectiousness in the community. on the other hand, infections would ideally hardly spread in hospital, but this was not the case owing to the inadequacy of medical supplies; according to the report of hchp, large quantities of medical supplies were distributed only after feb . taking these points into consideration, the initial hospital contact rate is set equal to . , and it is assumed to decrease by % after feb . with regard to class two, the parameters are related to the weibull pdfs of the latent period and the infectious period. according to the medical records of patients at zhongnan hospital of wuhan university, the median hospital stay is days [ ] . li et al. [ ] reported that the mean incubation period is . days. assuming that the latent period and the infectious period are approximately equal to the incubation period and the length of stay, respectively, the mean values of the latent period and the infectious period are preliminarily estimated to be and days. the profiles of the weibull pdf with various shape parameters are shown in fig. . as a result, the initial guesses of the shape and scale parameters are . , . , . and . , respectively. similarly, we determined the lower and upper bounds of the parameters. it is worth noting that shrinking the domain may not be a good strategy for global optimization, since it may block the way to the global optimum. global solutions are usually difficult to locate, but the situation may improve when constraints are added [ ] . consequently, an inequality constraint is defined to help the algorithm make good choices of search direction: based on the knowledge that isolation and quarantine are useful control measures [ ] , the numbers of exposed and infectious individuals in the presence of control efforts are necessarily no more than those in the absence of interventions.
in theory, as long as the mean values of the latent period and the infectious period are unchanged, the results are almost invariant in number. in this sense, for a given mean value, the second-class parameters only affect the shape of the seir-hc model. accordingly, we devised a two-step optimization based on the sequential quadratic programming (sqp) method. in the first step, the first-class parameters are estimated by the sqp method; the results are then taken as the initial guesses for the first class in the second step. in the second step, the second-class parameters are determined in the same way. the above process is iterated many times; the flow chart is presented in fig. . in brief, the complete process of the two-step parameter optimization can be divided into four steps:
( ) guess all the parameters and their lower and upper bounds as described in section . ;
( ) estimate the first-class parameters using the sqp method with the objective function presented in section . and the inequality constraint stated in this section;
( ) estimate the second-class parameters in the same manner, with the first-class parameters updated in step ( ) ;
( ) repeat the computing process from ( ) to ( ) until the bias is smaller than a given threshold value or the cycle index reaches .

fig. . weibull pdf with various shape parameters, where (a) the mean value is and (b) the mean value is .

after the two-step iterative optimization, all the parameters of the seir-hc model are determined for covid- in wuhan. the lower bounds, upper bounds, initial values and estimated results are summarized in table ii . all results fall within the range between the lower and upper limits; some are very close to the initial values, while others are far from the guesses.

fig. . flow chart of the complete two-step iterative optimization.

with the parameters estimated, it is easy to derive that the mean and variance of the latent period are . and . , and those of the infectious period are . and . .
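The two-step iterative scheme above can be sketched as follows; for illustration the SQP solver is replaced by a crude coordinate search and the epidemic objective by a toy quadratic, so only the alternating two-class structure is the paper's:

```python
def minimize_class(obj, params, idx, step=0.1, iters=200):
    """crude coordinate search over the parameter subset idx, standing in
    for the sqp solver used in the paper."""
    p = list(params)
    for _ in range(iters):
        for i in idx:
            for cand in (p[i] - step, p[i] + step):
                trial = list(p)
                trial[i] = cand
                if obj(trial) < obj(p):
                    p = trial
    return p

def two_step(obj, params, class_one, class_two, cycles=10):
    """alternate between the two parameter classes, mirroring the paper's
    two-step iterative scheme."""
    p = list(params)
    for _ in range(cycles):
        p = minimize_class(obj, p, class_one)  # step 1: intervention-related
        p = minimize_class(obj, p, class_two)  # step 2: virus-related
    return p

# toy quadratic objective with minimum at (1, 2, 3, 4)
obj = lambda p: sum((p[i] - (i + 1)) ** 2 for i in range(4))
est = two_step(obj, [0.0, 0.0, 0.0, 0.0], [0, 1], [2, 3])
print([round(x, 1) for x in est])  # converges close to [1.0, 2.0, 3.0, 4.0]
```

Alternating over parameter classes helps when the classes play distinct roles, as here, where one class is set by the interventions and the other by the virus itself.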
the weibull pdfs of the latent period and the infectious period are shown in fig. . it can be observed that a large proportion of exposed individuals become infectious within a short time and that most cases are mild. the cumulative probability of the latent period at days is up to . %, which well supports the -day period of active monitoring. the difference between the latent period and the incubation period is . days, and the difference between the infectious period and the length of stay is . days. it is worth noting that a fraction (= . %) of infectives, equivalent on average to . days per infective, still stayed in the community before feb . assuming that infectiousness is constant during the entire infectious period, the basic reproduction number is up to . when everyone is susceptible. the basic reproduction number estimated here is compared with other estimates in table iii . furthermore, even if an infective is hospitalized at the onset of clinical symptoms, he can still infect . individuals. as a result, an outbreak is inevitable in the absence of interventions, owing to the difference between the latent period and the incubation period. of course, this is not necessarily true for districts other than wuhan city, because the basic reproduction number varies with population density and social activity, besides the characteristics of the pathogen [ ] . with the seir-hc model proposed here, the transmission process of the covid- epidemic in wuhan city is reproduced as shown in fig. . in our baseline scenario, we estimate that the outbreak would be over before apr , and that the total number of infectives no longer increases after mar . at the same time, the total number of removed individuals would finally reach (as shown in fig. (a) ). among them, the number of hospital staff members would be up to , which is likely to be slightly more than the reported cases owing to asymptomatic infection.
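The cumulative-probability and reproduction-number arithmetic above can be checked in a few lines; the shape, scale, contact rate and mean infectious period used here are illustrative assumptions, not the paper's estimates:

```python
import math

def weibull_cdf(t, shape, scale):
    """weibull cumulative distribution function."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0

# fraction of exposed individuals that become infectious within a 14-day
# monitoring window, for a latent period with mean ~5 days
shape, scale = 2.0, 5.64
print(round(weibull_cdf(14, shape, scale), 4))

# with constant infectiousness, r0 = contact rate x mean infectious period
contact_rate, mean_infectious_period = 0.4, 7.0
print(round(contact_rate * mean_infectious_period, 2))  # 2.8
```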
the number of infected individuals in wuhan on jan is estimated to be , which is less than the count estimated by wu et al. [ ] . the number of infectives on jan is ( . % difference), which is much higher than the number of infected individuals estimated by nishiura et al. [ ] . the number of infectious hospital staff members on feb is ( . % difference). it is readily seen from fig. (b) , (c) and (d) that there is a sudden decrease in the number of exposed individuals on jan and on feb , which implies that the control measures launched had a conspicuous effect on the infection rate. though mathematical models of epidemic transmission seldom contrast model output with real-world observations [ ] , such a comparison is necessary to demonstrate the performance of the models and the validity of the results. for this reason, fig. provides a pictorial comparison of the estimated and reported data. the two curves show the same trend after feb ; before that day, however, the number of infectious individuals estimated here is far larger than that reported by wmhc and hchp. a probable reason for this is the underreporting of incidence before feb [ ] [ ] [ ] [ ] , which is also the main reason for the underestimation of the basic reproduction number. from fig. (a) , there seems to be a delay in the estimated number of removed individuals relative to the reported one. the underlying cause of the delay may be that the discharge time is later than the end of the infectious period, owing to functional recovery. in this sense, it appears appropriate that the outbreak in wuhan terminates later than the expected time. however, the warmer weather is helpful in preventing the virus from reproducing. given these points, the outbreak is likely to end as expected if the control measures are maintained as usual. additionally, it can also be observed from fig.
(a) that, in the worst case, up to infectives failed to be hospitalized on jan . to assess the control measures, a series of experiments was carried out using the seir-hc model. the control measures are simulated through the first-class parameters, and their effect is captured by the number of infectives. since the proportion of quarantined susceptibles to the total number of susceptibles reflects the control level, the function of quarantine is tested by varying this proportion. from the seir-hc model, we can see that it impacts primarily on community infection, so only the ic-t and (ic+rc)-t graphs are shown here owing to space limitations. fig. (a) shows the variation as the proportion increases from . to . , and fig. (b) shows the corresponding cumulative value. fig. (c) and (d) display the results with a delay of ~ days. it is clear that the number of infectives increases dramatically with the proportion and the delay time. as a result, the control measures played a key role in preventing the spread. it can also be seen from fig. that the resulting divergence begins after jan . therefore, jan , on which the measures were imposed, was the right time to stop the outbreak. given the proportion of hospitalized infectives to the total number of infectives, its complement is equivalent to the ratio of the average time during which infectives stay in the community to the average infectious period, which reflects how quickly infectives are hospitalized. in this section, the effect of the time of hospital admission is tested by changing this parameter. fig. (a) shows how the number of infectives in the community varies with it, and fig. (b) is the corresponding cumulative value. fig. (c) and (d) depict the case in hospital. delayed hospital admission increases the number of infectives both in the community and in hospital. however, the increase in the number of infectives slows down when the parameter is small enough, owing to the depletion of susceptibles in the population. it can be observed from fig.
(b) that, as a result of the large number of inbound travellers every day, the cumulative total of infectives can even exceed the permanent resident population. in fact, this is almost impossible, because the implicit assumption that travel behavior was not affected by the epidemic is then no longer valid. the contact rate is primarily determined by the nature of the pathogen; in fact, it can also be changed to some degree by intervention strategies [ ] . when the community contact rate is varied, the number of infectives in the community as a function of it is demonstrated in fig. . a clear result is that the number of infectives in the community increases exponentially with the community contact rate. note that, following the same idea mentioned in section . . , only the number of infectives in the community is shown here. the hospital contact rate partly reflects the prevention level in hospital; from fig. , the number of infectives in hospital increases sharply with it. according to the experimental results, the control methodology is easy to identify. the basic principle is to take measures as early as possible to lower the unquarantined proportion and the contact rates, and to enhance the hospitalization proportion. the control measures for covid- are listed as follows:
( ) keeping people in home quarantine and reducing travel, to lower the unquarantined proportion;
( ) tracing, testing and quarantining suspected cases, immediately isolating symptomatic individuals, and speeding up hospital admission, to enhance the hospitalization proportion;
( ) strengthening personal protection, to lessen the contact rates in the community and in hospital.

this work provides the seir-hc model, a novel seir model with two different social circles. with an appropriate parameter setting, the seir-hc model degrades to the standard seir model. for exploring the transmission dynamics of covid- , a two-step optimization method was designed exclusively for parameter estimation of the seir-hc model. with the model, the spread process of covid- is reproduced clearly even without sufficient observation data. the latent period, infectious period and basic reproduction number of covid- are estimated to be . , . and . , respectively.
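The exponential sensitivity to the contact rate observed in the experiments above follows from the early-phase linearisation of such models, in which the number of infectives grows roughly like exp((contact rate − removal rate)·t). A toy check (all numbers are illustrative assumptions):

```python
import math

def early_cases(contact_rate, removal_rate, days, i0=1.0):
    """early-phase approximation: while susceptibles are plentiful, the
    number of infectives grows like i0 * exp((contact_rate - removal_rate) * t)."""
    return i0 * math.exp((contact_rate - removal_rate) * days)

# a modest change in the contact rate compounds into a several-fold change
# in case counts after a month
removal = 1 / 7  # mean infectious period of ~7 days (assumed)
print(round(early_cases(0.45, removal, 30) / early_cases(0.40, removal, 30), 2))  # 4.48
```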
obviously, covid- is highly transmissible. furthermore, the outbreak in wuhan is anticipated to be over before apr , and the total number of removed individuals would finally reach . among them, the number of hospital staff members would be up to . according to the seir-hc model, the principle of prevention and control of covid- is to take measures as early as possible to lower the unquarantined proportion and the contact rates, and to enhance the hospitalization proportion. a set of measures such as quarantine has a significant impact on lessening the spread. moreover, an international effort is required to prevent virus transmission, since covid- has spread all over the world. as a whole, the conclusions are well interpretable and reasonable. as evidenced by the success in estimation and prediction, the seir-hc model is useful. although the results are based on data from wuhan and hence are not necessarily reliable for other cities, the seir-hc model itself is valid everywhere, which allows us to capitalize on new data streams, leading to an ever-greater ability to generate robust insight and to collectively shape successful local and global public health policy.

references (titles as extracted):
modeling infectious disease dynamics in the complex landscape of global health
generalization of epidemic theory: an application to the transmission of ideas
the rate of underascertainment of novel coronavirus ( -ncov) infection: estimation using japanese passengers data on evacuation flights
early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia
estimating the unreported number of novel coronavirus ( -ncov) cases in china in the first half of january : a data-driven modelling analysis of the early outbreak
reporting, epidemic growth, and reproduction numbers for the novel coronavirus ( -ncov) epidemic
severe acute respiratory syndrome coronavirus (sars-cov- ) and corona virus disease- (covid- ): the epidemic and the challenges
estimation of the transmission risk of the -ncov and its implication for public health interventions
novel coronavirus -ncov: early estimation of epidemiological parameters and epidemic predictions. medrxiv
nowcasting and forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study
impact of severe acute respiratory syndrome (sars) on travel and population mobility
disease extinction and community size: modeling the persistence of measles
on the cattle plague
the first epidemic model: a historical note
epidemic disease in england: the evidence of variability and of persistency of type
an age-structured model of pre- and post-vaccination measles transmission
effects of the infectious period distribution on predicted transitions in childhood disease dynamics
a contribution to the mathematical theory of epidemics
theoretical examination of pulse vaccination policy in the sir epidemic model. mathematical and computer modelling
the n-intertwined sis epidemic network model. computing
asymptotic behavior and stability in epidemic models
stability analysis for a vector disease model
the generalized discrete-time epidemic model with immunity: a synthesis
transmission dynamics and control of severe acute respiratory syndrome
the mathematics of infectious diseases
some evolutionary stochastic processes
a simple stochastic epidemic
analysis of an seirs epidemic model with two delays
an approximate maximum likelihood estimator for chain binomial models
the impact of methicillin resistance in staphylococcus aureus bacteremia on patient outcomes: mortality, length of stay, and hospital charges
global dynamics of a seir model with varying total population size
global stability for the seir model in epidemiology
infinite subharmonic bifurcation in an seir epidemic model
gaussian class multivariate weibull distributions: theory and applications in fading channels
transmission dynamics of the etiological agent of sars in hong kong: impact of public health interventions
notification on pneumonia of the new coronavirus infection reported by wuhan health committee
bulletin of hubei provincial health committee on pneumonia caused by novel coronavirus
quarantine of evacuees at march air reserve base ends
two new cases of asymptomatic infection in japan were the third group of people evacuated from wuhan to japan
details: a seventh case of novel coronavirus infection has been confirmed in the republic of korea
past updates on covid- local situation
three departments: a number of measures to care for anti-epidemic frontline medical staff
clinical characteristics of hospitalized patients with novel coronavirus-infected pneumonia in wuhan
numerical optimization
complexity of the basic reproduction number. emerging infectious diseases

key: cord- - xz xbh authors: hens, niel; vranckx, pascal; molenberghs, geert title: the covid- epidemic, its mortality, and the role of non-pharmaceutical interventions date: - - journal: eur heart j acute cardiovasc care doi: .
covid- has developed into a pandemic, hitting hard on our communities. as the pandemic continues to bring health and economic hardship, keeping mortality as low as possible will be the highest priority for individuals; hence governments must put in place measures to ameliorate the inevitable economic downturn. the course of an epidemic may be defined by a series of key factors. in the early stages of a new infectious disease outbreak, it is crucial to understand the transmission dynamics of the infection. the basic reproduction number (r( )), which defines the mean number of secondary cases generated by one primary case when the population is largely susceptible to infection ('totally naïve'), determines the overall number of people who are likely to be infected, or, more precisely, the area under the epidemic curve. estimation of changes in transmission over time can provide insights into the epidemiological situation and identify whether outbreak control measures are having a measurable effect. for r( ) > , the number infected tends to increase, and for r( ) < , transmission dies out. non-pharmaceutical strategies to handle the epidemic are sketched and, based on current knowledge, the current situation and scenarios for the near future are discussed.

the world has not seen an epidemic that turned into a pandemic without adequate medicinal products since the h n pandemic in (spanish flu). , there are important similarities as well as key differences. importantly, covid- is not influenza; it is worse. covid- has a wide spectrum of clinical severity, ranging from asymptomatic to critically ill, and ultimately death. [ ] [ ] [ ] [ ] a common and prominent complication of advanced covid- is acute hypoxaemic respiratory insufficiency or failure requiring oxygen and ventilation therapies.
a key difference between covid- and seasonal influenza is the very different reproduction number, b, [ ] [ ] [ ] a key quantity that, together with the recovery rate, k, drives the evolution over time of the susceptible, infected and recovered fractions, s(t), i(t) and r(t), respectively. a graphical depiction of the simple so-called sir model is given in figure . if b < . , the epidemic dies out quickly. if b > . , the infected fraction evolves towards a peak before decreasing again. as can be seen from figure , the initial evolution of the infected fraction is roughly exponentially shaped, prior to reaching the peak. current-day modelling may involve additional compartments (e.g. susceptible-exposed-infected-recovered-susceptible) and factor in as much information as possible from other sources, such as contact information, data from serological surveys, etc. [ ] [ ] [ ] the reproduction number is very different between seasonal influenza, where it is usually around . , and covid- , where it is estimated at about . if neither medication nor vaccines are available, and no nonpharmaceutical interventions are implemented. , [ ] [ ] [ ] this was the number estimated, for example, in the early phases of the hubei epidemic. a few other examples are as follows: for measles, the reproduction number is about - , for mumps it is roughly and for sars around . . [ ] [ ] [ ] [ ] [ ] an important task for the epidemiologist is to estimate b, especially in a newly emerging viral epidemic such as that caused by sars-cov- . epidemiologists use the concept of infectious period, which in itself needs to be estimated from accruing data; they also use the contact rate, and finally the mode of transmission. for covid- , the dominant mode of transmission was established quickly as airborne droplets, while other routes such as faeces are possible. , for the infectious period, reliable data need to be available.
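The SIR dynamics sketched above can be simulated directly. In the article's notation, b is the reproduction number and k the recovery rate, so the transmission rate is b·k; the parameter values below are illustrative assumptions, not estimates from this article.

```python
def simulate_sir(b, k=0.2, i0=1e-4, days=365, dt=0.05):
    """Forward-Euler integration of the simple SIR model.

    b  : reproduction number (transmission rate = b * k, an assumption
         consistent with the notation in the text)
    k  : recovery rate per day (illustrative value)
    Returns final susceptible/infected/recovered fractions and the
    peak infected fraction.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    beta = b * k
    peak = i
    for _ in range(int(days / dt)):
        new_inf = beta * s * i * dt   # flow S -> I
        new_rec = k * i * dt          # flow I -> R
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
        peak = max(peak, i)
    return s, i, r, peak
```

With b above one the infected fraction rises to a peak before declining, while with b below one it decays monotonically, matching the qualitative behaviour described in the text.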
also, it is not a constant, but depends on various factors, such as age, for example. values of five days for the latency period and five days for the infectious period have been put forward, as well as four days for the serial interval, shorter than the incubation period and hence suggesting substantial pre-symptomatic transmission. we will turn to the remaining quantity, the contact rate, soon. a key aspect is that the 'recovered' fraction also includes deaths. this requires careful attention from a public health standpoint. a death rate of, say, . - . % translates in a population of million people to , - , deaths. it is not just the case fatality rate (or the infection fatality rate) that causes distress and disruption, but evidently also the numbers needing intensive care or mechanical ventilation at a given point in time -the critically ill category. the contact rate is the quantity we can and should have an impact on, especially in the absence of vaccines and treatment. [ ] [ ] [ ] [ ] there are three possible strategies. the first one is suppression. it essentially means that the reproduction number is forced below . by imposing very severe contact restrictions on the population, as was done in china minus hubei. this is the quickest way to put out the fire. of course, a large fraction of the population is then kept in the susceptible state, and measures should be in place to avoid the epidemic from flaring up, while monitoring very effectively so that, if it does, suppression measures can be enacted again. clearly, china is in this situation, and likely will be until vaccines and medication are available. cheap, widespread, sensitive and specific diagnostic tools help maintain control. their quick development is also crucial. the second strategy is mitigation. 
here, measures are taken to bring the reproduction number down to a level at which the epidemic is slowed sufficiently so that the number of critically ill cases at any time, t, can be handled by the health care system. it can be supplemented by a temporary capacity increase of the system (e.g. field hospitals, annexes to existing hospitals). the measures taken in belgium aim to lower the reproduction number so that the health system can appropriately deal with covid- patients. what matters is not merely the number of cases but also their severity, even when non-fatal. because of an epidemic's initial exponential growth, even when it is off to a slow start, it is unfortunately true that small causes, such as lockdown parties, can have severe consequences. in addition, the measures will have the required effect if the population is truly closed, or part of a larger population with exactly the same population dynamics. boundary effects, such as transnational contacts (e.g. between norway and sweden), can fatally undermine the mitigation strategy. further, the earlier that contact rates are drastically reduced (severe social distancing), the better. the closer we come to cutting off the virus's transmission mode, the sooner we will change, and hence flatten, the curve. the third strategy, or absence thereof, is counting solely on herd immunity (group immunity). at face value, this appears to be a sensible strategy. it will typically produce a shorter epidemic than with mitigation, and afterwards the population will be immune at group level. that is, the fraction of recovered people (and hence immune for a certain time, e.g. the rest of the season) will be so large that a re-emerging virus will not find enough susceptible population members to push the reproduction number above . , and the epidemic will soon die out. however, the area under the curve will increase, leading to a considerable increase in critically ill patients and deaths.
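The herd-immunity mechanism described above can be quantified with two classical textbook quantities: the herd-immunity threshold 1 - 1/b, and the Kermack-McKendrick final-size relation z = 1 - exp(-b*z) for an unmitigated epidemic. The fixed-point solver below is a sketch of these standard relations, not a calculation from this article.

```python
import math

def final_size(b, iters=200):
    """Solve z = 1 - exp(-b*z) for the final infected fraction (b > 1).

    Plain fixed-point iteration; converges for b > 1 from a start
    near 1 because the map's slope at the root is b*(1-z) < 1.
    """
    z = 0.9
    for _ in range(iters):
        z = 1.0 - math.exp(-b * z)
    return z

def herd_immunity_threshold(b):
    """Immune fraction at which b times the susceptible fraction drops below 1."""
    return 1.0 - 1.0 / b
```

The final size always exceeds the herd-immunity threshold: an uncontrolled epidemic "overshoots" the point at which transmission becomes subcritical, which is one quantitative reason the area under the curve grows so much under this strategy.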
figure shows the effect of reducing the contact rate, or not. philadelphia ignored the warnings of an influenza epidemic among soldiers and organized a wwi-related parade. they closed the city a few days later, when all hospitals were filled to capacity. mass events were evidently also prevalent in the early stages of the covid- pandemic. st. louis implemented what we now term social distancing immediately after detecting the first two cases. the number of deaths per capita was double in philadelphia relative to st. louis. additionally, philadelphia's health care system was completely overwhelmed, while st. louis was able to cope with the epidemic, which killed about million people worldwide. figure depicts what happens if we move from a philadelphia to a st. louis scenario. the total volume of the epidemic will be reduced, as the total fraction of infected population members is roughly equal to 1 - 1/b, but a much more important effect is that the number of infected cases at any point in time remains below the (perhaps enhanced) capacity of the health care system. recall that the number of cases is not relevant when considered in isolation. much more important is the number of critical cases, and the fatality rate. two very important remarks apply. first, the number of actual cases is very different from the number of confirmed cases. china implemented rigorous measures, as did south korea, to identify cases. in europe, this has been difficult to varying degrees during the epidemic onset and peak period. undercount ratios are very different from country to country, implying that epidemiologists need to estimate the actual number of cases from the number of confirmed cases. there are ways to do so, but it adds further uncertainty to the predictions made. second, the infection fatality rate will increase if the health care system is overwhelmed, and by an amount that depends on the extent to which it is.
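The infection-fatality-rate reasoning used here amounts to dividing deaths by the estimated number of infections (seroprevalence times population), and, in the other direction, projecting deaths from an assumed IFR and attack fraction. The sketch below uses purely hypothetical placeholder numbers, not the Belgian figures elided in the text.

```python
def infection_fatality_rate(deaths, seroprevalence, population):
    """IFR = deaths / (infections estimated from serology)."""
    infections = seroprevalence * population
    return deaths / infections

def expected_deaths(ifr, attack_fraction, population):
    """Deaths implied by an IFR and a final attack (infected) fraction."""
    return ifr * attack_fraction * population

# Hypothetical illustration only: 7,000 deaths, 3% seroprevalence,
# population of 11.5 million (none of these are the article's numbers).
ifr = infection_fatality_rate(7_000, 0.03, 11_500_000)
```

The same two functions reproduce the article's style of scenario arithmetic, e.g. deaths implied by a given reproduction number via the fraction ultimately infected.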
estimation of the infection fatality rate is difficult because of the large group of asymptomatic and undiagnosed cases. for the case fatality rate, figures around % have been quoted, although some authors suggest much higher rates if longer time delays were to be taken into account. the infection fatality rate has been estimated to range over . - . %. the immune fractions for austria and the netherlands have been estimated to be around % and %, respectively. estimates based on samples from blood donors, for example, might slightly underestimate the quantity. for belgium, - deaths against the background of - % immunity would suggest an infection fatality rate (ifr) of - %. likely, the death rate is overestimated due to a very inclusive definition of covid- related deaths. should the original reproduction number of . be maintained, in an unmitigated scenario, and assuming an ifr of %, then roughly % would be infected, leading to , deaths. for a reproduction number of . , roughly one-third of the population would become infected, leading to , deaths. mid-april , estimates of the reproduction number in various european countries indicate that it dropped below . , due to social distance measures. the larger the immune fraction, the easier to contain the epidemic in the future. but this comes at the cost of a severely overwhelmed health care system. this can be avoided, and apparently has been, by drastic social distance measures. what will happen next? for this, it is important to recall a few key differences from influenza. anderson et al. compare both on four aspects. first, the infection fatality rate is different and likely higher (about . % for influenza). second, there is infectiousness before the onset of symptoms. current partial knowledge suggests a period of - days before onset, roughly like influenza. third, with covid- , there may be up to - % of mild or asymptomatic cases. 
fourth, while influenza has a short infectious period of a couple of days, for covid- , although still relatively uncertain, it might be around days. anderson et al. conclude that this produces a slowly emerging epidemic, which then accelerates, only to last longer than an influenza epidemic. using mathematical modelling, kissler et al. examine scenarios for the time period ahead, based on current knowledge, as well as realistic but as yet unverified scenarios based on knowledge from beta coronaviruses oc and hku , including the immune period, whether or not reinfection can take place, seasonality, crossimmunization with these other coronaviruses, and the length and severity of lockdown measures. in the absence of pharmaceutical interventions, depending on the scenario, annual, biennial, or even five-yearly outbreaks are expected. such model-based predictions, even when there is considerable uncertainty, can support policy makers in developing a resilience strategy for the period until sufficiently adequate pharmaceutical interventions are possible. these may involve several time-related social distancing measures, preparedness to re-enter lockdown for certain periods, establishing quarantine procedures for individuals and groups, controlling contact between populations, within and between countries, et cetera. the measures taken are intimately linked to strategies aimed at building up some herd immunity in a controlled fashion. the authors have no conflicts of interest to declare. projecting the transmission dynamics of sars-cov- through the postpandemic period. science . epub ahead of print the effect of public health measures on the influenza pandemic in u.s. 
cities clinical characteristics of coronavirus disease in china a novel coronavirus emerging in china -key questions for impact assessment covid- -navigating the uncharted clinical features of patients infected with novel coronavirus in wuhan, china clinical course and outcomes of critically ill patients with sars-cov- pneumonia in wuhan, china: a single-centered, retrospective, observational study clinical characteristics of hospitalized patients with novel coronavirusinfected pneumonia in wuhan, china early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia incubation period and other epidemiological characteristics of novel coronavirus infections with right truncation: a statistical analysis of publicly available case data on behalf of imperial college covid- response team. impact of non-pharmaceutical interventions (npis) to reduce covid- mortality and healthcare demand modeling infectious disease parameters based on serological and social contact data heterogeneity in estimates of the impact of influenza on population mortality: a systematic review transmission potential of the novel coronavirus (covid- ) onboard the diamond princess cruises ship china coronavirus: six questions scientists are asking pathogenicity and transmissibility of -ncov-a quick overview and comparison with other emerging viruses a mathematical model for simulating the phase-based transmissibility of a novel coronavirus does sars-cov- has a longer incubation period than sars and mers preliminary prediction of the basic reproduction number of the wuhan novel coronavirus -ncov serial interval of novel coronavirus (covid- ) infections mitigation strategies for pandemic influenza a: balancing conflicting policy objectives feasibility of controlling covid- outbreaks by isolation of cases and contacts the lancet respiratory medicine. covid- : delay, mitigate, and communicate how will country-based mitigation measures influence the course of the covid- epidemic? 
rethinking herd immunity real estimates of mortality following covid- infection optimizing agentbased transmission models for infectious diseases the authors received no financial support for the research, authorship, and/or publication of this article. key: cord- -j r veou authors: sipetas, charalampos; keklikoglou, andronikos; gonzales, eric j. title: estimation of left behind subway passengers through archived data and video image processing date: - - journal: transp res part c emerg technol doi: . /j.trc. . sha: doc_id: cord_uid: j r veou crowding is one of the most common problems for public transportation systems worldwide, and extreme crowding can lead to passengers being left behind when they are unable to board the first arriving bus or train. this paper combines existing data sources with an emerging technology for object detection to estimate the number of passengers that are left behind on subway platforms. the methodology proposed in this study has been developed and applied to the subway in boston, massachusetts. trains are not currently equipped with automated passenger counters, and farecard data is only collected on entry to the system. an analysis of crowding from inferred origin–destination data was used to identify stations with high likelihood of passengers being left behind during peak hours. results from north station during afternoon peak hours are presented here. image processing and object detection software was used to count the number of passengers that were left behind on station platforms from surveillance video feeds. automatically counted passengers and train operations data were used to develop logistic regression models that were calibrated to manual counts of left behind passengers on a typical weekday with normal operating conditions. the models were validated against manual counts of left behind passengers on a separate day with normal operations. 
the results show that by fusing passenger counts from video with train operations data, the number of passengers left behind during a day's rush period can be estimated within [formula: see text] of their actual number. public transportation serves an important role in moving large numbers of commuters, especially in large cities. transit performance is an important determinant of ridership, and transit services that offer short and reliable waiting times for commuters offer a competitive alternative to driving, which contributes to reduced congestion and improved quality of life. crowding is a major challenge for public transit systems all over the world, because it increases waiting times and travel times and decreases operating speeds, reliability, and passenger comfort (tirachini et al., ). studies show that crowding in public transit increases anxiety, stress, and feelings of invasion of privacy for passengers (lundberg, ). the covid- pandemic has also highlighted the public health risks associated with passenger crowding in transit vehicles. although transit ridership dropped precipitously during the pandemic in cities around the world, concerns about crowding on transit continue as economies re-open, commuters return to work, and agencies plan for the future. when vehicles are overcrowded, commuters may not be able to board the first train or bus that arrives. these commuters are left behind by the vehicle they wished to board, and their number is directly related to various basic performance measures of public transportation. there are a number of technologies that can be used to observe, count, and track pedestrians and pedestrian movements in an area. digital image processing for object detection is an appealing approach for transit systems because surveillance videos are already being recorded in transit stations for safety and security purposes.
the video feed records passenger positions and movements in the same way that a person would observe them, as opposed to infrared or wireless signal detectors that merely detect the movement of a person past a point or their proximity to a detector. the detection of objects in surveillance videos is an invaluable tool for passenger counting and has numerous applications. for example, object detection can be used for counting or tracking passengers, recognizing crowding, and detecting hazardous objects. in a relevant application, velastin et al. ( ) use image processing techniques to detect potentially dangerous situations in railway systems. computer vision aims to duplicate human vision by electronically perceiving, understanding, and storing information extracted from one or more images (sonka et al., ). various techniques exist for processing an image to extract the information needed for object detection. recent methods use feature-based techniques rather than segmentation of a moving foreground from a static background, which was used in the past. then, the detected features are extracted and classified, typically using either boosted classifiers or support vector machine (svm) methods (viola, ; cheng et al., ). svm is one of the most popular methods used in object detection algorithms and especially passenger counting, because it offers a method to estimate a hyperplane that splits feature vectors extracted from pedestrians and other samples (cheng et al., ), differentiating pedestrians from other unwanted features. boosting uses a sequence of algorithms to weight weak classifiers and combine them to form a strong hypothesis when training the algorithm to attain accurate detection (zhou, ). current methods for object detection take a classifier for an object and evaluate it at several locations and scales in a test image, which is time-consuming and creates numerous computational instabilities at large scales (deng et al., ).
the most recent methods, such as region based convolutional neural network (r-cnn), use another method to decrease the region over which the classifier runs and includes the svm. first, category-independent regions are proposed to generate potential bounding boxes. second, the classifier runs and extracts a fixed-length feature vector for each of the proposed regions. finally, the bounding boxes are refined by the elimination of duplicate detections and rescoring the boxes based on other objects on the scene using svms (girshick et al., ) . the bounding box is a rectangular box located around the objects in order to represent their detection (coniglio et al., ; lézoray and grady, ) . the resulting object detection datasets are images with tags used to classify different categories (deng et al., ; everingham et al., ) . an open-source software tool called you only look once (yolo) uses a different method than the above-mentioned techniques for object detection. it generates a single regression problem to estimate bounding box coordinates and class probabilities simultaneously by using a single convolutional network that predicts multiple bounding boxes and class probabilities for these boxes (redmon, ; redmon et al., ) . another advantage of yolo is that, unlike other techniques such as svms, it sees the entire image globally instead of sections of the image. this feature enables yolo to implicitly transform contextual information to the code about classes and their appearance and at the same time makes yolo more accurate, making fewer than half the number of errors compared to fast r-cnn . yolo uses parameters for object detection that are acquired from a training dataset. yolo can learn and detect generalizable representations of objects, outperforming other detection methods, including r-cnn. the ability to train yolo on images has the potential to directly optimize the detection performance and increase the bounding box probabilities . 
the calibration of parameters for object detection using an algorithm like yolo requires training datasets with a large number of tagged images. although a custom training set that is specific to the context of application (e.g., mbta transit stations) would be desirable for achieving the most accurate object detection outcomes, it is very costly to create a large tagged training set from scratch. the common objects in context (coco) dataset is a large-scale object detection, segmentation, and captioning dataset that is freely available to provide default parameter values for yolo. the coco dataset is not specific to passengers or transit stations, but it is a general dataset that includes , images, . million tagged objects and object types, including "person" (lin et al., ) . nevertheless, the tool is effective for identifying individual people in camera feeds, and the use of general training data allows the same tool to be applied in other contexts without requiring additional training data. the proposed methodology aims to estimate the number of left behind passengers at a transit station when trains are too crowded to board. fig. presents a flowchart of the data and methods used in this study in order to provide a roadmap for the analysis described in this paper. the methods rely heavily on two data sources that are automatically collected and recorded (shown in blue): train tracking records that indicate train locations over time, and surveillance video feeds. additional archived data on inferred travel patterns from farecard records is used only to identify the most crowded parts of the system (shown in purple), and manual counts are used to estimate and validate models (shown in red). for model implementation, the proposed models require only the automatically collected input data. 
the first step of the analysis presented in this paper is to identify the stations and times of day when crowding is most likely to cause passengers to be left behind on the platform. this analysis is used only for determining where to collect data to demonstrate the implementation of the proposed model. this step could be skipped for cases in which the locations for implementation are already c. sipetas, et al. transportation research part c ( ) known. the identification of study sites involves a crowding analysis that makes use of two data sources: train tracking records, which denote the locations of trains over time; and origin-destination-transfer (odx) passenger flows, which are inferred from passenger farecard data. peaks in train occupancy and numbers of boarding passengers show where and when passengers are most likely to be left behind, as described in section . . then, section . describes an analysis of surveillance camera views to determine which stations have unobstructed platform views and station geometry that allows the automated video analysis techniques to be used to count passengers. train tracking data, which includes the time each train enters a track circuit, is automatically recorded into the mbta research database. by comparing this data against manual observations of the times that train doors open and close in the station, a linear regression model is estimated to predict dwell time from the train tracking records, as described in section . . this model is used to obtain automated dwell time estimates as inputs to the model of left behind passengers. automated counts of the number of passengers on each station platform are obtained using yolo, an automated image detection algorithm. the parameters of the algorithm are associated with the freely-available coco training dataset, as described in section . the threshold for object identification is calibrated, as described in section . 
, by applying the algorithm to the surveillance video feed and comparing with manual counts of the passengers remaining on the platform after the doors have closed (section . ) and the passengers entering and exiting the platform (section . ). with the parameter values and calibrated threshold, yolo produces estimates of the number of passengers on the platform as a time series. the number of passengers that remain on the platform after the doors close is a raw automated passenger count, as shown in section . . these raw counts are not very accurate as a direct measure (section . ), but they provide a useful input for modeling the number of left behind passengers. a logistic regression is used to predict the probability that a passenger is left behind on the station platform based on automated dwell time estimates and/or automated passenger counts from video. the model parameters are estimated using the manually observed counts of passengers left behind on the station platforms as the observed outcome. in this study, data collected on november , , were used for model estimation. the diagnostics, parameters, and fit statistics are presented for three models in section . . the quality of the proposed models is evaluated through validation against manually collected counts on a different day. in this study, the estimated models are used to predict the number of left behind passengers using automated dwell time estimates and automated passenger counts on january , . the accuracy of the model predictions is then calculated relative to manually observed passenger counts on the same day, as shown in section . . implementation of the model to make ongoing estimates of the numbers of passengers left behind each departing train requires only train tracking data and surveillance video feeds as model inputs.
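The calibration step described above, tuning the object-identification confidence threshold against manual platform counts, can be sketched as a simple grid search. The data structures (per-frame lists of (class, confidence) detections) and the mean-absolute-error criterion are assumptions for illustration; the paper does not spell out its exact procedure.

```python
def count_persons(detections, threshold):
    """Count detections tagged 'person' at or above a confidence threshold.

    `detections` is a list of (class_name, confidence) pairs, an assumed
    representation of one frame's YOLO output.
    """
    return sum(1 for cls, conf in detections
               if cls == "person" and conf >= threshold)

def calibrate_threshold(frames, manual_counts, candidates):
    """Pick the candidate threshold whose automated counts best match the
    manual counts, by mean absolute error over the calibration frames."""
    def mae(t):
        errors = [abs(count_persons(d, t) - m)
                  for d, m in zip(frames, manual_counts)]
        return sum(errors) / len(errors)
    return min(candidates, key=mae)
```

A low threshold over-counts (spurious detections survive) while a high threshold under-counts (partially occluded passengers are dropped), so an intermediate value calibrated against ground truth is the natural choice.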
the manual observations of door opening/closing times and the number of passengers on the platforms are used only for estimating model parameters. the models then produce predictions of the number of passengers left behind each departing train based only on data that is automatically collected. therefore, the numbers of left behind passengers and the associated impact on the distribution of wait times experienced by passengers can be tracked as a performance measure over time. if data feeds were processed as they are recorded, it would also be possible to implement the models to make real-time predictions of the left behind passengers. to test the implementation of object detection with video in transit stations, a first step is to identify locations and times to collect video feeds as well as direct manual observations of left-behind passengers. for this study, stations were selected based on a crowding analysis and evaluation of station geometry and camera view characteristics. the goal was to identify stations with the greatest likelihood of passengers being left behind during a typical morning or afternoon rush and where object detection techniques would be most successful. the analysis focused on the orange line, which is miles long with stations. oak grove and forest hills are the northern and southern end stations, respectively. there are two main reasons for choosing this specific line. first and most important, it has no branch lines, so all travelers can reach their destination by boarding the next available train. this simplifies the identification of left-behind passengers. second, it passes through several transfer stations in the center of boston, which highlights its significance for passengers' daily commuting. a crowding analysis is a necessary step to identify the times and stations where crowding is observed and left behinds have the highest probability of occurring. 
the data used in this part of the analysis have been extracted from the rail flow database in the mbta research and analytics platform. the rail flow dataset includes aggregated boarding and alighting counts by time of day with -min temporal resolution averaged across all days in a calendar quarter. an example is given in fig. for : - : pm in winter . these data are derived from the origin-destination-transfer (odx) model, which makes use of afc and avl systems to infer the flow of passengers within the subway (sánchez-martínez, ). the odx model identifies records from afc that can be linked in order to infer transfers or return trip patterns. for example, a passenger using a charlie card (mbta's farecard) to enter a rail station and later board a bus near a different rail station can be assumed to have used the rail system and then transferred to the bus. another passenger who enters one rail station in the morning and enters a different rail station in the afternoon may be completing a round-trip commute, so the destination of the morning and afternoon trips can be inferred by linking the two trips. some trip origins and/or destinations cannot be inferred, for example if the fare is paid with cash or the trip has only one farecard transaction. for more details about the odx model, the reader is referred to sánchez-martínez ( ), where the model's application inferred the origins of % and the destinations of % of the total number of fare transactions. for the crowding analysis in this paper, cumulative counts of passengers boarding and alighting at each station have been created along the direction of train travel using the aggregated railflow data. for a -min time period, b(n,t) is the cumulative count of all passengers that board trains in the direction of interest at stations preceding and including station n during time interval t.
similarly, a(n,t) is the cumulative count of passengers that are assumed to have exited trains traveling in the direction of interest at stations preceding and including station n during time interval t. it should always be true that a(n,t) ≤ b(n,t), because passengers can only alight a train after boarding it. the difference between the cumulative boardings, b(n,t), and alightings, a(n,t), is the estimated passenger flow, q(n,t), between station n and n+1 during each -min time period. this calculation is approximate, because cumulative counts are calculated for a single -min time period, and real trains take more than min to traverse the length of a line. to calculate the number of passengers per train, the passenger flow per time period must be converted to passenger occupancy, o(n,t) (passengers/train), which is calculated by multiplying the passenger flow by the scheduled headway of trains, h(t) (minutes), at time t. the headway is divided by min to account for the fact that the passenger flow is per -min time period. this measure is an approximation of the number of passengers onboard each train that is based on the assumptions that headways are uniform and passengers are always able to board the next arriving train. in reality, variations in headways may lead to increased crowding after longer headways, increasing the likelihood that some passengers will be left behind. the mbta service delivery policy (sdp) (mbta, ) provides guidelines for reliability and vehicle loads. in the mbta sdp (mbta, ), the maximum vehicle load was explicitly defined as % of seating capacity in the peak hours (start of service to : am; : pm - : pm) and % of the seating capacity in other hours. the sdp notes that accurately monitoring the passenger occupancy of heavy rail transit is not yet feasible on the mbta system.
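The flow and occupancy calculations above can be written out directly: the flow past station n is the cumulative boardings minus the cumulative alightings, and scaling by the headway over the aggregation interval gives passengers per train. The 15-minute default below is an assumed stand-in for the interval elided in the text.

```python
def train_occupancy(b_cum, a_cum, headway_min, interval_min=15):
    """Estimate passengers per train between consecutive stations.

    b_cum, a_cum : lists of cumulative boardings b(n,t) and alightings
                   a(n,t) by station, for one time interval
    headway_min  : scheduled headway h(t) in minutes
    interval_min : aggregation interval in minutes (assumed value)

    Implements q(n,t) = b(n,t) - a(n,t) and
    o(n,t) = q(n,t) * h(t) / interval.
    """
    occupancy = []
    for b, a in zip(b_cum, a_cum):
        assert a <= b, "passengers can only alight after boarding"
        q = b - a                          # passenger flow past station n
        occupancy.append(q * headway_min / interval_min)
    return occupancy
```

As the text notes, this treats headways as uniform; shortening the scheduled headway lowers the estimated occupancy proportionally, which explains the apparent dip-and-rebound pattern when the schedule changes.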
nevertheless, the guidelines from table b in the sdp are used to identify general crowding levels, recognizing that each orange line train is six cars long and has a total of seats. a visualization of average train occupancy for the winter rail flow data is shown in the color plot in fig. a. the color for each station and -min time interval corresponds to the value of o(n, t). since the trains have seats, red parts of the plot indicate large numbers of standing passengers, with dark red indicating crowding near vehicle capacity. this figure shows that in the northbound direction, the most severe crowding occurs between downtown crossing and north station shortly before : pm. note that the crowding appears to decrease before rebounding again at : pm. this is due to the change in scheduled headway at : pm from min to min, which increases occupancy, as calculated in eq. ( ). a more detailed visualization combines transit vehicle location records and inferred origin-destination trip flows from a specific date. as mentioned already, the odx trip flows are constructed with simplifying assumptions about passenger movements; for example, all passengers entering a station are assumed to board the first arriving train. despite such assumptions, however, the model is valuable for many applications. the trajectories in fig. b are associated with the recorded arrival and departure times of each train at each station. the colors are associated with the estimated train occupancy based on the inferred boardings and alightings, assuming that no passengers are left behind. the trajectory plot shows that the headways between trains can vary substantially, especially for the stations north of downtown crossing. longer headways are followed by more crowded trains, because more passengers have arrived to board since the previous train.
the occurrence of left-behind passengers would make actual train occupancies slightly lower for the trains following long headways. those left-behind passengers would then be waiting to board the next train, thereby increasing the occupancy of one or more subsequent trains. tracking the average number of passengers onboard trains provides an indicator of the likelihood of passengers being left behind, because full trains leave little room for additional passengers to board. during the most crowded times of the day, it is also useful to look at the numbers of passengers boarding and alighting trains at each station. passengers are most likely to be left behind at stations where trains arrive with high occupancy, few passengers alight, and many more passengers wait to board. by this measure, north station in the afternoon peak appears to be an ideal candidate for observing left-behind passengers. using the same method for the southbound direction, sullivan square station was identified as an ideal candidate location for data collection in the morning peak. other candidate stations include back bay, chinatown, and wellington stations. in addition to identifying stations with the greatest likelihood of passengers being left behind by crowded trains, the stations that are selected for detailed analysis should also have characteristics that are amenable to successful testing of video surveillance counting methods. there is a variety of station layouts and architectures that contribute complicating factors to the analysis of left-behind passengers, and the goal of this study is to identify the potential of the adopted detection method under the best possible conditions. ideal conditions for the proposed analysis are: • dedicated platform for the line and direction of interest - in this case, all passengers on a platform are waiting for the same train, so any passenger that does not board can be counted as being left behind.
in the case of an island platform, observed passengers may be waiting for trains arriving on either track. in the mbta system, more than half of the station platforms for heavy rail rapid transit in the city center (the most crowded part of the system) meet this criterion. • high-quality camera views - surveillance cameras vary in age, quality, and placement throughout the mbta system. newer cameras have higher-definition video feeds. the quality of the view is also affected by lighting conditions, especially at aboveground stations where sunlight and shadows can affect the clarity of the images. • platform coverage of camera views - the surveillance systems are designed to provide views of the entire platform area for security purposes. in some stations, the locations of columns obstruct the views, requiring more cameras to provide this coverage. surveillance camera views were considered from five stations on the orange line (back bay, chinatown, north station, sullivan square, and wellington) that were identified through crowding analysis as candidate stations. ultimately, north station was selected as the study site for the northbound direction afternoon peak period because the station exhibits consistent crowding and its geometry provided good camera views. samples of the camera views from this station are shown in fig. . manual observations on the platform needed to be collected to establish a ground truth against which to compare alternative methods for measuring and estimating the number of passengers left behind by crowded trains. detailed data collection at north station was conducted during afternoon peak hours ( : - : pm) on midweek days during non-holiday weeks (wednesday, november , , and wednesday, january , ). three observers worked simultaneously on the station platform to record observations.
although train-tracking records (ttr) report the times that each train enters the track circuit associated with a station, there is no automated record of the precise times that doors open and close. since passengers can only board and alight trains while the doors are open, recording these times manually is important for identifying when passengers board trains, when they are left behind, the precise dwell time in the station, and the precise headway between trains. each of the three observers recorded the times of doors opening and closing. the average of these observations is considered the true value. a simple linear regression model shows that observed dwell times (time from doors opening to doors closing) can be accurately estimated from automatic records of ttr arrival and departure times associated with each station. fig. shows the data and regression results combining manual counts for november , and january , . there is no systematic difference between records from different days, and the r² is greater than . , indicating a good fit. all stations from tufts medical center through haymarket and the northbound platform at north station on the orange line ( platforms), three out of four blue line stations in downtown boston ( platforms), and all northbound platforms for the red line from south station to porter ( platforms) meet this criterion. each observer counted the number of passengers left behind on the station platforms after the train doors closed. in order to avoid double-counting, each observer was responsible for observing passengers in a two-car segment of the six-car train (front, middle, and back). some judgement was necessary in determining which passengers to count, because some passengers linger on the platform after alighting the train and some choose to wait for a later train even when there is clearly space available to board.
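the dwell-time regression described above can be sketched with ordinary least squares. only the structure follows the text (regress observed dwell times on ttr departure-minus-arrival differences, then check the fit); the function names and data values below are invented for illustration.

```python
# Illustrative one-predictor OLS sketch of the dwell-time regression.
# ttr_diff: TTR departure minus arrival (seconds); dwell: manually
# observed door-open-to-door-close times (seconds). Values are made up.

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

def r_squared(x, y, a, b):
    """Coefficient of determination for the fitted line."""
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

ttr_diff = [35, 42, 55, 60, 80]   # hypothetical TTR time differences
dwell    = [28, 34, 46, 50, 69]   # hypothetical observed dwell times
a, b = fit_line(ttr_diff, dwell)
fit = r_squared(ttr_diff, dwell, a, b)
```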
the goal of the left-behind passenger count is to measure the number of passengers that are left behind due to crowding within ± passengers of the true number. in addition to counting the number of passengers left behind by crowded trains, it is important for model calibration to get an accurate count of the number of passengers waiting to board each arriving train. given the large number of commuters using the heavy rail system during commuting hours, it is not possible to accurately count this total number of passengers in person. surveillance video feeds of escalators, stairs, and elevators used to access the platform of interest were used to manually count the number of passengers entering and exiting the platform offline. specifically, an open-source software tool was used to track passenger movements by logging keystrokes to the video timestamp during playback (campbell, ) . counts were conducted by watching the surveillance video playback of each entry and exit point from the platform and logging the entry and exit of each individual passenger. the resulting data log records the time (to the nearest second) that each passenger entered and exited the platform. since the platforms of interest serve only one train line in one direction, all entering passengers are assumed to wait to board the next train, and all exiting passengers are assumed to have alighted the previous train. combining these counts with the direct observations of the number of passengers left behind each time the doors close provides an accurate estimate of the number of passengers that were successfully able to board each train. fig. illustrates the cumulative numbers of passengers entering the platform (blue curve) and boarding the trains (orange curve). the steps in the orange curve correspond to the times that the train doors close. 
if passengers are assumed to arrive onto the platform and board trains in first-in-first-out (fifo) order, the red arrow represents the waiting time that is experienced by the respective passenger, which is estimated as the difference between the arrival and the boarding time. a timeseries of the actual number of passengers waiting on the platform is constructed by counting the cumulative arrivals of passengers to the platform over time and assuming that all passengers board departing trains except those that are observed to be left behind. this ground truth for data collected on november , , is shown in blue in fig. . the sawtooth pattern shows the growing number of passengers on the platform as time elapses from the previous train. the drops correspond to the times when doors close. at these times, the platform count usually drops to zero. when passengers are left behind, the timeseries drops to the number of left-behind passengers. one such case is illustrated with the red arrow just before : in fig. .
fig. . selected camera views from north station, orange line, northbound direction.
automated detection of passengers on platforms in video feeds
the yolo algorithm uses pattern recognition to identify objects in an image. the coco training dataset was used to define the object detection parameters in yolo, as described in section . a threshold for certainty can also be calibrated to adjust the number of identified objects in a specific frame. if the threshold is set too high, the algorithm will fail to recognize some objects that do not adequately match the training dataset. if the threshold is set too low, the algorithm will falsely identify objects that are not really present. in order to identify the optimal threshold, frames from camera views were analyzed.
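the construction of the ground-truth platform count can be sketched as follows, assuming per-second timestamps. the arrival times, door-closing times, and left-behind counts are invented for illustration; only the sawtooth logic (count grows with arrivals and resets to the left-behind count when doors close) follows the text.

```python
# Sketch of the ground-truth platform count: cumulative arrivals grow
# between trains, and each door closing resets the count to the number
# of passengers observed left behind. All timestamps are illustrative.

def platform_count(arrival_times, door_close_events, horizon):
    """arrival_times: seconds at which passengers enter the platform.
    door_close_events: (close_time, left_behind) pairs, sorted by time.
    Returns the count of waiting passengers at each second in [0, horizon)."""
    counts = []
    waiting = 0
    arrivals = sorted(arrival_times)
    events = list(door_close_events)
    i = 0
    for t in range(horizon):
        while i < len(arrivals) and arrivals[i] <= t:
            waiting += 1
            i += 1
        if events and events[0][0] == t:
            _, left = events.pop(0)
            waiting = left   # everyone boards except those left behind
        counts.append(waiting)
    return counts

# Two trains: one leaves nobody behind, one leaves two passengers behind.
counts = platform_count([1, 2, 5, 8, 9], [(6, 0), (10, 2)], horizon=12)
```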
each frame was analyzed separately for threshold values ranging from % to % to determine the optimal threshold value in relation to a manual count of passengers visible in the frame. the optimal threshold across all camera views is %, which minimizes the mean squared error between yolo and manual counts, as shown in table . fig. shows the identified objects at each threshold level for the same frame from a camera installed in north station. the input for yolo is a set of frames, each of which is analyzed independently to detect objects. the algorithm runs quickly enough to analyze each frame in less than one second, so the surveillance video feeds are sampled at one frame per second to allow yolo to run faster than real time. although the analysis for this paper was conducted offline, it would be possible to implement the algorithm in real time. the output from yolo is a text file that lists the objects detected for each frame and the bounding box for each object within the image. a time series count of passengers on the platform is simply the number of "person" objects identified in the corresponding frames from each sample video feed. fig. a shows the raw passenger counts on the platform at north station for the time period from : pm - : pm on november , . although there are noisy fluctuations, there is a clear pattern of increasing passenger counts until door opening times (green). a surge of passenger counts while doors are open (between green and red) represents the passengers alighting the train and exiting the platform. passenger counts drop off dramatically following the door closing time (red), except in cases where passengers are left behind. for example, the third train in fig. a arrives after a long headway and shows roughly nine passengers left behind.
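counting "person" objects per frame can be sketched as below. the detection tuples mimic the described yolo text output in spirit only; the exact record format, labels, and confidence values here are assumptions for illustration.

```python
# Hypothetical post-processing of per-frame detections: count "person"
# objects above a confidence threshold. The (label, confidence, bbox)
# tuples are an assumed representation of the YOLO text output.

def count_persons(detections, threshold):
    """Number of detected persons at or above the confidence threshold."""
    return sum(1 for label, conf, _bbox in detections
               if label == "person" and conf >= threshold)

frame = [("person", 0.91, (10, 20, 50, 120)),
         ("person", 0.42, (60, 25, 95, 118)),    # below a 0.5 threshold
         ("backpack", 0.88, (15, 60, 30, 90))]   # not a person

n = count_persons(frame, threshold=0.5)
```

sweeping the threshold over a range of values and comparing against manual counts, as the text describes, selects the threshold minimizing the mean squared error.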
to facilitate analysis of the automatic passenger counts from the surveillance videos, it is useful to work with a smoothed time series of passenger counts. using a smoothing window of ± seconds, the smoothed series is shown in fig. b . this smoothed time series is more suitable for a local search to identify the minimum passenger count following each door closing time. this represents the count of left-behind passengers identified through the automated object detection process. the smoothed video counts from the three surveillance camera feeds used to monitor the northbound orange line platform at north station are shown as the green curve in fig. . the automated passenger counting algorithm clearly undercounts the total number of passengers on the platform. the reason for this large discrepancy is that the algorithm can only identify people in the foreground of the images, where each person is large. therefore, the available camera views do not actually provide complete coverage of the platform for automated counting purposes. furthermore, when conditions get very crowded, it becomes more difficult to identify separate bodies within the large mass of people. the problem of undercounting aside, it is clear that the automated counts generate a pattern that is representative of the total number of passengers on the platform. using regression, the smoothed timeseries can be linearly transformed into a scaled timeseries (the orange curve in fig. ) , which minimizes the squared error compared with the manually counted timeseries. using this scaling method, the data from november , , were used to compare estimated counts of left-behind passengers in the peak periods with the directly observed values. this provides a measure of the accuracy of automated video counts. 
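a minimal sketch of the smoothing and the local-minimum search after each door closing, assuming per-second counts; the window size and data values are illustrative, and the paper's actual window length is not reproduced here.

```python
# Centered moving-average smoothing of noisy per-second counts, then a
# local minimum search after a door-closing time as the left-behind
# estimate. Window sizes and counts are illustrative assumptions.

def smooth(counts, w):
    """Centered moving average over a window of +/- w samples."""
    out = []
    for i in range(len(counts)):
        lo, hi = max(0, i - w), min(len(counts), i + w + 1)
        out.append(sum(counts[lo:hi]) / (hi - lo))
    return out

def left_behind_estimate(smoothed, close_idx, search=10):
    """Minimum smoothed count within `search` samples after doors close."""
    return min(smoothed[close_idx:close_idx + search])

raw = [5, 6, 14, 15, 13, 3, 2, 2, 4, 6]  # surge while doors open, then drop
s = smooth(raw, w=1)
estimate = left_behind_estimate(s, close_idx=5, search=4)
```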
the total number of left-behind passengers estimated by this method is presented in table , where the root mean squared error (rmse) is calculated by comparing the number of passengers left behind each time the train doors close. the scaling process, which makes the blue and orange curves in fig. match as closely as possible, results in substantially overcounted left-behinds, because the scaling factor tends to over-inflate the counts when there are few passengers on the platform. as a direct measurement method, automated video counting is not satisfactory, at least as implemented with yolo. however, fig. shows a clear relationship between the video counts and passengers being left behind on station platforms, so there is potential to use the video feed as an explanatory variable in a model to estimate the likelihood of passengers being unable to board a train. in order to improve the accuracy of estimates of the number of passengers left behind on subway platforms, a logistic regression model is formulated to estimate the probability that each passenger is left behind based on explanatory variables that can be collected automatically. a logistic regression is used to estimate the number of passengers left behind by way of estimating the probability that each waiting passenger is left behind, because the logistic function has properties that are more amenable to this application. since passengers are only left behind when platforms and trains are very crowded, a linear regression has a tendency to produce many negative estimates of left-behind passengers, which are physically impossible. the binary logit model, by contrast, is intended for estimating the probability that one of two possible outcomes is realized (e.g., a passenger is either left behind or not left behind).
the estimated probability from a logit model is always between 0 and 1, so the resulting estimate of the number of left-behind passengers is always non-negative and cannot exceed the total number of waiting passengers. for estimation of the logistic regression, each passenger is represented as a separate observation, and all passengers waiting for the same departing train are associated with the same set of explanatory variables. over the course of a -h rush period, there are typically about trains serving north station, serving , to , passengers per period, and leaving behind well over passengers. logistic regression models are generally expected to give stable estimates when the data set for fitting includes at least observations for each outcome, so there is sufficient data to estimate parameters for a model that is structured this way. the logistic function defines the probability that a passenger is left behind by p = exp(α + β·x) / (1 + exp(α + β·x)), where x is a vector of explanatory variables, β is a vector of estimated coefficients for the explanatory variables, and α is an estimated alternative-specific constant. the estimation of the model can be thought of as identifying the values of α and β that best fit the observed outcomes, where y = 1 corresponds to a passenger being left behind and y = 0 corresponds to a passenger successfully boarding. the underlying assumption in this formulation is that the likelihood of being left behind can be expressed in terms of a linear combination of explanatory variables and a random error term, ε, which is logistically distributed. the explanatory variables that are considered in this study are as follows: 1. dwell time (time from door opening to door closing), or the difference of ttr arrival and departure times; 2. video count of passengers on the platform following doors closing. these explanatory variables can all be monitored automatically, without manual observations.
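the logistic formulation can be written as a short function. the coefficient values below are invented for illustration; only the functional form follows the model described above.

```python
# Minimal sketch of the binary logit: the probability that a waiting
# passenger is left behind is a logistic function of a linear
# combination of explanatory variables. Coefficients are hypothetical.

import math

def p_left_behind(x, beta, alpha):
    """P(y = 1 | x) = exp(alpha + beta.x) / (1 + exp(alpha + beta.x))."""
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))   # equivalent logistic form

# x = [dwell time (s), video count after doors close]; made-up coefficients
beta = [0.05, 0.2]
alpha = -6.0
p = p_left_behind([45, 12], beta, alpha)   # always strictly between 0 and 1
```

multiplying this probability by the number of waiting passengers yields the expected left-behind count for a train departure, which is how the text converts probabilities into counts.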
video counts of passengers on the platform following doors closing are obtained from the object detection process described above. although dwell time is an appropriate explanatory variable because doors stay open longer when trains are crowded, the dwell time is not directly reported in archived databases. as demonstrated in fig. , observed dwell times can be accurately estimated from automatic records of ttr arrival and departure times. this leads to using ttr-reported values of the difference between train arrival and departure instead of dwell times for the model development. since these are essentially the same explanatory variable, we call this difference "dwell time" for the remainder of the paper. initially, three models were estimated, making use of only ttr data (model ), only video counts (model ), and then fused ttr and video counts (model ). the data from november , , were used to develop these models. the number of passengers waiting on the platform (as described in section . ) is used to determine the number of observations for estimating the parameters of the logit model. in total, passengers boarded arriving trains at north station during the rush period and of them were left behind. this leads to a sample size of passengers for the logistic models. models and are simple logistic regressions, each with only one independent variable. neither model has influential values (i.e., values that, if removed, would improve the fit of the model). model uses both ttr data and video counts, so it is important to diagnose the model's fit, especially with respect to the assumptions of logistic regression. first, multicollinearity of explanatory variables should be low. the correlation between dwell time and video count is . and the variance inflation factor is . , both indicating that the magnitude of multicollinearity is not too high. second, no influential values were identified.
third, the logistic regression is based on the assumption that there is a linear relationship between each explanatory variable and the logit of the response, log(p/(1 − p)), where p represents the probability of the response. fig. shows that dwell time is approximately linear with the logit response, while there is somewhat more variability with respect to the video counts. neither plot suggests that there is a systematic mis-specification of the model. a summary of the estimated model coefficients and fit statistics is presented in table . the log likelihood is a measure of how well the estimated probability of a passenger being left behind matches the observations. the null log likelihood is associated with no model at all (every passenger is assigned a % chance of being left behind), and values closer to zero indicate a better fit. a related measure of model fit is also reported, with values closer to 1 indicating a better model. for all three models, the estimated coefficients have the expected signs and magnitudes. the positive coefficients for dwell time and video counts indicate a positive relationship with the probability of having left-behind passengers, which is intuitive. in order to compare models, the likelihood ratio statistic is used to determine whether the improvement of one model over another is statistically significant. the likelihood ratio test statistic, d = −2 (ll_restricted − ll_unrestricted), compares the log likelihood of the restricted model (with fewer explanatory variables) to that of the unrestricted model (with more explanatory variables). comparing model (restricted) to model (unrestricted), one additional variable in model indicates one degree of freedom, which requires d > . to reject the null hypothesis at the . significance level. comparison between models and gives d = . , indicating that model provides a significant improvement over model by adding video counts. comparison between models and gives d = .
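the likelihood ratio comparison can be sketched as follows. the log-likelihood values are hypothetical; the 3.841 critical value is the standard chi-squared cutoff for one degree of freedom at the 5% level, used here as an assumed example since the paper's exact figures are not reproduced.

```python
# Likelihood ratio test sketch: D = -2 * (LL_restricted - LL_unrestricted),
# compared against a chi-squared critical value whose degrees of freedom
# equal the number of extra parameters. Log likelihoods are invented.

def likelihood_ratio(ll_restricted, ll_unrestricted):
    """Test statistic for nested model comparison (larger = bigger gain)."""
    return -2.0 * (ll_restricted - ll_unrestricted)

CHI2_CRIT_1DF_05 = 3.841  # chi-squared critical value, 1 df, 5% level

def significant(ll_r, ll_u, crit=CHI2_CRIT_1DF_05):
    """True if the unrestricted model is a significant improvement."""
    return likelihood_ratio(ll_r, ll_u) > crit

d = likelihood_ratio(-210.4, -203.1)   # hypothetical model fits
```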
, which is also a significant improvement. the akaike information criterion (aic) is an additional model fit statistic that weighs the log likelihood against the complexity of the model. although model has more parameters, its aic is lower than that of model or model , indicating that the improved log likelihood justifies the inclusion of both ttr and video count data. the logistic regression provides an estimate of the probability that passengers are left behind each time the train doors close. in order to translate this probability into a passenger count, the estimated number of passengers waiting on the platform from the scaled video count is used as an estimate of the number of passengers waiting to board. table shows the validation results when the models were applied to data collected on january , , for north station. the scaling factor used for the number of passengers waiting on the platform is estimated from the november , data. considering the estimated number of left-behind passengers for each train separately, it is observed that these models achieve higher accuracy when there are few passengers left behind. overall, model exhibits an error of only . %, since it estimates that passengers are left behind in total when passengers were observed to be left behind. model gives a lower estimate of passengers being left behind, which leads to an error of approximately %. as shown in table and table , direct video counts (unscaled and scaled) do not provide accurate estimates of the total numbers of passengers left behind without some additional modeling. the unscaled video counts underestimate the total, while the scaled video counts overestimate it. the logistic regression provides much better results.
although there are some discrepancies for specific train departures, the estimated numbers of passengers left behind are not significantly biased, and the total number of passengers left behind during the three-hour rush period is similar to the manually counted total. the logistic regressions estimate the probability of a passenger being left behind using only the explanatory variables listed in table . however, the estimated number of left-behind passengers is calculated by multiplying the probability by the scaled video count of passengers on the platform at the time the doors opened, as estimated from the ttr data. therefore, the estimated numbers of passengers left behind with model and model rely only on ttr data that are currently being logged, supplemented by automated counts of passengers in existing surveillance video feeds. the models therefore utilize explanatory variables that are monitored automatically, and they can be deployed for continuous tracking of left-behind passengers without needing additional manual counts. the logistic models could actually perform even better if there were a way to obtain a more accurate count of the number of passengers waiting for a train. during the morning peak period, the count of farecards entering outlying stations can provide a good estimate of the number of passengers waiting to board each inbound train. this is more challenging at a transfer station, like north station, at which many passengers are transferring from other lines. in some cases, strategically placed passenger counters could provide useful data. nevertheless, table presents the performance of the developed logistic regression models if their estimated probabilities are multiplied by the actual number of passengers on the platform instead of the estimated number as in table . this reveals the value of more accurate data, because model decreases its error compared to table .
model in table estimates passengers being left behind in the afternoon rush on the observed date when the previous estimate was , which is a reduction of error from % to % for this model compared to the observed left-behind passengers. another way to evaluate the performance of the developed models is to consider whether or not trains that leave behind passengers can be distinguished from trains that allow all passengers to board. through the course of data collection and analysis, the number of passengers being left behind because of overcrowding can only be reliably observed within approximately ± passengers. the reason for this is that sometimes people choose not to board a train for reasons other than crowding, and one or two passengers left on the platform did not appear to be consistent with problematic crowding conditions. if a train is defined to be leaving behind passengers when more than passengers are left behind, the results presented in table can be reinterpreted to evaluate each method by four measures:
1. the number of trains in a time period that leave behind passengers due to overcrowding.
2. correct identification rate: the percent of trains that are correctly classified as leaving behind passengers or not leaving behind passengers, as compared to the manual count. this value should be as close to 100% as possible.
3. detection rate: the percent of departing trains that were manually observed to leave behind passengers that are also flagged as such by the estimation method. this value should be as close to 100% as possible.
4. false detection rate: the percent of departing trains that are estimated to leave behind passengers but have not, according to manual observations. this value should be as close to 0 as possible.
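the four measures can be computed from per-train observed and estimated left-behind counts as sketched below; the counts and the threshold of 2 passengers are illustrative assumptions, not the paper's values.

```python
# Sketch of the four evaluation measures. A train "leaves passengers
# behind" when its left-behind count exceeds a tolerance threshold.
# The per-train counts and the threshold value are invented.

def classify(counts, threshold):
    """True for trains that leave behind more than `threshold` passengers."""
    return [c > threshold for c in counts]

def evaluate(observed, estimated, threshold=2):
    obs = classify(observed, threshold)
    est = classify(estimated, threshold)
    n = len(obs)
    flagged = sum(est)                                     # measure 1
    correct = sum(o == e for o, e in zip(obs, est)) / n    # measure 2
    pos = sum(obs)
    detect = (sum(o and e for o, e in zip(obs, est)) / pos) if pos else 1.0
    neg = n - pos
    false_det = (sum((not o) and e
                     for o, e in zip(obs, est)) / neg) if neg else 0.0
    return flagged, correct, detect, false_det             # measures 1-4

observed  = [0, 5, 1, 9, 0, 3]   # manual left-behind counts per train
estimated = [1, 6, 0, 7, 3, 0]   # model estimates per train
m = evaluate(observed, estimated, threshold=2)
```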
there is an important distinction to make here, because there are two ways that the model identifying trains that leave behind passengers can be used: (a) to estimate the number of trains that leave behind passengers, in which case we only care about measure ; or (b) to identify which specific trains are leaving behind passengers, in which case measures through are important. depending on how the data will be used, application (a) or (b) may be more relevant. for example, application (a) provides an aggregate measure of the number of trains leaving behind passengers. application (b), on the other hand, is what would be needed to move toward a real-time system for identifying (or even predicting) left-behind passengers. a comparison of the four measures is presented in table for the trains that departed north station between : pm and : pm on january , . unscaled video counts provide a good estimate of the number of trains that leave behind passengers (measure ), but suffer from a low detection rate and a high false detection rate. scaled video counts are poor estimators of the occurrence of left-behind passengers because they are high enough to trigger too many false detections. the modeled estimates both perform well in approaching the actual number of trains leaving behind passengers. model has the best performance for measures through . it never falsely identifies a train as leaving behind passengers, and it correctly detects most occurrences of passengers being left behind. like the count estimates above, both model and model rely on the scaled video counts to estimate the number of passengers waiting on the platform when the train doors open, so a fusion of ttr records and automated video counts provides the most reliable measures. another application of the model is to consider the distribution of waiting times implied by the estimated probabilities that passengers are left behind by each departing train.
from the direct manual counts, a cumulative count of passengers arriving onto the platform and of passengers boarding trains provides a timeseries count of the number of passengers on the platform. if passengers are assumed to board trains in the same order that they enter the platform, the system follows a first-in-first-out (fifo) queue discipline. although it is certainly not true that passengers follow fifo order in all cases, this assumption allows the cumulative count curves to be converted into estimated waiting times for each individual passenger. the fifo assumption yields the minimum possible waiting time that each passenger could experience, and the waiting time for each passenger can be represented graphically by the horizontal distance between the cumulative number of passengers entering the platform and boarding trains (see fig. for data from november , ). the yellow curve in fig. a represents the cumulative distribution of waiting times that is implied by the observed numbers of passengers entering the platform if all passengers on the platform are assumed to be able to board the next departing train. we call this the expected waiting time. the blue curve in fig. a is the cumulative distribution of waiting times if the number of left-behind passengers is accounted for when trains are too crowded to board. we call this the observed waiting time, because it reflects direct observation of passengers waiting on the platform using manual counts. the distribution indicates the percentage of passengers that wait less than the published headway for a train departure, which is the reliability metric used by the mbta. for the orange line during peak hours, the published headway is min ( s). currently, the mbta is only able to track the expected wait time as a performance metric. the difference between the yellow and blue curves indicates that failing to account for left-behind passengers leads to overestimation of the reliability of the system.
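under the fifo assumption, per-passenger waits are the horizontal gaps between the cumulative arrival and boarding curves, and the reliability metric is the share of waits at or below the published headway. a sketch with invented timestamps:

```python
# FIFO waiting-time sketch: with first-in-first-out boarding, the i-th
# passenger to arrive is the i-th to board, so each wait is the gap
# between paired arrival and boarding times. Timestamps (seconds) are
# illustrative; function names are not from the paper.

def fifo_waits(arrival_times, boarding_times):
    """Pairs sorted arrivals with sorted boardings. Passengers still
    waiting at the end of the observation period are simply skipped."""
    return [b - a for a, b in zip(sorted(arrival_times),
                                  sorted(boarding_times))]

def share_within_headway(waits, headway):
    """Reliability metric: fraction of passengers waiting <= headway."""
    return sum(w <= headway for w in waits) / len(waits)

arrivals  = [0, 40, 90, 130, 200]
boardings = [120, 120, 120, 320, 320]   # two trains depart at t=120, 320
waits = fifo_waits(arrivals, boardings)
```

a passenger left behind at the first departure would simply receive the later train's boarding time, lengthening the corresponding horizontal gap, which is exactly how left-behinds shift the observed distribution rightward.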
The models developed in this study provide the estimated probability that a passenger is left behind each time the train doors close. In the absence of additional passenger count data, a constant arrival rate is assumed over the course of the rush period, and the door-closing times from TTR together with the probabilities of passengers being left behind from Model can be used to estimate the cumulative passenger boardings onto trains over time. Under the same FIFO assumptions described above, the distribution of experienced waiting times can then be estimated from train-tracking records and video counts. The cumulative distribution of waiting times estimated by this process, using probabilities from Model , is shown as a red curve in Fig. b; we call this the uniform arrivals modeled wait time.

Table reports the experienced waiting times for the observed, expected, and modeled distributions. It also shows how the accuracy of the estimated waiting times can be improved if we use the actual arrival rate under the same assumptions used to develop the uniform arrivals modeled wait time; we call this distribution the actual arrivals modeled wait time. The earth mover's distance (EMD) is used to measure the difference between the observed distribution and the expected, uniform arrivals, and actual arrivals modeled distributions (Rubner et al.). As shown in Table , the EMD for the expected case is much higher than the EMD for the modeled cases, which indicates that the proposed model reduces errors. The modeled distributions of waiting times closely approximate the observed distribution, which suggests that the estimated probabilities of passengers being left behind each departing train are consistent with the overall passenger experience. The percentage of passengers experiencing waiting times at or below the min published headway is % for both the observed and uniform arrivals model curves, and % for the actual arrivals model curve.
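Because the waiting-time distributions compared above are one-dimensional and carry equal total mass, the EMD reduces to the area between the two cumulative curves, which a short routine can compute directly. This is a generic sketch of that special case, not the paper's implementation (which cites Rubner et al. for the general formulation).

```python
def emd_1d(hist_a, hist_b, bin_width=1.0):
    """Earth mover's distance between two 1-D histograms on a shared grid.

    For equal-mass 1-D histograms, the EMD equals the area between the
    two cumulative curves: accumulate the running difference in mass and
    sum its absolute value, scaled by the bin width.
    """
    assert len(hist_a) == len(hist_b)
    cum_diff, emd = 0.0, 0.0
    for a, b in zip(hist_a, hist_b):
        cum_diff += a - b
        emd += abs(cum_diff) * bin_width
    return emd
```

Intuitively, moving one unit of probability mass one bin to the right costs one bin width, so distributions that put mass at very different waiting times (e.g., the "expected" curve vs. the observed one) produce a large EMD.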
The automated count of left-behind passengers provides a close approximation of the actual service reliability when applied to the independent data collected on January , . The expected distribution, which does not account for left-behind passengers, produces an estimate of % of passengers waiting less than min; it overestimates the reliability of the system by failing to account for the waiting time that left-behind passengers experience.

This paper presents a method for measuring passengers left behind by overcrowded trains in transit stations without records of exiting passengers. A study by Miller et al. also addresses this challenging case, using manual video counts to calibrate the developed models. The methodology proposed in this paper instead uses archived data with automatic video counts as inputs to estimate the total number of left-behind passengers during peak demand periods, where the automatic video counts are obtained through image processing tools. The paper then investigates the effect of accounting for left-behind passengers on the reliability metric currently used by the MBTA, experienced waiting times.

Following a preliminary study of crowding conditions on the MBTA's Orange Line, data collection and analysis focused specifically on northbound trains at North Station during the afternoon peak hours. Data were collected on two typical weekdays and confirmed that overcrowding is a common problem, even on days without disruptions to service. This indicates that the system is operating very near capacity, and even small fluctuations in headways lead to overcrowded trains and left-behind passengers. This study specifically investigated the potential for measuring the number of left-behind passengers using existing data sources
and automated passenger counts derived from existing surveillance video feeds. The analysis of automated passenger counts was based on the implementation of a fast, open-source algorithm called You Only Look Once (YOLO), using existing training sets that identify people as well as other objects. The performance is fast enough that frames from surveillance video feeds could potentially be analyzed in real time. Although video counts were not accurate in isolation, models combining automated video counts with automated train-tracking records (Model ) demonstrated good results for different applications. In predicting the number of trains leaving behind passengers, the developed models can correctly identify whether or not passengers were left behind for % of the trains. The number of passengers left behind during the afternoon rush period can be estimated within % of the actual number using only automated video counts and automatically collected train-tracking records. With actual counts of the numbers of passengers on the station platform at each train arrival, the model can predict the number of left-behind passengers within % of the actual number. Furthermore, the modeled distribution of experienced waiting times reduced the total EMD error by more than % compared with the error of the operator's expected distribution, where left-behind passengers are not considered. This highlights the need to account for left-behind passengers when tracking the system's reliability metrics.

There are a number of ways that this study could be extended. One approach would be to implement and evaluate the developed models over more days. In terms of passenger flow data, the ODX model has some known drawbacks given existing limitations, such as the lack of tap-out farecard data or passenger counters on trains.
In systems without these limitations, the developed models could achieve higher accuracy. The methodology presented here could also be combined with the previous study by Miller et al. to improve the overall process for estimating left-behind passengers in subway systems without tap-out. Comparing the two studies, Miller et al. achieve higher accuracy under very crowded conditions, whereas our method performs better when there are few passengers left behind. The automated object detection presented in our study could also be combined with the model proposed by Miller et al. as part of its real-time implementation for special events where real-time AFC data are not available. In the area of image processing, a number of steps could be taken to improve the accuracy of video counts and extend the feasibility to more challenging station environments; suggested approaches include comparing the algorithm with other fast and accurate video detection algorithms and training the algorithm to detect heads rather than whole bodies. Although there are limitations to any single data source, the potential for improving performance metrics through data fusion and modeling continues to grow.

References:
- Valuing crowding in public transport: implications for cost-benefit analysis
- Simple player
- Uncovering the influence of commuters' perception on the reliability ratio
- Training mixture of weighted SVM for object detection using EM algorithm
- People silhouette extraction from people detection bounding boxes in images
- What does classifying more than , image categories tell us?
- ImageNet: a large-scale hierarchical image database
- Estimating the cost to passengers of station crowding
- Waiting time perceptions at transit stops and stations: effects of basic amenities, gender, and security
- Rich feature hierarchies for accurate object detection and semantic segmentation
- The distribution of crowding costs in public transport: new evidence from Paris
- Crowding in public transport: who cares and why?
- Crowding cost estimation with large scale smart card and vehicle location data
- Does crowding affect the path choice of metro passengers?
- Transit service and quality of service manual
- Discomfort externalities and marginal cost transit fares
- Image processing and analysis with graphs: theory and practice
- Crowding and public transport: a review of willingness to pay evidence and its relevance in project appraisal
- Microsoft COCO: common objects in context
- Urban commuting: crowdedness and catecholamine excretion
- Mining smart card data for transit riders' travel patterns
- Estimation of denied boarding in urban rail systems: alternative formulations and comparative analysis
- Massachusetts Bay Transportation Authority
- Estimation of passengers left behind by trains in high-frequency transit service operating near capacity
- Smart card data use in public transit: a literature review
- A behavioural comparison of route choice on metro networks: time, transfers, crowding, topology and sociodemographics
- Darknet: open source neural networks in C
- You Only Look Once: unified, real-time object detection
- The earth mover's distance as a metric for image retrieval
- Inference of public transportation trip destinations by using fare transaction and vehicle location data: dynamic programming approach
- Image processing, analysis, and machine vision
- Crowding in public transport systems: effects on users, operation and implications for the estimation of demand
- A motion-based image processing system for detecting potentially dangerous situations in underground railway stations
- Feature-based recognition of objects
- Ensemble methods: foundations and algorithms
- Inferring left-behind passengers in congested metro systems from automated data

Acknowledgments: This study was undertaken as part of the Massachusetts Department of Transportation research program. This program is funded with Federal Highway Administration (FHWA) and State Planning and Research (SPR) funds. Through this program, applied research is conducted on topics of importance to the Commonwealth of Massachusetts transportation agencies.

key: cord- -o rchhw authors: el qadmiry, m.; tahri, e.; hassouni, y. title: on the true numbers of covid- infections: behind the available data date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: o rchhw

In December China reported several cases of a novel coronavirus, later called COVID- . In this work, we use a probabilistic method to approximate the true daily numbers of infected, based on two distribution functions that describe the spontaneously recovered cases on the one hand and the detected cases on the other. The impact of the underlying variables of these functions is discussed. The detection rate is predicted to be between . % and . %, which means that there would be about million infected to date ( -May ), rather than the officially declared number of . million worldwide cases.

Since the outbreak of the novel coronavirus COVID- and its spread around the world, several published works have been concerned with analysing the available data to extract information that characterizes the daily evolution of this new pandemic. Among this information are the incubation period [ , ], the reproduction number [ ], the mortality rate [ ], and the asymptomatic proportion, as in [ ], where the data of the Diamond Princess cruise ship were used; this means that the study was done on a closed population, a fact worth mentioning.
Our aim in this work is to approach the true number of infected cases, and to develop an analytical method that allows the simulation of different scenarios that can occur if we modify the underlying variables of two special probabilistic functions. These probabilistic functions control the numbers of the infected as well as the numbers of the detected cases. To predict the evolution of the true number of cases, which escapes the official counting, we take the reproduction number into account. The importance of this investigation lies in the fact that an important part of the infection comes from people who are not in hospitals or not isolated, i.e., virus transmission generally occurs before symptoms appear [ ]. We therefore think that the numerical simulation achieved here can help estimate the consequences of different strategies adopted at the governmental level, such as social distancing, strict confinement for a short period, mass testing, and other measures that aim to stop the evolution of the pandemic.

In this section, we introduce the distribution functions that describe the spontaneous-outcome cases and the detected cases, in addition to the reproduction number responsible for exponential growth. When there is no medical intervention in favor of the infected, the biological defense system has to withstand alone as a natural solution, so the cases eventually end in either recovery, disability, or death. Thus, for a closed population initially infected, the number of infected keeps steadily decreasing from illness onset until the end of infection, depending on many factors including age, health status, viral load, immunity, etc. Considering these facts, we suggest a suitable distribution function to lower the number of the infected population through time.
It is denoted ps(t), with the (s) subscript standing for survival, and its analytical expression is chosen so as to be compatible with the empirical observation taken from the data available in [ ], based on control patients. It is given as follows:

[equation image not recovered]

(It is made available under a CC-BY . International license. The copyright holder for this preprint, which was not certified by peer review, is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version was posted May , ; https://doi.org/ . / . . .)

where H( ) is the Heaviside step function, equal to one below its threshold and zero otherwise; here it ensures that the function vanishes once the time exceeds the period of days since illness onset, because no infected remain after this period. We will see that this cutoff does not have a significant impact on the results. Let us note also that we do not use polynomial regression to describe ps(t), because it would demand a high polynomial degree. In Fig. , we show the behavior of the ps(t) function.

The real curve can be extracted from the data concerning control patients in drug-treatment experiments, as for example in [ , ]; here we use the data of the first one, as mentioned above. The comparison is shown in Fig. , where we take the initial number of infected to be . Thus, we plot the expected survival cases at every day, given by the product × ps(t), together with the real survival cases following the data. The standard error equals , which means that we have a good approximation.

[Figure: survival probability vs. days since illness onset]
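The exact analytical form of ps(t) is not recoverable from this extracted copy, but its two stated ingredients are: a decaying survival term and a Heaviside truncation that forces the function to zero past a fixed number of days. A placeholder sketch, with an assumed exponential decay and entirely hypothetical `rate` and `cutoff` values (the paper's own expression and numbers are scrubbed here):

```python
import math

def heaviside(x):
    """Heaviside step: 1 for x >= 0, else 0 (the paper's truncation device)."""
    return 1.0 if x >= 0 else 0.0

def p_survival(t, rate=0.12, cutoff=25.0):
    """Probability that a case is still infected t days after illness onset.

    Placeholder only: exponential decay stands in for the paper's scrubbed
    expression, and both `rate` and `cutoff` (days to the forced-zero
    truncation) are hypothetical illustrative values.
    """
    return heaviside(cutoff - t) * math.exp(-rate * t)
```

The Heaviside factor reproduces the behavior described in the text: whatever the decay term, the survival probability drops to exactly zero once the truncation period is exceeded.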
There is no doubt that the case counting is not perfect, because many factors may affect the results, such as test accuracy, viral load, medical protocol, and testing strategy. Nevertheless, we can estimate another cumulative distribution function (CDF), which describes the probability of detecting the infected cases among the entire population. This function is well known and has been used for several previous epidemics (see [ ]), e.g., the SARS coronavirus, rhinovirus, and influenza B. The logistic distribution is useful for characterizing this CDF. It has two parameters: the first, denoted m, represents the maximum absolute accuracy of testing; the second, denoted δ, represents the incubation period of COVID- , which is estimated to be between . and . days (see [ ]), e.g., fixed to . days in [ ] and . days in [ ]. Let us denote this distribution function pd(t), with the (d) subscript standing for detected and t representing days since infection, and write its explicit expression in the following manner:

[equation image not recovered]

In what follows, we choose the value of the incubation period as δ = . days, following the result of reference [ ] (where it was chosen for the accuracy of the data used in the analysis). Concerning the testing accuracy, we take the maximum equal to m = %, a quantity that varies widely with the testing strategy and with the asymptomatic proportion, estimated in [ ] to be . %. In Fig. we show the evolution of the pd(t) CDF under the above conditions, from which we get a probability of detection of . % after days, % after days, and % after days.

[Figure: survival cases vs. days since admission]
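The logistic form of pd(t) can be sketched directly from the two named parameters, m (maximum detection accuracy, the curve's upper plateau) and δ (incubation period, at the curve's midpoint). The numeric defaults and the logistic `scale` parameter below are illustrative stand-ins, since the paper's actual values are scrubbed in this copy:

```python
import math

def p_detect(t, m=0.5, delta=5.2, scale=1.5):
    """Logistic CDF for the probability an infection is detected by day t.

    m:     maximum detection accuracy (the value approached as t grows)
    delta: incubation period in days (midpoint of the curve, where
           p_detect = m / 2)
    scale: logistic scale parameter; NOT given explicitly in the text,
           assumed here purely for illustration.
    """
    return m / (1.0 + math.exp(-(t - delta) / scale))
```

The shape matches the qualitative description in the text: detection probability is low during the incubation period, rises quickly around day δ, and saturates at m rather than at 1, reflecting imperfect testing.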
The reproduction number is an important index for evaluating the spread of an epidemic. It is defined as the number of infections caused by an infectious person during the infectious period, and it is denoted R . What interests us in this work is the number of infection-producing contacts per day, described by a function of time denoted r(t) and obtained simply as the ratio of the reproduction number R to the infectious period τ (neglecting the latent period). This quantity is sensitive to confinement measures and decreases over time. We suppose that r(t) follows an exponential evolution, which we can justify by the fact that at early times it is much easier to reduce infected-healthy contacts than at later times. Therefore, we can write it as:

[equation image not recovered]

[Figure: detection probability vs. days since illness onset]

where r = R /τ is the initial value calculated from the value of the reproduction number R , which is itself estimated according to several considerations as . in [ ] or . in [ ]. Here we use the latter as a precaution against bad decisions; thus r = . . The parameter d characterizes the isolation strategy: by definition, it is the number of days required to reduce the initial value r by a factor 1/e ≈ 0.37. In our case, we must choose d = days to make the simulation of detected cases compatible with the worldwide data of confirmed cases.

Let us first run a simulation to see what happens for a semi-closed infected population, where it is only possible to lose infected individuals without gaining any. We fix the initial population at n = infected and extract every detected case following the pd(t) CDF.
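The exponential contact-rate function just defined, r(t) = (R0/τ) · exp(−t/d), can be written directly; the values of R0, τ, and d below are illustrative, as the text's own numbers are scrubbed in this copy:

```python
import math

def daily_contact_rate(t, r0_total=2.5, tau=10.0, d=20.0):
    """Infection-producing contacts per day: r(t) = (R0 / tau) * exp(-t / d).

    r0_total: basic reproduction number R0 (illustrative value)
    tau:      infectious period in days (illustrative value)
    d:        days needed to cut the initial daily rate by a factor 1/e,
              characterizing the isolation strategy (illustrative value)
    """
    return (r0_total / tau) * math.exp(-t / d)
```

By construction, evaluating at t = d returns exactly the initial rate divided by e, which is the defining property of the parameter d stated in the text.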
To carry out this simulation, at every time tn = n days we recalculate the numbers of infected and detected cases, where t = is the moment of illness onset for the entire population. The number of infected N(t) at tn is then given by:

[equation image not recovered]

where we introduce a new distribution function p(t), written as follows:

[equation image not recovered]

The number of detected cases Ñ(t) among the infected cases is given by:

[equation image not recovered]

where the tilded function P̃(t) is a CDF defined by:

[equation image not recovered]

The behavior of these two numbers N and Ñ with respect to time since illness onset is given in Fig. . It is noteworthy that the detected cases reach their maximum value near the incubation period and decrease afterward. The accumulated detected cases fall short of the initial number of infected N by a rate of % (the majority of the infected are not detected). In the following section, we shall consider each set of new cases as a semi-closed population. Remark that after only days, no infected remain free: they are either isolated or recovered. For this reason, we know that the period from illness onset to death does not play a significant role in the detection mechanism.

Now, to apply the method to the worldwide situation, the infected population must be considered as an open population, where it is possible both to gain and to lose infected individuals. In essence, every day we must recalculate its remaining cases as follows:

[equation image not recovered]

[Figure: daily number of cases vs. days since illness onset]

where we suppose that the new cases are given through exponential growth [ ]. Explicitly, if we suppose that the latent period [ ] is zero so as
to be able to multiply by r(t) since the first day of infection, we can then write:

new cases(tn) = r(tn−1) × remaining cases(tn−1).

In general, we can write the true number of infected cases as a recurrence relation:

[equation image not recovered]

where N(tn−i) = 0 for times before the outbreak. Likewise, the number of detected (confirmed) cases can be expressed as a recurrence relation as follows:

[equation image not recovered]

where the first terms under the summation symbol represent the detected cases from the previously generated cases, and the second term represents the detected cases from the initial number of infected (which is equal to zero if tn > days).

Thus, using the above equations ( ) and ( ) and taking an initial number of infected cases n = (note that there were cases at -Dec according to [ ], with a claimed detection rate of %), we can realize the simulation showing the evolution of the true infected cases and that of the detected ones. The result is given by the two curves in Fig. .

Figure : the evolution of the true infected cases N(t) (blue) and the detected cases Ñ(t) (orange) as functions of time.

Looking at the two curves, we observe a great difference between them: the confirmed cases appear much smaller than the true infected cases, with a detection rate that increases across time (Fig. ), beginning with an average of . % in the first days and ending with an average of % in the final days before the disappearance of the pandemic. These results match well with the results of the studies achieved in [ ] and [ ].
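The open-population recurrence described above (each day's new cases proportional to the currently free cases, with each infection cohort thinned over time by recovery and by detection) can be sketched as a toy simulation. All functional forms and parameter values here are illustrative stand-ins, not the paper's:

```python
import math

def simulate_epidemic(days=120, n0=40, tau=10.0, r0_total=2.5, d=20.0,
                      m=0.5, delta=5.2, scale=1.5, recovery_rate=0.12):
    """Toy open-population recurrence: track daily cohorts of infections.

    Each day t, the "free" cases (still infected, not yet detected) seed
    r(t) * free new infections for the next day. All parameter values and
    the exponential survival placeholder are illustrative only.
    Returns (true_free, detected): per-day counts of free infected cases
    and of currently-infected detected cases.
    """
    def r(t):               # infection-producing contacts per day
        return (r0_total / tau) * math.exp(-t / d)

    def p_surv(age):        # placeholder: still infected `age` days after onset
        return math.exp(-recovery_rate * age)

    def p_det(age):         # logistic cumulative detection probability by age
        return m / (1.0 + math.exp(-(age - delta) / scale))

    cohorts = [float(n0)]   # cohorts[i] = cases whose onset was day i
    true_free, detected = [], []
    for t in range(days):
        free = sum(n * p_surv(t - i) * (1.0 - p_det(t - i))
                   for i, n in enumerate(cohorts))
        det = sum(n * p_surv(t - i) * p_det(t - i)
                  for i, n in enumerate(cohorts))
        true_free.append(free)
        detected.append(det)
        cohorts.append(r(t) * free)   # tomorrow's newly infected cohort
    return true_free, detected
```

Even in this simplified form, the detected counts stay well below the free infected counts whenever the maximum detection accuracy m is below one, which mirrors the gap between the two curves discussed in the text.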
In Fig. we can see the daily evolution of the detection rate against the worldwide data [ ]. This is what we do in Fig. : we compare the theoretical curve (dashed) with the real curve (continued) up to the present time.

Figure : the graphs represent the real (continued) and theoretical (dashed) curves of daily confirmed cases.

Furthermore, we can make a numerical comparison between the simulated and the real accumulated confirmed cases Σ Ñ( ), which are . M and . M respectively (up to -May [ ]). On the other side, the true accumulated cases Σ N( ) give about . M cases rather than . M.

Across this work, we developed a new method to simulate the true evolution of the COVID- pandemic. The method consists of three concepts modeled by three analytic functions: the spontaneous outcome, the detected cases, and the well-known exponential growth. The results show in particular that the simulated curve of the true infected cases gives a much larger number than the confirmed one. The method is presented in a way that can be extended to other pandemics and/or to a given country. Furthermore, this simulation may help in predicting the impact of various measures taken to tackle the epidemic.

[Figure: days since the outbreak]
References:
- Incubation period and other epidemiological characteristics of novel coronavirus infections with right truncation: a statistical analysis of publicly available case data
- The incubation period of coronavirus disease (COVID- ) from publicly reported confirmed cases: estimation and application
- The reproductive number of COVID- is higher compared to SARS coronavirus
- Real estimates of mortality following COVID- infection. The Lancet Infectious Diseases
- Estimating the asymptomatic proportion of coronavirus disease (COVID- ) cases on board the Diamond Princess cruise ship
- Feasibility of controlling COVID- outbreaks by isolation of cases and contacts. The Lancet Global Health
- Association of treatment dose anticoagulation with in-hospital survival among hospitalized patients with COVID-
- Hydroxychloroquine and azithromycin as a treatment of COVID- : results of an open-label non-randomized clinical trial
- Incubation periods of acute respiratory viral infections: a systematic review. The Lancet Infectious Diseases
- The incubation period of -nCoV infections among travellers from Wuhan
- Incubation period of novel coronavirus ( -nCoV) infections among travellers from Wuhan, China
- Estimation of the transmission risk of the -nCoV and its implication for public health interventions
- Transmission potential and severity of COVID- in South Korea
- Detection rate of SARS-CoV- infections is estimated around six percent
- Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV- )

key: cord- -wjapj w authors: liou, je-liang; hsu, pei-chun; wu, pei-ing title: the effect of china's open-door tourism policy on taiwan: promoting or suppressing tourism from other countries to taiwan? date: - - journal: tour manag doi: . /j.tourman. .
sha: doc_id: cord_uid: wjapj w

This study employs an extended gravity model to analyse the complementary or competitive relationship between the numbers of inbound tourists, and the corresponding tourism revenues, from China and from other nations under the implementation of China's open-door tourism policy to Taiwan in . A simulation for – demonstrates the sustained impact of this policy. The results show that the number of tourists to Taiwan from China reached its peak in at % and will decrease to % by ; the corresponding tourism revenue will decrease from % to % over the same period. The results also show that if the number of tourists from China remains above , , the numbers of tourists from Japan, Hong Kong, Australasia, North America, and Europe will still increase. However, the numbers of tourists from South Korea and from South and Southeast Asia will increase continuously regardless of the number of tourists from China, even far below , .

Tourists travelling from different regions or nations generate economic revenues for destination nations, so travel and tourism play important roles in the economic development of some nations, such as Fiji (Aresh, Umar, & Aryan; Eilat & Einav). Fiji, one of the nations in the Pacific region, is a typical tourism nation for tourists from Australia, New Zealand, the US, Canada, the UK, and Japan. Fiji's national income is not as large as that of other nations, so revenue from tourism is relatively more important for it than for other nations; tourism revenue reached % of Fiji's GDP in (Eilat & Einav). Similarly, tourism revenue accounted for approximately % of total government revenue in for the Maldives (Statistics & Research Section, Ministry of Tourism, Maldives). In , the World Tourism Organization of the United Nations (henceforth UNWTO) predicted that the total number of tourists will reach . billion by and that tourism will be one of the major sources of revenue for developing nations (UNWTO).
Data by UNWTO also indicate that tourism in Asia and the Pacific region contributes % of the total tourism revenues of all nations. The development of tourism directly benefits revenues and employment opportunities in the tourism sector and indirectly encourages improvement of and investment in new infrastructure and the reformation of public transportation networks for destination nations (World Economic Forum (WEF)). Tax revenues are thus expected to increase. Tourism not only benefits the revenue of a nation as a whole but also has specific benefits for a city or a region within a nation (Neuts; Tang & Abosedra). The strong positive connection between tourism and employment opportunities is even more significant in ecotourism (Laterra et al.).

To determine the relative advantage of each nation's tourism attraction, the Travel & Tourism Competitiveness Index (TTCI) was constructed by the World Economic Forum (WEF, various years). The TTCI is a comprehensive index for calculating each nation's travel and tourism competitiveness. Because the TTCI is a composite index, it is difficult to identify the performance of any specific factor of a sub-index or category of the index for a specific nation (Hanafiah & Hemdi; Joshi, Poudyal, & Larson; Weaver, Kwek, & Wang). Thus, if a certain factor is prominent or important, it must be calculated individually. Studies by Song and Li have determined the influence of the relative commodity price level on travel and tourism. Other studies have indicated that factors such as the security of the travelling spot, gourmet food, and scenic views are crucial for tourism decisions (Cîrstea; Enright & Newton). Martí and Puertas noted that in Europe the tourism industry is important for reducing poverty and regional differences.

[Table: total number of tourists from the major nations to Taiwan, by year. Note: the other four inbound nations are India, Thailand, the Philippines, and Vietnam. We were unable to obtain data on the daily expenditures of these four nations; to be consistent with the presentation of total daily expenditures in the follow-up analyses, the presentation of the total number of tourists combines these four nations into one group.]

Tourism increases revenues for destination nations. Durbarry and Sinclair studied tourism demand in France and concluded that Italy, Spain, and Great Britain (henceforth GB) accounted for % of outbound tourists to France. Their study further indicated that from to , the number of tourists to GB decreased by %, and tourism revenue for the nation decreased by %. In contrast, the numbers of tourists to Italy and Spain increased by % and %, respectively, and their tourism revenues increased by % and %, respectively. This evidence not only shows the consistent relationship between the number of inbound tourists and the amount of tourism revenue but also demonstrates tourism competition among nations. Thus, each nation uses different ways to attract tourists (Harb & Bassil; Kozak, Kim, & Chon; Maráková, Dyr, & Tuzimek; UNWTO).

The UNWTO has compiled a complete tourist record for each nation since (World Tourism Organization UNWTO). The record shows that the total number of tourists visiting Taiwan was approximately . million in and million in (World Tourism Organization UNWTO). The number of tourists over the years increased by approximately . million, with an average annual increase of , tourists. Previously, Japan had the largest share of inbound tourists in Taiwan, and tourists from the United States (henceforth the US) were ranked second. However, this situation changed in : the number of inbound tourists from Japan decreased, as did its share of total inbound tourists, and a similar situation was observed for inbound tourists from the US. The UNWTO began recording data on tourists from China to Taiwan in .
in , there were only , tourists to taiwan. before , only chinese living overseas, studying abroad, with permanent residency in other nations, or transferring to other nations for business purposes were allowed to travel to taiwan. there was clearly a change in . prior to , the total number of tourists from china was approximately . million, accounting for . % of the total number of inbound tourists to taiwan. this exceeded the number of tourists from many other nations, including japan and the us, in terms of both the number and the share. this situation reversed in , when the number of tourists from china decreased dramatically from its highest level of . million in to . million in and further decreased to . million in . the reason for this significant variation was the implementation of china's open-door tourism policy to taiwan (hereafter open-door policy) in . in , the quota of tourists allowed to visit taiwan was relaxed, but it was tightened in when the ruling party of the central government in taiwan changed. thus, this policy has highly political connotations. it is quite different from regular tourism policies, which are designed to limit or attract tourists based on tourists' personal qualifications. theoretically and ideally, increasing the number of inbound tourists from any nation should have a positive impact on taiwan's economy. it is generally believed that an increase in the number of tourists will create more employment opportunities in the travel industry, generate more revenue from the tourism sector, and encourage frequent cultural exchanges among nations (ap & crompton, ; kwek & lee; omkar, poudyal, & larson, ). however, these positive impacts may not occur if the number of inbound tourists is restricted by policies implemented by other nations.
that is, just as a tremendous number of tourists can pour into taiwan and raise tourism revenue under the open-door policy, a reversal of that policy can cause both the number of tourists and tourism revenues to decline. the changes in the number of chinese tourists to taiwan stated above from to and to are obvious evidence. in a broad sense, the open-door policy can be categorized as a tourism policy under the ttci categories. however, when the implementation of such a policy is imposed by other nations, it affects the nature of tourism as an act of free movement. for tourists, the selection of destinations is not a free choice but must be approved by home nations. for destination nations, this causes the potential number of inbound tourists and potential tourism revenue to become highly uncertain. as such, improvement in any facility or other factor might not be useful for attracting inbound tourists. this uncertainty means that tourism flows to destination nations may change or remain the same regardless of their own efforts. for instance, tourists and the associated tourism revenues from other nations might decline due to an increase in the number of tourists from china. specifically, if the number of inbound tourists declines, this will reduce travel expenditures, and the overall tourism revenue in taiwan will decrease. however, this pessimistic situation may not occur. an increase in the number of tourists from china might instead attract more tourists from other nations. the purpose of this study is to employ an extended gravity model (egm) to explore the relationship between the change in the number of inbound tourists and the corresponding tourism revenue from china and from visitors from other major nations to taiwan in - under china's open-door policy to taiwan. to the best of our knowledge, this study is the first to analyse the change in the number of tourists to taiwan and tourism revenue under the open-door policy.
the innovation of this study is that a policy factor imposed by a nation other than taiwan is included in the egm. this factor means that the number of tourists to taiwan is largely controlled by other nations. the analysis in this study not only empirically allows us to identify the impact of a particular factor in the egm but also provides deeper insight into tourism management within the egm framework. the simulation for - observes the sustained impact of this policy on the number of tourists visiting taiwan and the change in tourism revenue for different nations. the remainder of this paper is arranged in four sections. the second section presents the egm for inbound tourists to taiwan proposed in this study. the third section describes the selection of variables and data sources used in the empirical analyses. the fourth section presents the results and discussion. the final section proposes a conclusion. according to the world tourism barometer prepared by the unwto ( ), world tourism can be classified into five regions: europe, america, asia, the pacific islands, and the middle east. in , the total number of tourists in the asia and pacific island regions was only . million, but this number dramatically increased to million in . these regions had a rapidly growing tourism market. approximately . million tourists from taiwan to japan in accounted for . % of the total outbound tourists from taiwan. japan was ranked fourth in the world and first in the asia pacific region as a tourist destination. among the reasons that tourists selected japan as a destination, the "attitude of the population towards foreign visitors" and "convenience of ground transportation" were ranked highest. in , japan attracted approximately . million tourists from around the world (world tourism organization unwto, ). in terms of inbound tourism in taiwan, the tourism sector started in . in the early stage, inbound tourists were mainly from the us.
a large number of tourists came from japan in . to attract more tourists to taiwan, the ministry of foreign affairs in taiwan waived visa requirements in for tourists from france, gb, germany, spain, italy, the netherlands, austria, belgium, portugal, switzerland, singapore, japan, the us, canada, new zealand, and australia. the total number of tourists was approximately . million in and increased to more than million in . there has been a significant increase in the number of tourists to taiwan in the past years. the largest number of tourists comes from japan, with . million tourists in and approximately . million in , accounting for . % and . %, respectively, of the total number of inbound tourists. the share of inbound tourists from the us was . % in and was still higher than % ( . %) in . data from the world tourism organization (unwto, ) show that tourists who came to taiwan in - were from nations. however, only a few tourists came from many of these nations, and there was no variation over the years. tourists mainly come from nations, which are the nations used in our analyses: australia, canada, china, france, germany, gb, hong kong, indonesia, india, italy, japan, south korea, malaysia, new zealand, singapore, thailand, the netherlands, the us, the philippines, and vietnam. the unwto began recording the number of tourists visiting taiwan from china in . no data are available from the unwto on the number of tourists from china visiting taiwan before the open-door policy (i.e., before ). as a result, data on the number of inbound tourists from china during - must be obtained from other sources. the data used in this study were obtained from the mainland affairs council, republic of china, taiwan ( ) and from the ministry of the interior national immigration agency, republic of china, taiwan ( ). table provides the number of inbound tourists to taiwan for the abovementioned nations and a group of other nations in - .
table shows that the number of tourists from china constituted approximately . % of the total tourists visiting taiwan in and slightly increased to . % in . in the same period, tourists from japan constituted . % of tourists in and . % in . before china's open-door policy, the highest share of tourists visiting taiwan came from japan. the shares of tourists from japan and china were basically stable. however, the implementation of the open-door policy in significantly increased the number of tourists from china, which represented the largest share of inbound tourists in at . %, the second year after the implementation of the policy. in , the total number of tourists from china reached . million and constituted . % of tourists. in contrast, the share of tourists from japan dropped significantly to . %. this policy has completely changed the composition of inbound tourists in taiwan. various waves of the open-door policy have been implemented since . the first wave, beginning in , allowed people from provinces to travel to taiwan in tour groups that entered and left together. the second wave extended the policy to provinces in . the policy was further extended to provinces in . people from beijing, shanghai, and xiamen were allowed to travel to taiwan individually in , with a total quota of tourists per city per day. the quota was later extended to tourists per day for each city, and people from tianjin, chongqing, nanjing, guangzhou, hangchow, and chengdu were added to the list. the quota for each city was further extended to tourists per day in and to tourists per day in . the wef identified three categories of factors regarding tourism competitiveness for each nation in . these categories are international openness and price competitiveness in relation to the sustainability of travel and tourism development, the availability and quality of all types of transportation, and the number of natural spots and areas as well as cultural and known heritage sites (unwto, ).
one more category, the tourism environment, which includes business security, health, and human resource-related factors, was added by the wef in . from to , although the overall ranking of taiwan increased compared to the rest of the asia pacific region and taiwan was ranked among the top of nations, the overall tourism performance of taiwan as measured by the ttci was unimpressive. however, from to , some individual indexes, such as the primary educational enrolment rate, lack of malaria incidence, hiv prevalence, purchasing power parity, and fixed telephone lines, were ranked in the top during this period. mobile network coverage was ranked number one globally in . the gravity model is commonly used for issues involving cross-border movements, such as international trade, migration, and travel. marti and puertas ( ) used a gravity model with the ttci to examine the competitiveness of tourism in the euro-mediterranean region. the performance of the ttci is used as a tourism industry development guideline by many developing nations, although some indices have been adjusted to fit nations' unique concerns (lall, ). cîrstea ( ) used the ttci to analyse the most competitive nations, including france, germany, the us, japan, and singapore, and concluded that these nations were not a homogeneous group. that is, differences exist among the nations, and each nation has its own advantages. in addition to considering traditional variables (i.e., the gdp, population, and distance between sites), bikker ( ) extended the traditional gravity model to include variables that have special or particular meaning for sites (i.e., nations) and called it the extended gravity model (egm). the estimation of the egm can determine factors that influence international trade, and the model can be applied to tourism.
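before turning to the extension, it may help to recall the baseline that bikker extended: the traditional gravity model relates a bilateral flow to the economic masses of the two sites and the distance between them. a minimal sketch (the symbols here are ours, not the paper's) is:

```latex
T_{ij} \;=\; G \,\frac{M_i^{\alpha}\, M_j^{\beta}}{D_{ij}^{\theta}}
\qquad\Longrightarrow\qquad
\ln T_{ij} \;=\; \ln G + \alpha \ln M_i + \beta \ln M_j - \theta \ln D_{ij},
```

where T_ij is the flow (here, tourists) from origin i to destination j, M_i and M_j are economic masses such as gdp or population, and D_ij is the distance between them. taking logs makes the model linear in its parameters, which is what allows the egm to append further site-specific regressors.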
park and jang ( ) used the egm to analyse nations from to and found that the major factors were not only the gdp, population, and distance but also natural and cultural resources, infrastructure for tourism, price competitiveness, and political and policy factors (e.g., the process of applying for a visa). certain types of infrastructure, such as public transportation, have been included in the gravity model (e.g., khadaroo and seetanah's study, ) to study their effect on tourists. moreover, both economic and non-economic factors affect tourism. vietze ( ) noted that a common culture, such as the same or a similar language, was a decisive factor for the selection of visiting nations. climate factors may also affect tourism demand (cohen & cooper, ; lorde, li, & airey, ; yingsha, li, & wu, ) . past literature shows that the application of the egm to tourism issues mainly addresses the identification of the major factors that influence travel and tourism decisions. moreover, the application of the egm in past studies has been used to explore the competitiveness among various nations. these nations usually have their own advantages and disadvantages in attracting different types of tourists in different tourism industry development periods. thus, the nations used for comparison are those with similar levels of incomes or in close geographical locations, such as travel among developed nations or among nations in the euro-mediterranean region. under these circumstances, the analysis can reduce the impact of two essential factors in the gravity model, income and distance, to their minimum. the effect of other particular factors that attract tourists can thus be presented clearly from the egm analysis. furthermore, past research has used the egm to analyse the attraction of each tourism destination to different nations. 
there is no study in the current literature that utilizes the egm to analyse a policy factor imposed by other nations with an impact on the destination nation. thus, a policy factor imposed by china that influences inbound tourists to taiwan is included in the egm employed in this study. moreover, the typical factors of the gravity model, namely, the gdp and population of the visiting nations and the destination nation and the distance between visiting nations and the destination nation, are included. the model is extended to contain factors s_ij,t evaluated by the wef for various years that have been deemed to have relatively prominent performance for taiwan since . in this study, tourist_ij,t is the total number of tourists from nation or region i to destination nation or region j in year t. (the number of tourists visiting taiwan from these nations constitutes % of the total number of tourists visiting taiwan, according to the unwto ( ).) the incomes and populations of the visiting nation and the destination nation in certain time periods are c_gdp_i,t, t_gdp_j,t and c_pop_i,t, t_pop_j,t, respectively. distance_ij is the distance between each visiting nation or region i and the destination nation or region j. normally, the distance between the two will not change; thus, distance does not vary over time t. when more provinces are approved to travel in groups or individually to taiwan under the open-door policy, there will potentially be more tourists visiting taiwan. thus, the actual number of tourists from china in a specific time period, touristchina_t, the policy factor mentioned above, is used as a proxy for the degree of openness of this policy. the estimated coefficient of touristchina_t measures the competitiveness or complementarity of tourists from china with those from all other nations. the general egm used in this study is presented in eq.
( ): to observe the impacts of the open-door policy on the number of tourists and tourism revenue from other nations, the magnitude of the variable touristchina_t indicates the degree of openness. since the observation focuses on china's policy and its impact on all other nations visiting taiwan, the population and gdp for taiwan vary by year. moreover, under the same conditions, distance is a key factor in the selection of tourist destinations. normally, a short distance between the visiting nation and the destination nation is considered more advantageous than a long distance between the two nations (kozak et al., ; nicolau & mas, ). the other extended variables used in the egm include english as the official and/or national language in the visiting nation (language_i). the relative advantage indices for taiwan presented in the ttci include "malaria incidence," "primary education enrolment rate," and "purchasing power parity." however, the above indices used in the ttci are for universal comparison purposes. because compulsory education in taiwan is junior high school (grade or in some nations) and approximately % of junior high school pupils continue their education to senior high school, the senior high school enrolment rate shows little variation and is difficult to compare over time. therefore, the "university rate" (university_rate) is used for this variable. however, the results of the preliminary test indicate that this variable is highly correlated with taiwan's gdp. thus, university_rate is dropped from further analysis. in place of purchasing power parity, the consumer price index (cpi) is used for this variable (arsad & johor, ; craigwell, ; morley, ). malaria has not been a problem in taiwan for decades. one recent outbreak of infectious disease involved severe acute respiratory syndrome (sars), which occurred in and lasted until in taiwan.
a dummy variable is set as for the period - and for other years to detect whether the sars outbreak influenced the number of inbound tourists. the notations, definitions, and mean values of all variables used in the estimation are listed in table . once all the variables are prepared, the specific functional form of the double log is set for the egm in this study as in eq. ( ). we assume that the impact of tourists from china is not in a single direction, and the total effect on the number of tourists from other nations is a combination of the linear and squared terms of the number of tourists from china, ln(touristchina). thus, the quadratic form allows us to consider the possible existence of nonlinearities in the effect of the number of tourists from china. although the combination of nations and time periods is typical for panel data, this analysis can theoretically be carried out using a fixed-effect or random-effect model. however, some variables, such as distance (ln(distance)) and language (language), do not change for any nation over time. thus, the fixed-effect model is not appropriate when these variables are included (liu, lai, & chen, ; prehn, brümmer, & glauben, ). as such, the random-effect model is favoured under these circumstances. the estimated coefficients from the random-effect model for eq. ( ) are listed in table . the results of the estimation show that except for the population of taiwan (ln(t_pop)), ln(t_cpi), and sars, all other variables are significant at different significance levels. moreover, the effects of these significant variables on the number of tourists from the nations to taiwan are consistent with our expectations. there are fewer tourists to taiwan from the farthest nations. nations with high gdps have more outbound tourists visiting taiwan, and tourists from countries where english is the official and/or national language visit taiwan more often than those from other countries.
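the double-log egm of eq. ( ) described above can be sketched as follows. this is a hedged reconstruction, not the paper's typeset equation: the coefficient symbols are ours, with the coefficients of the linear and squared ln(touristchina) terms written b_1 and b_2:

```latex
\begin{aligned}
\ln(tourist_{ij,t}) ={}& \beta_0
 + \beta_1 \ln(c\_gdp_{i,t}) + \beta_2 \ln(t\_gdp_{j,t})
 + \beta_3 \ln(c\_pop_{i,t}) + \beta_4 \ln(t\_pop_{j,t}) \\
 &+ \beta_5 \ln(distance_{ij}) + \beta_6\, language_i
 + \beta_7 \ln(t\_cpi_{t}) + \beta_8\, sars_t \\
 &+ b_1 \ln(touristchina_t) + b_2 \bigl[\ln(touristchina_t)\bigr]^2
 + \varepsilon_{ij,t},
\end{aligned}
```

with b_1 estimated negative and b_2 positive (as reported below), the effect of chinese arrivals on other nations' arrivals is nonlinear and can switch sign.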
the effects of the dummy variables for language (language) and the outbreak of sars (sars) are given by the magnitudes of the corresponding estimated coefficients. because all other variables are taken as natural logarithms, the effect of each such variable means that a % change in a certain variable will result in a certain percentage change in the total number of tourists from the nations other than china (ln tourist). (this situation occurs frequently in the gravity model. it normally involves variables that are constant throughout the years. thus, the distance proxy variable between two nations, a typical variable used in the gravity model, has no variation throughout the years. due to this drawback, the fixed-effect model is not appropriate for this application.) although the double log form can be used to compute the elasticity for the variable ln(touristchina), that is not the purpose here. the main purpose of this study is to observe the effect of every unit (person) change in tourists from china on the unit change in tourists from other nations. thus, the marginal effect is computed by taking the derivative of the total number of tourists from other nations (tourist) with respect to tourists from china (touristchina). to observe the impact of a one-unit change in tourists from china on the change in tourists from other nations or regions, the marginal effect (me_t) plays a role. this effect is derived from the estimation of eq. ( ), which accounts for all related factors that influence the number of tourists from all nations except those from china and the interaction between tourists from china and those from the other nations. as a result, the marginal effect is computed as in eq. ( ): me_t = (tourist_i,t / touristchina_t) × [b_1 + 2·b_2·ln(touristchina_t)], where b_1 < 0 and b_2 > 0 are the estimated coefficients of the linear and squared terms of ln(touristchina). here, tourist_i,t represents the average number of tourists from any specific nation or region of interest. the sign of the marginal effect is determined by the negative part and the positive part of eq. ( ).
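under the quadratic-in-logs specification, the marginal effect follows from differentiating tourist_i = exp(... + b1·ln(TC) + b2·(ln TC)²) with respect to TC. a minimal sketch (the coefficient values below are illustrative placeholders, not the paper's estimates):

```python
import math

def marginal_effect(tourists_i, tourists_china, b1, b2):
    """d(tourist_i)/d(touristchina) for the double-log EGM with a
    quadratic term in ln(touristchina):
        ln(tourist_i) = ... + b1*ln(TC) + b2*(ln TC)**2
    =>  dT_i/dTC = (T_i / TC) * (b1 + 2*b2*ln(TC))."""
    ln_tc = math.log(tourists_china)
    return tourists_i / tourists_china * (b1 + 2.0 * b2 * ln_tc)

# illustrative coefficients only (b1 < 0, b2 > 0, as in the estimation)
B1, B2 = -3.0, 0.12
```

with these placeholder values, the bracketed term is positive once ln(TC) exceeds −b1/(2·b2) = 12.5, i.e., for roughly e^12.5 ≈ 268,000 chinese tourists, and negative below that.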
the turning point is the number of tourists from china at which the impact switches from positive to negative or vice versa. that is, when the number of tourists from china is above , , an increase in tourists from china will concurrently increase the number of tourists from other nations or regions. in contrast, when the number of tourists from china is below , , an increase in tourists from china under this ceiling is associated with a competitive decrease in tourists from other nations or regions. the following analyses are employed to determine the impact of a change in the number of tourists from china on the number of tourists from the seven nations, areas, or regions. the selected nations or areas either represent the highest share of tourists among all nations, such as japan until , or are countries that have had a significant increase in the number of tourists in recent years, such as south korea and hong kong. the other nations are classified according to their geographical location. the regions are australasia, including australia and new zealand; north america, including canada and the us; europe, including france, gb, germany, italy, and the netherlands; and south and southeast asia, including india, indonesia, malaysia, the philippines, singapore, thailand, and vietnam. the marginal effect of every change in the number of tourists from china on the number of tourists from each nation, area, or region can be computed from eq. ( ) based on the mean number of tourists from each nation, area, or region, tourist_i,t, in the last three years. the results are presented in part b of table . the marginal effects for japan, south korea, hong kong, australasia, north america, europe, and south and southeast asia are . , . , . , . , . , . , and . , respectively. the results indicate that the number of tourists from japan will increase by . for each additional tourist from china. the explanation for the other marginal effects is the same.
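the threshold described here is where the bracketed term b1 + 2·b2·ln(TC) changes sign, where b1 and b2 denote the estimated coefficients of the linear and squared ln(touristchina) terms; solving gives TC* = exp(−b1/(2·b2)). a short sketch with hypothetical coefficients (the paper's estimates are not reproduced here):

```python
import math

def turning_point(b1, b2):
    """Number of Chinese tourists TC* at which the marginal effect on
    other nations' arrivals switches sign: solve b1 + 2*b2*ln(TC) = 0,
    giving TC* = exp(-b1 / (2*b2))."""
    return math.exp(-b1 / (2.0 * b2))

def effect_is_complementary(tourists_china, b1, b2):
    """True when more Chinese tourists also mean more tourists from
    other nations, i.e., arrivals are above the turning point."""
    return tourists_china > turning_point(b1, b2)
```

for example, with the hypothetical values b1 = −3.0 and b2 = 0.12, the turning point is e^12.5, roughly 268,000 chinese tourists per year.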
the positive marginal effect implies that there is no competitiveness between tourists from china and those from any other nation, area, or region stated above. this is because the variable representing the number of tourists from china, ln(touristchina), has a negative linear term and a positive squared term. the turning point for this curve is , tourists from china. when the number of tourists from china falls below this number, the number of tourists from the above seven nations or regions will also decrease. without considering the effect of china, the actual rate of increase (decrease) in the number of tourists from each nation or region i for the last three years can be computed by taking the average rate between and (denoted as γ_i). the results are presented in part a of table . table shows the significant decrease in the number of tourists from china in and . thus, it is crucial to determine the potential impact on the number of tourists from other nations or regions due to this noticeable decline in the number of tourists from china over the next few years. this can be accomplished by simulating the number of tourists from the above seven nations and regions. the rate of increase in tourism between two particular years has higher variation than the average rate of change in the number of tourists over the successive years - . the use of the average rate provides a relatively reliable tourist change rate. thus, it is assumed that the rate of increase (decrease) in the number of tourists for each nation or region is the same as the average rate in - , γ_i, as in eq. ( ). (notes to table : data on the average rate of increase in tourism in - were computed based on data obtained from the unwto ( ); numbers with *, **, and *** indicate that the estimated coefficients are significantly different from zero at the %, %, and % significance levels, respectively; the numbers in parentheses are the standard deviations of the corresponding estimated coefficients.)
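the constant-rate simulation amounts to compounding each nation's average growth rate γ_i forward year by year. a minimal sketch (the function and argument names are ours):

```python
def project_tourists(latest, gamma, years):
    """Project arrivals forward for `years` periods, assuming the
    average annual rate of increase (decrease) gamma stays constant:
        N_{t+1} = N_t * (1 + gamma)."""
    path = []
    n = float(latest)
    for _ in range(years):
        n *= 1.0 + gamma
        path.append(n)
    return path
```

for example, `project_tourists(100, 0.5, 2)` yields `[150.0, 225.0]`; a negative gamma produces the declining paths simulated for chinese arrivals.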
the simulated rate of increase for the other individual nations or regions is positive and has high variation. the highest increase rate is for south korea, and south and southeast asia is ranked second. the lowest increase rate for tourists visiting taiwan is for hong kong. due to the extreme decrease in tourists from china, the total number of tourists from china has a share of . % under the simulated rate for each nation or region. this share is far below the . % in shown in table . this means that tourists visiting taiwan mainly come from nations or regions other than china, with an increase in the rate of tourism for each nation or region. the simulation can also be accomplished by accounting for the marginal effect at the - level based on a change in the number of tourists from china. thus, the simulated number of tourists for each nation or region is computed as in eqs. ( )-( ) and eqs. ( )-( ). data on the length of stay were obtained from the taiwan statistics database of the taiwan tourism bureau ( ). the results of the simulation for total tourism revenue consider both the average rate of increase in tourists and the corresponding marginal effect of each nation or region (table ). the actual number of tourists from china in is the highest on record. the average simulated numbers of tourists from all nations and regions for - are shown in table for reference. the share of inbound tourists from china in was approximately %, and that of the remaining seven nations was %. the situation reverses in - , when the number of tourists from china continues to decrease and accounts for only %, whereas the share of the other seven nations or regions increases to more than %. the discrepancy between the actual and simulated numbers of tourists from the other seven nations and regions is shown in figs. - . all the figures show that most of the simulated total numbers of tourists from each nation and region have similar patterns, with two exceptions.
that is, the simulation of the total number of tourists based upon the average increase (decrease) rate of - is higher than the number simulated by the marginal effect of each nation or region from the change in the number of tourists from china for the coming four years, - . furthermore, all figures demonstrate that the influence of tourists from china through marginal effects on the other nations or regions can be divided into two groups. one group includes the total number of tourists visiting taiwan from japan, hong kong, australasia, some nations in europe, and nations in north america, and the other group includes south korea and various nations in the south and southeast asia regions. the dotted line in each figure shows that the number of tourists from the first group of nations or regions is affected to different degrees by a decline in the number of tourists from china, depending upon the curvature of the line. the number of tourists from south korea and the south and southeast asia regions will continue to increase and will not be affected by a decline in the number of tourists from china. regardless of whether there is an increase or decrease in the number of tourists from any nation or region, we are concerned with determining whether tourism revenue might change. to compute the tourism revenue for the next years simulated by either method, data on daily expenditures and length of stay are required. these data were obtained from the taiwan tourism bureau, republic of china (taiwan) (various years) and the taiwan statistics database of the taiwan tourism bureau ( ). data on daily expenditures were obtained from a routine survey, and data on the length of stay were included in a long-term record compiled by the taiwan tourism bureau. table lists the last three years of data on daily expenditures and length of stay.
the weighted daily expenditure is computed for each region composed of more than nations. the daily expenditure deflated by the cpi is computed for china and the other seven nations and regions for the last years ( - ). we then compute the average of three years of daily expenditures, as shown in table . the annual tourism revenue for a specific nation can be obtained by multiplying the daily expenditure, the length of stay, and the total number of tourists in a year. each component must be calculated before the corresponding tourism revenue is computed for the simulated years, - . we assume that daily expenditures are inflated at the same rate as in - . the inflated daily expenditures are shown in part b of table . as for the length of stay, it is assumed that tourists from china and from the seven other nations and regions stay as long as the average number of days for the - period, which is shown in table . total tourism revenues for china and the other seven nations and regions are then calculated by multiplying part b (for ) and part c of table by the simulated number of tourists obtained by considering the average rate of increase for the period - , which is shown in part a of table . a similar procedure is used to calculate tourism revenue for the simulated number of tourists accounting for the average marginal effect of - , which is shown in part b of table . the simulated tourism revenue results for china and all other nations and regions are shown in table . the results show that the tourism revenue from china in the coming four years represents only % of the total tourism revenue of taiwan. the amount is million us$ less than that in and million us$ less than the average for the period - . these results indicate that tourism revenue from china is consistently declining due to the noticeable decrease in the number of tourists.
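the revenue arithmetic described above (number of tourists × daily expenditure × average length of stay, with expenditures carried forward at a constant inflation rate) can be sketched as follows; the function names are ours:

```python
def inflate(daily_spend, rate, years):
    """Carry a daily expenditure forward at a constant inflation rate,
    mirroring the assumption that spending inflates at its recent rate."""
    return daily_spend * (1.0 + rate) ** years

def tourism_revenue(tourists, daily_spend, stay_days):
    """Annual tourism revenue for one nation or region:
    number of tourists * daily expenditure * average length of stay."""
    return tourists * daily_spend * stay_days
```

for instance, 1,000,000 tourists spending 200.0 us$ per day over an average stay of 5.0 days yield a revenue of 10^9 us$.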
however, the simulated tourism revenue from the other seven nations and regions for - is million us$ higher than that in and million us$ higher than the average for the - period. although the number of tourists for most of the nations and regions is concurrently declining because of the interaction through marginal effects in the decrease in the number of tourists from china, the number of tourists from south korea and south and southeast asia will continuously increase in the next four years. this increase in the number of tourists will lead to an increase in the corresponding tourism revenue from south korea and south and southeast asia. the results shown in table are used in fig. to compare the number of tourists from china and the share in its high peak year, , and all four simulated years, - , with those from all other sovereign nations and regions. a similar comparison can be conducted for tourism revenue and its corresponding share to total tourism revenue in the same year for the nations and regions. we find that the actual total number of tourists from china decreases rapidly from its highest point in - (the year for which the most recent data are available) by approximately %. if the number of tourists from china continues to decline (as the simulated results indicate it will), the share of tourists from china will decrease to % by . the corresponding tourism revenue will then drop from % of total tourism revenue in taiwan in to % in . however, this decline will decrease the number of tourists travelling to taiwan from some nations and regions but will increase the number of tourists from south korea and many nations from south and southeast asia. that is, the negative effect of suppressing both the number of tourists and tourism revenue from some nations and regions is offset by the positive effect of promoting tourists and tourism revenue from other nations. 
the open-door policy implemented by china in significantly increased the number of tourists from china visiting taiwan. in , the number of tourists from china reached its highest level and its largest share, %, of all tourists to taiwan. however, this policy factor, which carries strong political connotations, reversed direction when the ruling party in taiwan changed in . because the policy once brought a large number of tourists to taiwan, it is important to identify its impact when china operates it in the opposite direction. this impact concerns not only the number of inbound tourists, through competitiveness or complementarity between china and the other major nations whose tourists visit taiwan, but also the amount of tourism revenue. simulation is employed to observe the impact of manipulating the open-door policy for - . the results show that if the number of tourists from china is above , , then the number of tourists from the other individual nations or groups of nations will also increase. it may seem optimistic to expect more tourists from the other nations when there are more tourists from china under its open-door policy toward taiwan. however, taiwan will inevitably face fewer tourists from all other nations (i.e., japan, hong kong, australasia (new zealand and australia), north america (canada and the us), and europe (france, germany, gb, italy, and the netherlands)) as china reduces its number of tourists. among the major nations visiting taiwan, only the numbers of tourists from south korea and from south and southeast asia will consistently increase regardless of whether the number of tourists from china is above or below , . similar results are found for the change pattern of tourism revenue. it is thus difficult for taiwan to count on goodwill from china that would allow more tourists to visit taiwan and complementarily bring more tourists from other nations. 
the results clearly indicate that taiwan must identify the reasons for the increase in tourists from all other nations. to minimize the impact of china's open-door policy on the number of tourists from all other nations (regions), the best strategy for taiwan is to promote different factors to attract tourists from nations other than china. if the current preparation and arrangement of travel and tourism facilities is specifically designed or developed for china because of its large number of inbound tourists, then other nations have the opportunity to use these facilities only incidentally. this makes other nations mere spillover beneficiaries of travel to taiwan, which is not an effective way of developing taiwan's tourism industry in the long term. because inbound tourists from different nations have different preferences and tastes regarding tourism facilities and installations, such as hotels, motels, and public transportation, preparing different types of hardware and software facilities suitable for tourists from different nations around the world is essential. relying on a policy imposed by other nations to bring taiwan an abundant number of tourists is an unwise and passive decision. the development and improvement of travel and tourism facilities for tourists from different nations is a constructive way to build a competitive relationship, in terms of both the number of tourists and tourism revenue, between china and other nations. this study has some methodological limitations. first, the distance between the capital of taiwan and that of a specific country is a proxy variable for travel cost and is constant over time; thus, the model used here, like most other gravity model applications, cannot take into account the effect of travel cost variation for travel and tourism to different destinations. 
if data for flight routes to different destinations were available, travel costs based on the fuel use of aircraft travelling at different times to different nations might replace the current constant distance variable between capitals; however, constructing this variable would place high demands on data. second, the estimated coefficients of variables in the conventional gravity model only represent mean effects on tourist numbers and fail to capture out-of-average differentiations. an advanced method, such as quantile regression, is a possible solution to this problem and could be used in future research. je-liang liou, pei-chun hsu, and pei-ing wu brainstormed together to arrive at this topic. je-liang liou contributed all of the software management for the estimations. pei-chun hsu applied her specialty to all the computations of the number of tourists and the associated tourism revenue of each nation and/or region for all the scenarios simulated in this study. pei-ing wu framed each section of this manuscript and wrote the draft of the full manuscript. the substantive analytical content, such as the design of the simulation scenarios and the use of the estimation outcomes, emerged from frequent back-and-forth discussion among the three authors. pei-chun hsu has a degree in business administration and another degree in agricultural economics. her major work is in administration at the accounting office at national taiwan university (ntu), involving accounting and auditing budget control. before joining the accounting office at ntu, she worked in the semiconductor industry, in charge of financial affairs and negotiations with all types of firms. this background aligns with her role in managing the tourism revenue computations and arranging the various types of calculation. 
pei-ing wu is a full professor in the department of agricultural economics, national taiwan university (ntu), and has taught at ntu for nearly years. dr. wu has completed many studies in environmental evaluation involving various types of environmental goods and services. her other specialties include natural resource economics, consumer economics, quantitative methods, and research methodology. she has published more than journal articles and book chapters on these topics in both mandarin and english. she is also an author or co-author of several books on her specialties. each year, she conducts at least one project, and the implementation of these projects leads to good connections and interactions with graduate students. 
developing and testing a tourism impact scale 
fiji's tourism demand: the ardl approach to cointegration 
estimating european tourism demand for malaysia 
an international trade flow model with substitution: an extension of the gravity model 
travel & tourism competitiveness: a study of world's top economic competitive countries 
language and tourism 
tourism competitiveness in small island developing states. world institute for development research 
market shares analysis: the case of french tourism demand 
determinants of international tourism: a three-dimensional panel data analysis 
tourism destination competitiveness: a quantitative approach 
tourism core and created resources: assessment on travel and tourism competitiveness index (ttci) ranking and tourism performance 
gravity analysis of tourism flows and the 'multilateral resistance to tourism'. current issues in tourism 
the influence of sociopolitical, natural, and cultural factors on international tourism growth: a cross country panel analysis 
the role of transport infrastructure in international tourism development: a gravity model approach 
competitiveness of overseas pleasure destinations: a comparison study based on choice sets 
intracultural variance of chinese tourists in destination image project: case of queensland 
competitiveness indices and developing countries: an economic evaluation of the global competitiveness report 
how are jobs and ecosystem services linked at the local scale? 
trade effects of regional trade agreements on taiwan: an empirical study using gravity model 
modeling caribbean tourism demand: an augmented gravity approach 
statistics: preliminary statistics of cross-strait economic relations 
factors of tourism's competitiveness in the european union countries 
determinants of tourist arrivals in european mediterranean countries: analysis of competitiveness 
statistics: . foreign residents by nationality 
the use of cpi for tourism prices in demand modelling 
tourism and urban economic growth: a panel analysis of german cities 
the influence of distance and prices on the choice of tourist destinations: the moderating role of motivations 
an extended gravity model: applying destination competitiveness 
gravity model estimation: fixed effects vs. random intercept poisson pseudo-maximum likelihood 
tourism demand modelling and forecasting: a review of recent research 
tourism yearbook 
inbound visitors: length of stay 
small sample evidence of the tourism-led growth hypothesis in lebanon 
annual survey report on visitors expenditure and trends in taiwan 
cultural effects on inbound tourism into the usa: a gravity approach 
cultural connectedness and visitor segmentation in diaspora chinese tourism 
the travel & tourism competitiveness report 
the global competitiveness report 
the global competitiveness report 
the global competitiveness report 
the global competitiveness report 
the global competitiveness report 
the global competitiveness report - 
the global competitiveness report 
the global competitiveness report 
the global competitiveness report 
the global competitiveness report 
why tourism? geneva: world tourism organization. available at: . (accessed ) 
why tourism? geneva: world tourism organization 
world tourism barometer. geneva: world tourism organization 
yearbook of tourism statistics dataset 
the impacts of cultural values on bilateral international tourist flows: a panel data gravity model 
liou's specialties include resource & environmental economics and production economics. his research interests include cost-benefit analysis (cba); non-market evaluation; efficiency and performance measurement; and ghg policy assessment. dr. liou has published more than academic articles (in mandarin and english). 
we sincerely appreciate the generous and kind provision of the data used in this study by the world tourism organization (unwto) and the taiwan tourism bureau, republic of china (taiwan). without the support of these data, the accomplishment of this study would not have been possible. 
(table : total tourism revenue from nations or regions.) 
key: cord- -sfr x ob authors: röst, gergely; bartha, ferenc a.; bogya, norbert; boldog, péter; dénes, attila; ferenci, tamás; horváth, krisztina j.; juhász, attila; nagy, csilla; tekeli, tamás; vizi, zsolt; oroszi, beatrix title: early phase of the covid- outbreak in hungary and post-lockdown scenarios date: - - journal: viruses doi: . /v sha: doc_id: cord_uid: sfr x ob the covid- epidemic has been suppressed in hungary owing to timely non-pharmaceutical interventions, prompting a considerable reduction in the number of contacts and in transmission of the virus. this strategy was effective in preventing epidemic growth and in reducing the incidence of covid- to low levels. in this report, we present the first epidemiological and statistical analysis of the early phase of the covid- outbreak in hungary. we then establish an age-structured compartmental model to explore alternative post-lockdown scenarios. we incorporate various factors, such as age-specific measures, seasonal effects, and spatial heterogeneity, to project the possible peak size and disease burden of a covid- epidemic wave after the current measures are relaxed. a cluster of pneumonia cases of unknown origin was detected in december in wuhan city, the capital of hubei province, china, a city with a population of million. on december, china alerted the world health organization (who) china country office [ ] . on january, the causative pathogen of the pneumonia outbreak was identified as a novel coronavirus, and, on february, the who officially named the novel coronavirus sars-cov- and the disease it causes covid- . sars-cov- infection quickly spread from china, where it emerged in december , to europe, where the first cases were confirmed on january in france (where, later in april, covid- was retrospectively confirmed in a patient hospitalized in late december ) [ , ] . 
around the same time, on january, the first infection in germany was confirmed in bavaria, leading to a local outbreak. by february , subsequent cases had been confirmed and high-risk contacts had been identified via agile contact-tracing [ ] . the first epidemic in europe started in the lombardy region of italy, with the first detection on february [ ] . the who director-general declared the covid- outbreak a public health emergency of international concern under the international health regulations ( ) on january [ ] and then a pandemic on march [ ] . by that time, the number of daily new cases of covid- was over in several countries, including italy, france, and germany. r was used for data manipulation, and shiny version . . . [ ] for creating an interactive dashboard to carry out epidemiological analyses online (available in hungarian [ ] ). the full source code of this dashboard and related analysis is available at [ ] . the effective reproduction number (r t ), the average number of secondary cases per primary case for those primary cases who become infectious on day t, was tracked in real time based on the daily number of reported new cases using the methods of cori et al. [ ] and of wallinga and teunis [ ] , among the several methods aimed at estimating r t [ , ] . in brief, the method of cori et al. is based on calculating the ratio of the actual number of infections on a day to the total infectiousness of all past cases on that day. thus, it measures r t by assuming that infected individuals will go on infecting in the future as if conditions remained unchanged. in contrast, the method of wallinga and teunis makes no such assumption; it uses a likelihood-based inference on the possible infection networks underlying the epidemic curve. 
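the backward-looking idea behind the cori et al. estimator can be illustrated with a minimal sketch: r_t is the incidence on day t divided by the total infectiousness of past cases, with weights given by a discretized serial-interval distribution. this is not the epiestim implementation used in the study, and the serial-interval mean and sd below are illustrative placeholders rather than the estimates cited in the text.

```python
# Minimal sketch of the instantaneous-reproduction-number idea behind the
# Cori et al. method: R_t = I_t / (total infectiousness of past cases on day t),
# with infectiousness weights w_s from a discretized gamma serial interval.
import math

def gamma_weights(mean, sd, max_days):
    """Discretize a gamma serial-interval distribution onto days 1..max_days."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    def pdf(x):
        return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)
    w = [pdf(s) for s in range(1, max_days + 1)]
    total = sum(w)
    return [v / total for v in w]

def instantaneous_r(incidence, w):
    """R_t = I_t / sum_s I_{t-s} * w_s, for each day t with positive denominator."""
    r = {}
    for t in range(1, len(incidence)):
        lam = sum(incidence[t - s] * w[s - 1] for s in range(1, min(t, len(w)) + 1))
        if lam > 0:
            r[t] = incidence[t] / lam
    return r

cases = [1, 2, 4, 8, 16, 32, 64]   # toy epidemic curve, doubling daily
rt = instantaneous_r(cases, gamma_weights(mean=4.7, sd=2.9, max_days=14))
print(rt)  # every R_t exceeds 1, as expected for exponential growth
```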
the fundamental difference is that the method of cori et al. solely uses past information (a "backward-looking" approach), which is why the result is sometimes called the instantaneous reproduction number, while the wallinga-teunis method corresponds more closely to the usual definition of the effective reproduction number; in exchange, however, it requires future information (a "forward-looking" approach). for a discussion of the relative merits of these two approaches, see [ , ] . the wallinga-teunis method was used with the addition of cauchemez et al., who aimed to provide real-time estimation capability [ ] . both methods require, in addition to incidence data, information on the serial interval. depending on the dataset used, different estimations of the serial interval have been published: a mean of . days was found in [ ] , and . days in [ ] . here, we assume an intermediate value following [ ] , where the mean and standard deviation (sd) of the serial interval were estimated at . days ( % cri: . , . ) and . days ( % cri: . , . ), respectively (the serial interval is assumed to follow a gamma distribution). they also concluded that the serial interval of covid- is close to or shorter than its median incubation period, which is coherent with our choice of parameters in the transmission dynamics model. the estimation was carried out using the r packages r version . - [ , ] and epiestim version . - [ , ] . the case fatality rate (cfr) is defined as the (conditional) probability of death from a disease for those contracting the disease (for diseases where an asymptomatic state also exists, the infection fatality rate (ifr) is defined analogously) and is estimated as the ratio of cumulative deaths to cumulative cases. this definition, i.e., ncfr_t = D_t / C_t, where c_t and d_t are the daily and C_t = sum_{i<=t} c_i and D_t = sum_{i<=t} d_i are the cumulative numbers of cases and deaths, respectively, on day t, is however biased when used during the epidemic (hence the name crude/naive cfr, or ncfr). 
the reason for this is that a proportion of the cases counted in the denominator will die (in the future), so they should have been counted in the numerator as well; as they are not, the ratio underestimates the true value [ , ] . fortunately, it is relatively easy to correct for this bias using information on the distribution of the diagnosis-to-death time [ , ] . the likelihood that the cumulative number of deaths on day t is d_t is given by an expression in which f_i denotes the (conditional) probability that death happens on day i after onset (for those who die) and π stands for the true value of the cfr. this observation allows both maximum likelihood and bayesian estimation of π using the observed series of c_i and d_i; the latter was employed in the present study, using a beta( , ) (i.e., uniform) prior. it was assumed that the diagnosis-to-death time follows a lognormal distribution with a mean of days and a standard deviation of . days, as found by linton et al. [ ] . the bayesian estimation was manually coded using the r package rstan version . . [ ] . a markov chain monte carlo approach was used to carry out the estimation with the no-u-turn sampler, using chains, warmup iterations, and iterations for each chain. the cfr mentioned in the previous subsection is still not the true value of the fraction of fatal outcomes among all infections, as there is another source of bias, this time leading to overestimation: the underascertainment of cases. this is a substantial issue, as a (precisely not yet known, but epidemiologically significant) fraction of covid- cases are asymptomatic or mildly symptomatic. since in many countries testing was extended to contacts (and in a few instances, even random sampling was carried out), the confirmed cases include some asymptomatic cases as well. 
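the delay-adjustment idea described above can be illustrated with a simple moment-style estimator: replace the naive denominator with the number of cases whose outcome is expected to be known by day t, using the diagnosis-to-death distribution. this is a sketch of the general technique, not the authors' bayesian rstan implementation, and the lognormal mean/sd values and toy case counts below are illustrative placeholders.

```python
# Delay-adjusted CFR sketch: pi_hat = D_t / sum_j c_j * F(t - j), where F is
# the CDF of the (lognormal) diagnosis-to-death time. The adjusted estimate
# is always at least as large as the naive ratio, since F <= 1.
import math

def lognormal_cdf_from_moments(mean, sd, x):
    """CDF of a lognormal parameterized by its linear-scale mean and sd."""
    if x <= 0:
        return 0.0
    sigma2 = math.log(1 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2
    z = (math.log(x) - mu) / math.sqrt(sigma2)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def corrected_cfr(daily_cases, daily_deaths, delay_mean, delay_sd):
    """Estimate the CFR using only cases with (probabilistically) known outcomes."""
    t = len(daily_cases) - 1
    known_outcomes = sum(
        c * lognormal_cdf_from_moments(delay_mean, delay_sd, t - j)
        for j, c in enumerate(daily_cases)
    )
    return sum(daily_deaths) / known_outcomes

cases = [5, 10, 20, 40, 60, 80, 90, 95]    # toy daily case counts
deaths = [0, 0, 0, 1, 1, 2, 3, 4]          # toy daily death counts
naive = sum(deaths) / sum(cases)
adjusted = corrected_cfr(cases, deaths, delay_mean=13.0, delay_sd=12.7)
print(naive, adjusted)  # the adjusted estimate is larger than the naive one
```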
however, the value of the estimated (corrected) ifr can also be used to estimate the ascertainment rate: by assuming that the ifr in reality takes a benchmark value (one derived from large-sample, well-designed studies accounting for underascertainment, or from sero-epidemiological surveys) and, crucially, assuming that the difference of the actual estimated ifr from that value is purely due to underascertainment. the ascertainment rate can then be obtained by simply dividing the assumed true value of the ifr by the actual estimated cfr [ ] . note that this might be a strong assumption, as it rules out a real difference between the country's ifr and the benchmark value; in particular, it rules out different virulence of the pathogen, a different age and comorbidity composition in the country, and a different effect of the healthcare system on survival. mathematical models have been developed to better understand the global spread [ ] and the transmission dynamics of covid- for many countries, including australia [ ] , france [ ] , germany [ ] , the uk [ ] , and the usa [ , ] . such models have been used to project the progress of the outbreak and to estimate the impact of control measures on reducing the disease burden. the two most common approaches are compartmental models formulated as systems of ordinary differential equations, and agent-based models used to generate an ensemble of stochastic simulations of possible outcomes. here, we establish a compartmental population model, adjusted to the specific characteristics of covid- , considering the following compartments. we denote by s the susceptibles, i.e., those who can be infected by the disease. latents (l) are those who have already contracted the disease but do not show symptoms and are not infectious yet. 
in accordance with studies indicating that viral shedding peaks before the onset of symptoms [ ] , we have introduced in our model the presymptomatic infected compartment i_p for those who do not yet have symptoms but are already capable of transmitting the disease to susceptibles. we divided the latent period into two compartments l_1 and l_2; thus, together with i_p, the incubation period follows a hypoexponential distribution, with a shape matching empirical observations [ , ] . since a large fraction of the infected show only mild or no symptoms, after the incubation period we differentiate these individuals from those with symptoms. we assume a gamma-distributed infectious period with erlang parameter m = , similar to the sars study [ ] ; hence, we have three classes each for asymptomatic and symptomatic infectious individuals (i_{a,1}, i_{a,2}, i_{a,3} and i_{s,1}, i_{s,2}, i_{s,3}, respectively). individuals from the i_{a,3} compartment will all recover and hence proceed to the recovered class r. immunity is assumed for those who have recovered from the disease, at least on the time scale of this modeling. individuals from i_{s,3} may either recover without requiring hospital treatment (and thus move to r) or become hospitalized. it is of crucial importance to project the number of hospital beds and intensive care unit (icu) beds needed; thus, in the model, we further differentiate symptomatically infected individuals who need hospital care and critical care, denoted by i_h and i_c, respectively. we operate under the assumption that the healthcare system will not be overwhelmed, and thus disease-induced death is only considered from critical care, which fits the data obtained from nphc. hence, individuals from i_h will proceed to r after recovery. those from i_c with a fatal outcome transit to the d compartment. those who are out of icu and on the path to recovery are collected into the i_{cr} class, from which they eventually recover and move to the r class. 
to take into account the different characteristics of the disease in various age groups, we stratified the hungarian population into seven groups, corresponding to the available choices in the hungarian online questionnaire for the assessment of changes in the number of contacts following the lockdown [ ] . the compartments listed above corresponding to the different age groups are denoted by an upper index i ∈ {1, . . . , 7}. accordingly, all of our parameters can be calibrated age-specifically. the transmission rates from age group k to age group i are denoted by β^{(k,i)}_j, with j ∈ {p, a, s}, where the three subscripts p, a, s stand for presymptomatic, asymptomatic, and symptomatic infected, respectively. the parameters described in the following all have an upper index i, which stands for the corresponding age group. a fraction p^i of exposed people will not show symptoms during their infection, while the remaining fraction (1 − p^i) will develop symptoms. the average length of the incubation period is (α^i_{l,1})^{-1} + (α^i_{l,2})^{-1} + (α^i_p)^{-1} days, with the transition rates α^i_{l,1}, α^i_{l,2}, α^i_p, respectively. similarly, the average infectious periods of asymptomatic and symptomatic infected individuals are given by the sums of the reciprocals of the corresponding transition rates. a fraction h^i of the infectious compartment i^i_{s,3} will be hospitalized; the remaining fraction (1 − h^i) will recover without hospital care. out of those who need hospitalization, a fraction ξ^i needs intensive care. for the hospitalized classes i^i_h, i^i_c, i^i_{cr}, the average times spent in these compartments are given as (γ^i_h)^{-1}, (γ^i_c)^{-1}, and (γ^i_{cr})^{-1}, respectively. a fraction µ^i of those leaving the i^i_c compartment will die due to the disease, while the remaining fraction will proceed to the i^i_{cr} class. the transmission dynamics of our model for one age group is illustrated in figure . reproduction numbers are calculated using the next generation matrix method in section . . 
we discuss the application and then some limitations of this model in sections . . and . . , respectively. the codes were implemented in wolfram mathematica and are available at [ ] . the governing equations of the transmission model described in section . form a system of ordinary differential equations in which the index i ∈ {1, . . . , 7} represents the corresponding age group. next, we add the spatial locations of the population to the previous model. the population is divided into patches, where each patch represents a separate geographic region. within each region, we use the same compartmental model (but possibly with different parameters), and we also include spatial movement of individuals between the patches. the governing equations of such a metapopulation model carry an additional index p ∈ {1, 2, . . . , #patches}. we have chosen our model parameters based on a comprehensive literature review and present them here, except for the transmission rates β^{(k,i)}_s, which are left for section . . . for the incubation period, we assume a hypoexponential (generalized erlang) distribution with parameters ( . , . , ). this way, the average incubation period is . days: the same length and a very similar shape of the probability distribution function were estimated in [ ] , and this distribution has the observed concavity properties as well (see [ ] ). in addition, this estimation is consistent with [ ] , and such values have been used in [ , , , ] . the first . days are the latent period [ ] and the last two days are the presymptomatic period [ ] , when transmission is already possible at a similar rate as at symptom onset [ ] . therefore, we use the same transmission rates for the presymptomatic and symptomatic infectious periods. for the transmission rate of asymptomatic infected individuals, we use a reduction factor of . [ , , ] . 
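as a rough illustration of the governing equations for a single age group, the compartmental structure described above (s, l_1, l_2, i_p, three asymptomatic and three symptomatic stages, i_h, i_c, i_{cr}, r, d) can be integrated with a simple euler scheme. all rates and fractions below are illustrative placeholders, not the calibrated values of the study.

```python
# Single-age-group sketch of the compartmental model, Euler-integrated.
# Placeholder rates: 2.5-day latent period split over two stages, 2-day
# presymptomatic stage, 3-day Erlang(3) infectious period.

def step(y, beta, dt=0.1, N=1_000_000,
         alpha_l1=2/2.5, alpha_l2=2/2.5, alpha_p=1/2,   # latent + presympt. rates
         gamma=1.0,                                      # per-stage infectious rate
         p=0.33, asympt_rel_inf=0.5, h=0.05, xi=0.2, mu=0.4,
         gamma_h=1/10, gamma_c=1/10, gamma_cr=1/10):
    S, L1, L2, Ip, Ia1, Ia2, Ia3, Is1, Is2, Is3, Ih, Ic, Icr, R, D = y
    force = beta * (Ip + Is1 + Is2 + Is3 + asympt_rel_inf * (Ia1 + Ia2 + Ia3)) / N
    dS  = -force * S
    dL1 = force * S - alpha_l1 * L1
    dL2 = alpha_l1 * L1 - alpha_l2 * L2
    dIp = alpha_l2 * L2 - alpha_p * Ip
    dIa1 = p * alpha_p * Ip - gamma * Ia1          # asymptomatic chain
    dIa2 = gamma * (Ia1 - Ia2)
    dIa3 = gamma * (Ia2 - Ia3)
    dIs1 = (1 - p) * alpha_p * Ip - gamma * Is1    # symptomatic chain
    dIs2 = gamma * (Is1 - Is2)
    dIs3 = gamma * (Is2 - Is3)
    dIh  = h * (1 - xi) * gamma * Is3 - gamma_h * Ih   # ward admissions
    dIc  = h * xi * gamma * Is3 - gamma_c * Ic         # ICU admissions
    dIcr = (1 - mu) * gamma_c * Ic - gamma_cr * Icr    # post-ICU rehabilitation
    dR   = gamma * Ia3 + (1 - h) * gamma * Is3 + gamma_h * Ih + gamma_cr * Icr
    dD   = mu * gamma_c * Ic                           # deaths only from ICU
    deriv = (dS, dL1, dL2, dIp, dIa1, dIa2, dIa3, dIs1, dIs2, dIs3,
             dIh, dIc, dIcr, dR, dD)
    return tuple(v + dt * dv for v, dv in zip(y, deriv))

y = (1_000_000 - 10, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
for _ in range(int(200 / 0.1)):   # simulate 200 days
    y = step(y, beta=0.5)
print(f"final recovered fraction: {y[13] / 1_000_000:.2f}")
```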
for the length of the infectious periods (both symptomatic and asymptomatic), we assume a gamma distribution with erlang parameter (coherent with the sars study [ ] ) and an average length of days of infectivity. although full recovery and viral shedding may take much longer, the infectiousness over the course of the infection is mostly concentrated in this period [ , ] . the choice of days is also justified by [ , ] , who estimated that around % of transmissions occur during the presymptomatic period, and it is also within the range of infectious periods used by [ , ] . the average stay in hospital is assumed to be days, in accordance with the seven-day median reported in [ ] using data on over , patients in the uk. similarly, the average duration of critical care is assumed to be days, in accordance with the intensive care national audit & research centre (icnarc) report [ ] . very similar numbers were reported in the us [ ] and were used in other modeling studies [ , , ] . for those who recover from intensive care, we assumed a -day hospitalized rehabilitation period. the periods above, associated with the average time an individual spends in each compartment over the course of the infection, are age-independent and summarized in table . 
table . age-independent epidemiological parameters of covid- , assumed to be valid for all age groups (references and explanations are in section . . ): 
incubation period: (α^i_{l,1})^{-1} + (α^i_{l,2})^{-1} + (α^i_p)^{-1} = . days 
latent period: (α^i_{l,1})^{-1} + (α^i_{l,2})^{-1} = . days 
presymptomatic (infectious) period: (α^i_p)^{-1} = . days 
infectious period of i^i_a: (γ^i_{a,1})^{-1} + (γ^i_{a,2})^{-1} + (γ^i_{a,3})^{-1} = . days 
infectious period of i^i_s: (γ^i_{s,1})^{-1} + (γ^i_{s,2})^{-1} + (γ^i_{s,3})^{-1} = . days 
hospitalization period: (γ^i_h)^{-1} 
presymptomatic vs. symptomatic transmission rate: β 
next, we discuss the age-specific parameters, which are mostly related to the outcomes of infection. we stratified the population into the following seven age groups: - , - , - , - , - , - , + years old. 
using the data from the hungarian central statistical office (ksh), we obtain the division shown in table . according to [ ] , a fraction . of infected children (under years old) are asymptomatic or mild cases. this value was used in [ ] as well. we set the probabilities of the infection following a mild or asymptomatic course in an individual according to weitz et al. [ ] . the probabilities of hospitalization given infection, h^i, and of additionally requiring intensive care, ξ^i, are based on the work of moss et al. [ ] . the ratios of fatal outcomes µ^i are derived from the icnarc report [ ] comprising icu case reports from the uk. all these age-dependent parameters are listed in table . for creating our contact matrix m_cont, we have utilized the work of prem, cook, and jit [ ] , where the estimated matrices are given for age groups, namely - , - , . . . , - , +. as we have divided the hungarian population into seven age groups (see table ), we aggregated this higher-resolution data. first, we derived a symmetric matrix m_total, with elements computed from m = [m_{i,j}], the original contact matrix, and [p_i], the age distribution of hungary for the same age groups as in [ ] . thus, m_total contains the total number of contacts among age groups in its upper triangular part (with values relative to the contact pattern in m). the total number of contacts w.r.t. the age distribution used in our work is then obtained by summing up the corresponding elements of this matrix of size × , resulting in m_total_cont of size × . finally, dividing each column of m_total_cont element-wise by the aforementioned population vector yields the following × contact matrix; for more insight, we include its heatmap in figure . additional technical details are to be found in our source code, available at [ ] . recall that we have assumed presymptomatic patients, who are members of the classes i^i_p, to be as infectious as symptomatic patients. 
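the aggregation procedure described above (symmetrize per-capita contacts into total contacts using population weights, sum fine age groups into coarse blocks, then divide back by the target population vector) can be sketched as follows. the symmetrization used here, averaging reciprocal contact totals, is one standard choice and is an assumption, since the study's exact formula is not reproduced in this text; the toy 4-group-to-2-group aggregation stands in for the 16-to-7 one.

```python
# Contact-matrix aggregation sketch: per-capita rates -> total contacts
# (symmetrized) -> coarse blocks -> per-capita rates for coarse groups.

def aggregate_contacts(m, pop, groups):
    """m[i][j]: per-capita contacts of group i with group j; pop: populations;
    groups: list of index lists defining the coarse partition."""
    n = len(pop)
    # total contacts between fine groups, symmetrized by averaging reciprocals
    total = [[(m[i][j] * pop[i] + m[j][i] * pop[j]) / 2 for j in range(n)]
             for i in range(n)]
    # sum totals into coarse blocks
    coarse_total = [[sum(total[i][j] for i in gi for j in gj) for gj in groups]
                    for gi in groups]
    coarse_pop = [sum(pop[i] for i in g) for g in groups]
    # back to per-capita rates for the coarse groups
    return [[coarse_total[a][b] / coarse_pop[a] for b in range(len(groups))]
            for a in range(len(groups))]

m = [[4, 1, 1, 0], [1, 5, 2, 1], [1, 2, 6, 2], [0, 1, 2, 3]]  # toy rates
pop = [100, 200, 300, 400]                                    # toy populations
coarse = aggregate_contacts(m, pop, groups=[[0, 1], [2, 3]])
print(coarse)
```

a useful sanity check on the result is reciprocity: the total contacts reported by coarse group a with group b must equal those reported by b with a once multiplied by the respective population sizes.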
in addition, patients with no or mild symptoms (those in i^i_a) possess a transmission coefficient half of the baseline. thus, our task is to give reasonable estimates for the rates β^{(k,i)}_s corresponding to the transmission rate from symptomatic individuals of age group k to group i. to that end, we follow the terminology and techniques of [ ] to compute the next generation matrix (ngm) and the baseline transmission rate β_0. finally, the desired coefficients are obtained by taking into account the relative contact rates between age groups via the contact matrix presented in section . . . we note that the probabilities p^i have a special role during the ngm computations, as their effect is what ultimately specializes the resulting transmission rate matrix to covid- . first, let us consider the infectious subsystem of ( ), namely the equations describing the compartments l^i_1, l^i_2, i^i_p, i^i_{a,·}, and i^i_{s,·}, for i ∈ {1, . . . , 7}. linearizing this w.r.t. the disease-free equilibrium yields the linearized infectious subsystem, written in terms of the matrices t and Σ, which are referred to as the transmission part and the transitional part, respectively; the state is described by the vector of the linearized infectious compartments. recall that the transmission matrix t contains the transmission terms, whereas the transitional matrix Σ is block diagonal, with one block per age group. then, the ngm with large domain is given by k_l = −t Σ^{−1}, and the ngm k follows with the, again, block diagonal matrix e, whose blocks e_i select the compartments where new infections enter. the baseline transmission rate β_0 may be factored out from k; hence, k = β_0 · k̂, where k̂ may be readily constructed, and we can compute its spectral radius ρ(k̂). then, we obtain the baseline transmission rate from the assumed basic reproduction number r_0 as β_0 = r_0 / ρ(k̂). for other scenarios, the final steps are altered to align with the desired reproduction number r, resulting in an appropriate β_0 and then the scaled transmission rates β^{(k,i)}_s. we omit presenting all transmission matrices but give the computed baseline transmission rates in table . 
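the final calibration step above can be sketched directly: once a scaled matrix k̂ is constructed so that k = β_0 · k̂, the spectral radius fixes the baseline transmission rate via β_0 = r_0 / ρ(k̂). the 3×3 matrix and the r_0 value below are toy placeholders standing in for the model's actual k̂.

```python
# Baseline-transmission-rate calibration sketch: beta0 = R0 / rho(K_hat),
# with the spectral radius found by power iteration (inf-norm normalization).

def spectral_radius(mat, iters=500):
    """Dominant eigenvalue of a nonnegative matrix via power iteration."""
    n = len(mat)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam

k_hat = [[2.0, 1.0, 0.5],
         [1.0, 1.5, 0.5],
         [0.5, 0.5, 1.0]]   # toy stand-in for the scaled NGM
R0 = 2.2                    # assumed basic reproduction number (placeholder)
beta0 = R0 / spectral_radius(k_hat)
print(f"baseline transmission rate beta0 = {beta0:.3f}")
```

by construction, scaling k̂ by the resulting β_0 yields a matrix whose spectral radius is exactly the assumed r_0, which is the consistency property the calibration relies on.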
we use the compartmental model described above to explore possible future scenarios, assuming widespread transmission in the population. in particular, we investigate the disease dynamics when different levels of general reductions of transmission, compared to the baseline, are in place. by manipulating the contact matrix, we investigate the effect of age-specific interventions, such as school closures and special measures aimed to protect the elderly. seasonality of respiratory viruses can be attributed to a combination of factors, including the survival of the virus in different environmental conditions, changes in contact patterns (such as school holidays), less time spent in closed spaces where the highest number of transmissive contacts are made, and potentially seasonal changes in the health conditions of the population as well. to express this behavior, we define a time-dependent parameter by which we scale the transmission rate β. parameter c denotes the magnitude of the effect of seasonality on the number of contacts. using such a time-dependent transmission rate, we compare possible disease dynamics generated by the interplay of control measures with different degrees of seasonal behavior. spatial heterogeneity is also considered using our patch model, where the country is divided into distinct geographic regions (patches). the transmission dynamics is described within each patch by our compartmental model (but potentially with different parameters and age group composition), and individuals may move between those patches. for obvious reasons, individuals in compartments i i h , i i c , i i cr and d i do not travel. let travel p,q denote the number of travels from patch p to patch q. 
to derive travel rates t p,q for each age group i, we divide the number of travels by the population of the appropriate patch. numerical simulations for such situations show the differences in the transmission dynamics, healthcare demand, mortality, and overall disease burden. these scenarios are summarized in table . our work has several limitations. due to limited testing and the large number of asymptomatic and mild cases, there was huge uncertainty in the number of true cases, especially in the early weeks. now, with the help of [ ] , we have a good estimate of the overall ascertainment rate over this period, but it is still unclear how this rate evolved in time. the transmission model has the same weaknesses that all compartmental models have: we assume a homogeneous population with random mixing, apart from the age structure. we added some further heterogeneity in space (patch model) and time (seasonality). in our scenarios, we assumed a constant reduction in transmission, while in reality the control measures and the behavior of the people were continuously changing. hence, such scenarios cannot be considered predictions, as we cannot expect such unchanging circumstances for months. the role of children in this pandemic is still not clear; in our modeling, we assumed that they are equally susceptible and equally infectious once they develop symptoms, but we used an age-specific probability for developing symptoms. since our transmission model is deterministic, it is suitable only when there is significant spread in the population. for very low case numbers, the development of the epidemic is largely influenced by random events. stochastic effects are important when considering extinction or resurgence of the disease, and possible case importations after travel restrictions are lifted. however, these issues are not in the scope of the present work. the model has a large number of parameters, many of which are uncertain.
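The travel-rate computation can be sketched as follows, with placeholder populations and travel counts (an age-resolved version would divide by the traveling subpopulation of each group instead):

```python
# Placeholder populations and daily travel counts (not the paper's data):
# the per-capita travel rate from patch p to q divides the number of daily
# travels by the population of the origin patch.
populations = {"central": 3_000_000, "rest": 6_800_000}
travels = {("central", "rest"): 20_000, ("rest", "central"): 20_000}

travel_rates = {
    (p, q): n / populations[p]  # per-capita daily rate of leaving p for q
    for (p, q), n in travels.items()
}
```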
the most important ones with regard to the burden on the healthcare system are the hospitalization rates, the probability of intensive care need, and the mortality, all of them depending on age. we do not have much data on this from hungary, hence we used parameters taken from the literature. a full sensitivity analysis is beyond the scope of this study, but we present a sensitivity chart for a crucial output of an outbreak, see section . , which is of concern in many countries: the peak icu demand, including the need for mechanical ventilators, to assure that all patients receive the necessary care and no additional excess mortality is caused by an overwhelmed healthcare system. this was one of the key questions in other modeling studies. the sensitivity analysis was conducted by running many simulations, sweeping through a two-parameter plane, and retrieving the icu peak from each individual run. the code can be found in [ ] . the first hungarian covid- cases were reported during the first week of march through the hungarian notifiable disease surveillance system operated by nphc, which is the source of the data described in this section (for the most recent information, see [ ] ). the first case, an iranian -year old man (studying and residing in hungary) who had recently returned from tehran, was reported on march . by may , the cumulative number of reported confirmed covid- cases was ( . cases per , population), including deaths (crude cfr . %); see figure for the daily reported numbers. out of the cases, . % ( , cases) occurred in the + age group, . % ( cases) in the - age group, . % ( cases) in the - age group, and . % ( cases) among people under years old. age-specific morbidity was highest in the + age group ( . cases per , population) and more than twice the overall rate in the - age group ( . ). out of the deaths, . % ( deaths) belonged to the + age group. as seen in figure , the highest crude cfr was observed in the + age group ( . %), followed by the - age group ( .
%) and the - age group ( . %). no deaths were reported under years of age, see figure . additional details are provided in table . out of the cases, . % ( cases) were female and . % ( cases) male (gender is unknown for two cases). the morbidity among women was higher ( . vs. . cases per , population), so men were . times ( % ci . - . ) less likely to become ill. however, men aged years and older had a . times ( % ci . - . ) higher risk of dying than women aged years and older ( . vs. . cases per , population). at the stage of data consolidation as of may , , we have information about the symptoms of . % ( cases) of the cases. out of these, . % ( cases) had no symptoms, . % ( cases) had mild symptoms, and . % ( cases) had severe disease (including cases that required intensive care and/or ventilation). most of the cases were reported from the central part of hungary: from the capital ( cases) and the surrounding pest county ( cases). see figure for a comparison of the capital region with the rest of hungary. the morbidity (per , population) was also the highest in budapest ( . ). the epidemic curve ( figure ) reflects a propagated-source epidemic, especially when we consider only those cases that cannot be connected to outbreaks in closed communities (like long-term care facilities or hospitals) or to health care associated infections. out of the cases, . % ( cases) were associated with health care and/or outbreaks in hospitals, contributing to the daily reported new cases since mid-march. health care workers had a . times ( % ci . - . ) higher risk of becoming a confirmed covid- case in comparison to the general population ( . vs. . cases per , population). out of the cases, . % ( cases) were reported from long-term care facilities (nursing homes and other closed communities like homeless shelters), contributing to the daily reported new cases since early april. at the peak of the epidemic curve, .
% ( cases) of the cases on april were reported from the same retirement and assisted living facility. figure shows the results of the real-time estimation of the reproduction number. it showed a steady decline, apart from an outlying effect in early april, became close to or even below by mid-april, and has remained at that level since then. this conclusion is robust to the chosen methodology. results of the real-time estimation of the cfr are shown in figure . note that, as the outbreak is coming to its end, the naive method converges to the final value that was already well estimated almost a month earlier by the corrected technique. (the naive estimator is still increasing, as deaths continue to occur while the case count is already low at the end of the epidemic.) the final cfr characterizing this phase in hungary is about %. various ifr estimates have been published, for example . % for china [ ] and . % for the uk [ ] . recent serological studies found ifr values spanning from . % in a german town [ ] to . % in milan [ ] . note that the testing intensity, and therefore the ascertainment rate, may very well change over time, e.g., with the increase of testing intensity. this analysis is based on the data from the early phase as a whole and is therefore considered an estimate of the average. the results of the estimation of the ascertainment rate are shown in table , where we explore a reasonable range of ifrs from . % to . %. note that earlier estimates based on [ , ] are consistent with the preliminary results of a large-scale hungarian sero-epidemiological study [ ] . most studies concerning the early growth rate of the epidemic in wuhan estimated the value of the basic reproduction number to be around . - . (see, e.g., [ , ] ); later studies regarding the spread in other countries [ , ] also used similar values. our estimations given in section . show that in hungary the highest value of the effective reproduction number was . , by the wallinga-teunis method.
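The contrast between the naive and a corrected CFR estimator can be illustrated with toy cumulative counts; the simple correction below divides deaths by closed cases (deaths plus recoveries), which is one common adjustment and not necessarily the paper's exact method:

```python
# Toy cumulative counts, not real data.
def naive_cfr(deaths, confirmed):
    # naive estimator: deaths over all confirmed cases, including open ones
    return deaths / confirmed

def corrected_cfr(deaths, recovered):
    # simple correction: deaths over closed cases (known outcomes) only
    closed = deaths + recovered
    return deaths / closed if closed else float("nan")

# mid-epidemic snapshot: many cases still have an unknown outcome
mid_naive, mid_corr = naive_cfr(30, 2000), corrected_cfr(30, 570)
# end of the epidemic: nearly all cases are closed, the estimators converge
end_naive, end_corr = naive_cfr(100, 2000), corrected_cfr(100, 1890)
```

Mid-epidemic the naive estimator is biased low because open cases sit in its denominator; once almost every case has resolved, the two estimates agree, which matches the convergence described above.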
hence, we choose r = . for the basic reproduction number (comparable with a similar reproduction number for germany in the early phase [ ] and . for italy [ ] ). modeling studies [ , , , ] highlighted that the worst-case, i.e., "do nothing" scenarios lead to an outbreak in which the healthcare demand substantially exceeds the capacities at the peak and the overall mortality reaches severe levels. given the current level of preparedness, we do not consider a "do nothing" scenario, and our most pessimistic case assumes that, even in the absence of any control measures, a % reduction in transmission is realized due to population awareness and behavior. on the other hand, the best case is the continuation of the current suppression scenario with r ≈ , resulting in very small case numbers. however, it is questionable whether this can be sustained until a vaccine is developed and deployed. below, we consider three scenarios illustrating the loss of control for suppressing the outbreak and assuming a wide community spread of the disease. the efficacy of the mitigation efforts is expressed as a percentage reduction in transmission. the primary tool for this is the decrease of contact numbers, but other preventive measures such as hand hygiene or mask wearing may also contribute to the reduction of transmission. first, let us consider a weak control of the epidemic, assuming there is no centralized control measure introduced, but the number of transmissions is reduced by % owing to a level of behavioral response due to social awareness. such a reduction decreases the reproduction number to r = . . the first column of figure shows the hospitalization and icu demand in the top row and the daily incidences in the bottom row as functions of time with the application of this weak control. according to the simulations, in this case, there would be approximately . million infections with about , deaths by the end of the outbreak.
this suggests that we can expect % of the population to gain immunity against the virus, and this number is slightly larger than the herd immunity threshold (that is, − /r ∼ . % with r = . for the "do nothing" scenario). at the peak, there would be a need for more than icu beds and for , hospital beds with such a weak measure. we remark that there is a -day window when the daily incidences exceed , , and during this period more than . million people ( % of the population) get infected. in other words, % of all the infections occur during these three weeks. for further details, see table . we perform similar simulations for the case of a moderate control, assuming that the reproduction number is decreased to r = . as a result of the control measures. the simulations (second column of figure ) show that the numbers of hospital beds and icu beds needed are significantly reduced, to and at the peak, respectively. meanwhile, the daily incidence at the peak is around , . we expect almost . % of the population to be infected throughout the epidemic and gain immunity upon recovery. this is less than required to reach herd immunity. for further information, we refer to table . finally, we consider a stronger control achieving a % reduction of transmission. this results in a decrease of the reproduction number to r = . . the outcome of this strong control is shown in the third column of figure . a control of such strength significantly reduces the number of all infected and hospitalized cases and of those needing intensive care treatment. the number of required intensive care beds (around ) is far below the available capacity even at the peak of the epidemic, and the number of hospital beds needed is also reduced to a rather low level, around at the peak. the total number of fatalities in this scenario is about .
meanwhile, the epidemic would last for more than a year, and the cumulative number of all infected remains far below the herd immunity threshold, so we can expect further outbreaks when the measures are relaxed. several key parameters of the model are highly dependent on age. intervention strategies and the relaxation of various measures have to take into account the fact that different age groups have different risks and different roles in the transmission. although the number of children infected with covid- reported worldwide has been relatively small in comparison with other age groups [ ], some evidence shows that children and adolescents may become infected and spread the disease like other age groups [ , ] . moreover, children and adolescents usually have a high number of contacts. thus, school closures can be expected to be an efficient tool to reduce contacts and transmission. besides school closures, it is important for younger individuals to avoid meeting older and other high-risk people. elderly people have a higher chance of developing symptoms, and a higher percentage of them need hospitalization and intensive care, hence these groups need more protection. age-specific interventions include avoiding contacts with the elderly by providing special time slots for shopping, in post offices, etc., or closing/reopening schools. the introduction of various age groups in our model enables us to study such age-specific interventions and to analyze their direct and indirect effects on all groups. in the stacked diagrams of figure , we present the contributions of the age groups to the mortality and to the number of recovered individuals. the columns of this figure show the effect of the weak, moderate, and strong control that we previously discussed in detail in section . and table .
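The herd immunity threshold invoked in the scenario comparisons above is 1 − 1/R0, the immune fraction at which each case generates at most one secondary case; a one-line helper makes the arithmetic explicit:

```python
def herd_immunity_threshold(R0):
    # fraction that must be immune so each case causes at most one new case
    return 1.0 - 1.0 / R0

# e.g., for reproduction numbers between 2 and 2.5 the threshold
# lies between 50% and 60% of the population
```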
here, we would like to emphasize that, under each control measure, the most vulnerable age groups are the elderly ( - , - , +), as they suffer most of the fatalities, while they are predicted to produce only a small fraction of the cases in the population. figure . age-specific mortality and recovery. the figure shows the effect of the weak, moderate, and strong control ( %, %, and % general contact reduction, respectively). every age group covers at most one decade, except the group of "middle aged" that represents three decades. according to our model, elderly people ( +) are predicted to produce most of the fatalities in each scenario. the legend on the bottom applies to all figures. we consider two school closure scenarios: an optimistic and a pessimistic one (with respect to the outcome of the outbreak); both use the weak control scenario ( % general decrease in transmission, cf. section . ) as a starting point. the optimistic case consists of omitting the school component of the contact matrix and halving the other contacts [ ] of children and young adults (between ages and ), which provides a new global contact matrix for this intervention. in the pessimistic scenario, we omit the school component of the contact matrix as well, but, instead of halving, we consider a % increase in the other contacts of children and young adults. arguably, the students might replace some school contacts with new other contacts due to other activities. however, many such contacts are lost as well: for example, they do not use public transportation to/from school, and extracurricular activities also drop. since the exact balance is difficult to estimate, our two closure scenarios serve as bounds for this range of possibilities. note that, by school closure, we mean the closure of educational institutions from preschools to universities. as a reference, we also include the weak control scenario in this analysis.
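The two closure scenarios amount to simple operations on the contact matrix; the sketch below uses a toy 3-group matrix (children, adults, elderly) with hypothetical entries, removing the school component and scaling the remaining contacts of the young groups:

```python
import numpy as np

# Toy 3-group contact matrices (children, adults, elderly); entries are
# hypothetical average daily contact numbers, not fitted values.
school = np.array([[4.0, 1.0, 0.0],
                   [1.0, 0.2, 0.0],
                   [0.0, 0.0, 0.0]])
other = np.array([[4.0, 3.0, 0.5],
                  [3.0, 6.0, 1.0],
                  [0.5, 1.0, 2.0]])
young = [0]  # indices of the groups affected by school closure

def closure_matrix(factor):
    """Drop the school component and scale the remaining contacts of the
    young groups by `factor` (0.5 for the optimistic halving; a value > 1
    for the pessimistic increase -- the paper's exact percentage is elided)."""
    m = other.copy()
    m[young, :] *= factor              # contacts of the young with everyone
    m[:, young] *= factor              # everyone's contacts with the young
    m[np.ix_(young, young)] /= factor  # undo the double scaling within young
    return m

baseline = school + other
optimistic = closure_matrix(0.5)
pessimistic = closure_matrix(1.3)
```

Scaling both the rows and the columns of the young groups keeps the matrix symmetric, while the diagonal correction ensures contacts within the young groups are scaled only once.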
figure shows that this measure decreases the peak hospital bed and icu needs to approximately % compared to the case when we only apply weak control in the optimistic scenario, and by % in the pessimistic one. moreover, closing schools postpones the peak of the epidemic (by about one month in the above setting), suggesting that children may play a significant role in transmission due to their large number of contacts, even though they make a negligible contribution to the overall mortality (cf. top row of figure ). note that this conclusion is based on the assumption that all age groups are equally susceptible and symptomatic children are as infectious as adults, with age-specific differences appearing only in the probability of developing symptoms, which is much smaller for children in our model (see parameters p i in table ). figure . effect of school closure, in addition to the % contact reduction, in the pessimistic and the optimistic approach. simulations suggest that school closures, if maintained for a long period, effectively decrease peak hospital bed and icu needs and significantly postpone the peak of the epidemic. the effect of school closure combined with the % general reduction in transmission is comparable, in the optimistic case, with the effect of moderate control ( % reduction in transmission, cf. section . ) regarding the peak hospital bed and icu need, but it is not as significant in decreasing the mortality (figure , middle column). however, to achieve this, schools need to be closed for an extended period of time, which may not be feasible. we also point out that a standalone closure of preschools and primary schools is not sustainable without a certain amount of home office for the parents, but this opens up sociological and economic questions that we do not address here.
the elderly are the most vulnerable group of the population when it comes to the relaxation of measures introduced against the spread of covid- . most countries handle these age groups separately from the rest of the population: e.g., separate time slots for shopping continue to exist, and the elderly are encouraged to keep the same level of social distancing [ , ] . to include these effects in our model, we manipulate the entries of the contact matrix involving the older age groups separately from the remaining parts. figure illustrates that, in addition to the weak control, if a % or % reduction of the outside-household connections of elderly people is applied, then we can expect about % and % reductions in the hospital and icu bed needs and in the mortality, respectively. the epidemic curves only slightly shift to the right, suggesting that elderly people do not play an important role in the transmission of the disease due to their low number of contacts. in addition, a % reduction of contacts outside the household is not feasible, as this would mean the complete isolation of a large sub-population; we plotted this scenario only to show the theoretical limits of this approach. figure . protection of the elderly: % and % contact reduction for elders outside of households, in addition to the general % contact reduction (weak control). the figures show the effect of an additional contact reduction of elderly people in the case of a weak control and suggest that the selective protection of elderly people can successfully reduce the peak icu need and the overall mortality, yet it has a theoretical limit. in this section, we investigate the epidemic curves in the case of the weak, moderate, and strong control with seasonality of various strengths expressed by the parameter c ∈ { , . , . , . }, see ( ) .
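The exact functional form of the seasonality multiplier ω(c, t) is not reproduced here; a common choice consistent with the description (equal to 1 − c at the summer minimum of transmission and to 1 half a year later) is a cosine, sketched below with an assumed phase:

```python
import numpy as np

# Assumed cosine form of the seasonality multiplier omega(c, t): equal to
# 1 - c at the summer minimum (around day t_min) and to 1 half a year later.
def omega(c, t, t_min=172, period=365.0):
    return 1.0 - c / 2.0 * (1.0 + np.cos(2.0 * np.pi * (t - t_min) / period))

# The time-dependent transmission rate is then beta(t) = omega(c, t) * beta0,
# with c = 0 recovering the non-seasonal model.
```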
during the summer, these values of c yield a %, %, and % further decrease in transmission, as that is when the seasonality curve attains its minimum. the case c = means that there are no seasonal effects at all, while c = . is a strong seasonality, similar to h n [ ] . see the top left image of figure for the seasonality functions ω(c, t) corresponding to the different values of c. as we have seen in section . , decreasing the reproduction number decreases and postpones the peak of the epidemic curves. seasonality causes a similar delay in the peak of the epidemic due to decreased transmission rates in the summer months. counter-intuitively, it cannot be said in general that stronger seasonality leads to a smaller peak (cf. bottom left image of figure ). the reason for this is that the impact of seasonality is not only determined by the decrease in the transmission rate: the temporal relation between the peak of the epidemic and the minimum of the seasonality function is also an important factor. this phenomenon is well illustrated in figure , where three scenarios (weak, moderate, and strong control) are presented along with the assumed seasonality functions for the aforementioned values of c. in the upper right image of figure , corresponding to a weak control, one can observe that increasing the effect of seasonality first decreases the peak, but, after a certain value (c = . in our example), the epidemic is so strongly suppressed in the summer months that the peak shifts to the right and even slightly increases in the winter months compared to the c = . scenario. for the case of moderate control, shown in the lower left figure, this effect is much more significant. note that the peak of the epidemic (without seasonality) is so far from summer (the minimum of the seasonality curves) that increasing the effect of seasonality results in a significantly higher peak.
it can be seen that strong seasonality produces a long "plateau" phase during which the epidemic curve does not increase for a period of six months. during this time, only a small fraction of the population goes through the infection, and a massive number of susceptibles remain in the system, only to get infected a few months later. this phenomenon is responsible for the increased peak of c = . compared to the c = . case. the lower right figure shows that the reduction of transmission during the warm months, together with a strong control, can decrease the number of infected to such an extent that the peak, even if arriving in the winter months, is significantly smaller. a general observation is that seasonality has the largest impact on the epidemic curve if the peak time is close to the summer months. of course, this is highly dependent on the starting time of the outbreak. hungary is a relatively small country; however, significant differences were observed between regions in the reported case numbers. the capital, budapest, has . million inhabitants, and a further . million people live in the surrounding pest county. budapest and pest county are highly connected by commuters, with connections to other regions as well [ ] . the high connectivity of the capital with other countries contributed to the earlier appearance of the disease in budapest, and most of the cases were reported from this central region of the country. to address the role of spatial heterogeneity in the evolution of the epidemic curve, we considered a metapopulation model as in ( ) . hence, the population is distributed among patches representing geographic regions of the country. for the sake of simplicity, here we only present results from a two-patch model, separating budapest and pest county (patch , population of approx. , , ) from the remaining parts of the country (patch , population , , ). we assumed a different transmission parameter β for each patch.
based on hungarian mobility data on commuters [ ] , we assumed , daily travels between the two patches under normal circumstances and investigated the effect of the lockdown of budapest and the surrounding pest county by decreasing the number of daily travels to , . we considered the contact matrix of both patches to be the same as in the uniform model described in section . . . the biological and medical parameters are assumed to be the same in each patch, but the local reproduction number may differ, as may the age structure of the population. the left-hand side of figure illustrates that the two-patch model reproduces the uniform model when we use the same r = . for both patches as well as for the uniform model and assume , daily travels between the patches. the middle figure shows that the uniform model slightly overestimates the size of the epidemic, as the peak of the aggregated two-patch model is smaller than that of the uniform model when r = . remains the same and we reduce the daily travels to , , corresponding to the separation of budapest and pest county from the other regions. although the epidemic curves of the patches are shifted, the aggregated result shows that this setup does not provide significantly different dynamics. lastly, on the right-hand side of figure , we further investigate the scenario of , daily travels and choose the local reproduction numbers of the patches to vary around r = . , namely, we take r budapest = . and r other regions = . . these values were selected to reflect the higher population density of the capital, proportionally to the population in the two patches. due to the difference in the local reproduction numbers, we may observe an increased number of cases in budapest with an earlier peak, and fewer infections in the other regions. figure . epidemic curves of the regions: sum of the infective compartments (i p , i a, , i a, , i a, , i s, , i s, , i s, ).
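A minimal two-patch SIR sketch captures the qualitative behavior described here; all populations, travel volumes, and local reproduction numbers below are illustrative placeholders, and the paper's full model tracks many more compartments:

```python
import numpy as np

# Illustrative two-patch SIR with commuting.
N = np.array([3_000_000.0, 6_800_000.0])  # patch populations (placeholders)
R_loc = np.array([2.4, 2.1])              # local reproduction numbers
gamma = 1 / 7
beta = R_loc * gamma
travel = 20_000.0                         # daily travels in each direction
leave = travel / N                        # per-capita daily leave rates

def simulate(days=300, dt=0.25):
    S = N - np.array([100.0, 0.0])  # epidemic seeded in patch 0 only
    I = np.array([100.0, 0.0])
    history = []
    for _ in range(int(days / dt)):
        new_inf = beta * S * I / N * dt
        mS, mI = leave * S * dt, leave * I * dt   # commuters in each class
        S = S - new_inf - mS + mS[::-1]           # swap movers between patches
        I = I + new_inf - gamma * I * dt - mI + mI[::-1]
        history.append(I.copy())
    return np.array(history)

prevalence = simulate()
# the seeded, higher-R patch peaks earlier than the other patch
```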
first, we consider identical reproduction numbers r = . for both patches (budapest with pest county, and the other regions). without any travel reductions, the two-patch model gives results identical to the one-patch version, as seen in the left figure. next, if travel reductions are put in place, the one-patch model slightly overestimates the size of the epidemic for equal r values. finally, assuming different reproduction numbers and a large reduction in travel, the peak occurs earlier in the patch with larger r (budapest and pest county); furthermore, the one-patch model and the aggregated two-patch model differ in both the time and the size of the peak. for an uncontrolled epidemic in the uk, ref. [ ] estimated a peak icu bed demand more than times greater than the maximum capacity in these countries. in a study for the united states, ref. [ ] projected that, at the outbreak peak, three times more icu beds would be needed than the total number of icu beds in the us, and that % isolation of cases would reduce the demand for icu beds to the normal capacity. for the Île-de-france region, ref. [ ] estimated that the peak number of icu beds needed would exceed the regional capacity more than times if no strategy were implemented after lockdown, and that only efficient case finding and isolation, applied in parallel with social distancing, could keep icu demand below the maximum capacity throughout the epidemic. for australia, ref. [ ] studied three capacity expansion scenarios ( , , and times expansion, respectively), and, even in the mitigated scenarios, demand was estimated to be higher than the number of available beds. additional social distancing measures were shown to reduce the epidemic to a level where a reasonable expansion of icu capacity can be sufficient. the peak icu demand crucially depends on two factors: the probabilities of developing severe disease, and the shape (in particular, the peak size) of the epidemic curve.
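The two-parameter sweep behind such a sensitivity chart can be sketched with a toy SIR model, recording for each (transmissibility, severity) pair the peak ICU demand, assumed to be proportional to prevalence; all numbers are illustrative stand-ins for the full age-structured model:

```python
import numpy as np

def peak_icu(R0, icu_frac, N=9_800_000.0, gamma=1 / 7, days=600, dt=0.5):
    # simple SIR run; the peak ICU demand is a fixed fraction of prevalence
    beta = R0 * gamma
    S, I, peak = N - 100.0, 100.0, 0.0
    for _ in range(int(days / dt)):
        new_inf = beta * S * I / N * dt
        S -= new_inf
        I += new_inf - gamma * I * dt
        peak = max(peak, icu_frac * I)
    return peak

R0_grid = np.linspace(1.1, 2.5, 8)        # transmissibility axis
sev_grid = np.linspace(0.001, 0.005, 8)   # severity axis (ICU fraction)
heat = np.array([[peak_icu(r, s) for s in sev_grid] for r in R0_grid])
# 'heat' is the matrix rendered as a heatmap of peak ICU demand
```

In this toy version severity only rescales the peak, whereas transmissibility reshapes the whole epidemic curve, which mirrors the observation that the peak ICU demand is more sensitive to the reproduction number than to severity.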
we plotted a heatmap of the peak icu demand in figure , compiled from hundreds of numerical simulations. transmissibility (vertical axis) is expressed by the reproduction number r. disease severity, for simplicity, is expressed by the ifr; in fact, here we used a scaling factor for the probability of hospitalization, with the baseline corresponding to the parameters in table . in our weak control scenario (section . ), the ifr is . %, which is a bit lower than the finding of [ ] . however, during the first wave in hungary, the schools were closed and covid- disproportionately affected the vulnerable population. in our scenarios, we assume widespread community spreading, hence younger generations appear in higher numbers, and thus the ifr is expected to be smaller. in any case, by scaling the hospitalization rate (while leaving the probabilities of intensive care and of fatal outcome given hospitalization intact), we explored a wider range of ifrs. we found that the peak icu demand can indeed vary across a large interval. from the shape of the level curves in the heatmap, we can conclude that the peak icu demand is more sensitive to r than to the ifr, hence flattening the curve is indeed of utmost importance to avoid exceeding healthcare capacities [ ] . in figure , the white dot marks our most pessimistic scenario (weak control). the most important implemented measures are summarized in table . to assess their impact, we compared the reported case numbers, adjusted by the ascertainment rate : , to the simulated outbreak curve with r = . ( figure on the left, logarithmic scale). here, we assumed that the ascertainment rate did not change in time, which may not be the case. one can see that the epidemic was initially on the r = . trajectory, which could have resulted in substantially more infections. the data show a clear deviation from this scenario in early april, two weeks after strict social distancing started.
the slope of the epidemic curve further decreased in mid-april, following the stay-at-home measures by two weeks. overall, due to the compliance of hungarian society with the social distancing measures, around half a million infections were averted by the end of april compared to the "do nothing" scenario, and this number could have reached - million in may had further doublings been allowed. the first covid- case was detected, laboratory confirmed, and then reported through the hungarian notifiable disease surveillance system on march . well-tailored, effective, combined non-pharmaceutical control measures were introduced promptly in hungary in the very early phase of the outbreak (see table ), accompanied by a high level of compliance with social distancing. online surveys [ ] , polling, and indirect data (such as traffic data, passenger volumes on public transportation, etc.) all showed a drastic reduction in the number of contacts and in mobility. in particular, the online questionnaire maszk [ ] showed a - % decrease (depending on the locality) in the daily number of physical contacts as well as in the number of close contacts per capita, based on the replies of , respondents by may , constituting a non-representative but rather large sample. accordingly, the hungarian epidemic curve was strongly suppressed. as of may , the cumulative number of reported confirmed covid- cases was ( . cases per , population), including deaths. the epidemic peaked on april with newly reported cases. sars-cov- was not able to sustain long transmission chains in the community; however, it was able to cause outbreaks, mostly in healthcare institutions and long-term care facilities: nearly two thirds of the reported cases are connected to such institutions. the proportion of cases among health care workers gradually increased during the epidemic; they had a tenfold risk of becoming confirmed covid- cases compared to the general population.
due to the effective measures, the virus could not spread significantly from closed communities and health care workers to the wider population. the age-specific cfr showed a pattern similar to other countries: of the deaths reported by may, ( . %) belonged to the + age group. we tracked the temporal variation of the effective reproduction number in real time, which showed a steadily decreasing trend, interrupted by an outlying outbreak in a long-term care facility. we identified the time intervals when the effective reproduction number was below or around the critical threshold . the adjusted cfr was also estimated in real time and predicted the eventual cfr well one month in advance. benchmarking the cfr against other countries, we estimated the underascertainment rate to be - fold and the true cumulative number of covid- cases to be between , and , . these results are consistent with the preliminary results of a large-scale sero-epidemiological survey carried out in hungary in may , where the seroprevalence of sars-cov- infection was estimated to be between , and , [ ] . based on these data and the number of reported cases, the underascertainment is likely to be between . and . , the true cfr may be lower than . %, and the ifr is roughly half of that. as control measures have been successively relaxed since may , we established an age-structured compartmental model to investigate several post-lockdown scenarios and projected the epidemic curves and the demand for critical care beds assuming various levels of sustained reduction in transmission. special measures designed to reduce the contact numbers of the elderly population, as well as school closures, can reduce the peak hospital bed demand and the overall mortality; however, these measures also have their limitations.
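The benchmarking calculation behind the underascertainment estimate can be made explicit: an assumed IFR converts reported deaths into an estimate of true infections, and comparing with reported cases gives the underascertainment factor. All inputs below are illustrative placeholders, not the paper's (partly elided) figures:

```python
def underascertainment(deaths, reported_cases, ifr):
    # benchmark reported deaths against an assumed infection fatality rate
    true_infections = deaths / ifr
    factor = true_infections / reported_cases  # how many-fold cases were missed
    return true_infections, factor

# hypothetical inputs: 500 deaths, 3,500 reported cases, IFR of 0.5%
true_inf, factor = underascertainment(deaths=500, reported_cases=3_500, ifr=0.005)
```

Sweeping `ifr` over a plausible range, as done in the paper's table, yields the corresponding range of true case counts and underascertainment factors.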
A metapopulation version of the transmission dynamics model has also been studied, and we reported some results for a two-patch case, where the Budapest region is considered separately from the rest of the country. Due to the high connectedness, the epidemic curves of the two-patch system are not much different from the spatially uniform case. To achieve a noticeable reduction in the overall peak size due to spatial heterogeneity (where the local peak times are shifted between the regions), a large reduction in the mobility rates is necessary. Since the majority of the population is still susceptible (over %, according to [ ]), even a weak or moderate reduction in transmission relative to the baseline could result in a large second outbreak with significant mortality and high peak ICU demand. Therefore, a high level of alertness needs to be maintained to avoid such scenarios. The seasonal behaviour of SARS-CoV-2 is not completely understood yet [ , ], so we considered a range of possibilities from the absence of seasonality to a strong seasonality similar to that of H1N1. The interplay of seasonal effects with post-lockdown contact numbers can generate a variety of disease dynamics; thus, a confident forecast of the timing and size of a potential second wave is not possible at the moment. The effectiveness of strict social distancing measures, such as school closures and stay-at-home orders with good compliance, is likely to be very high; however, such interventions have negative consequences for society and the economy and are thus not sustainable in the long term. Modelling results [ , ] suggest that combined multiple interventions, including a moderate contact decrease, a high COVID-19 detection rate, effective contact tracing, and good compliance with personal protective instructions, may have a substantial impact on transmission and are able to keep the reproduction number around one.
Situation reports; World Health Organization
France: surveillance, investigations and control measures
CoV-2 was already spreading in France in late
Investigation of a COVID-19 outbreak in Germany resulting from a single travel-associated primary case: a case series
COVID-19 epidemic in Italy: evolution, projections and impact of government measures
WHO. Statement on the second meeting of the International Health Regulations ( ) Emergency Committee regarding the outbreak of novel coronavirus
Director-General's opening remarks at the media briefing on COVID-19; World Health Organization
(COVID-19) in the EU/EEA and the UK: ninth update
Multiple SARS-CoV-2 introductions shaped the early outbreak in Central Eastern Europe: comparing Hungarian data to a worldwide sequence data-matrix
COVID-19 announcements of Hungary
R: a language and environment for statistical computing; R Foundation for Statistical Computing
Elegant graphics for data analysis
table: extension of 'data.frame'. R package version . .
shiny: web application framework for R
Real-time epidemiology of COVID-19 in Hungary (A magyarországi koronavírus járvány valós idejű epidemiológiája; in Hungarian)
Real-time epidemiology of COVID-19 in Hungary (A magyarországi koronavírus járvány valós idejű
A new framework and software to estimate time-varying reproduction numbers during epidemics
Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
How generation intervals shape the relationship between growth rates and reproductive numbers
The effective reproduction number of pandemic influenza: prospective estimation
Association of public health interventions with the epidemiology of the COVID-19 outbreak in Wuhan
Effective reproduction number estimation
Estimating in real time the efficacy of measures to control emerging communicable diseases
Serial interval of COVID-19 among publicly reported confirmed cases
Epidemiological characteristics of COVID-19 cases in Italy and estimates of the reproductive numbers one month into the epidemic
Serial interval of novel coronavirus (COVID-19) infections
The R package: a toolbox to estimate reproduction numbers for epidemic outbreaks
R : estimation of R and real-time reproduction number from epidemics. R package version . -
Estimate time-varying reproduction numbers from epidemic curves. R package version . - .
Methods for estimating the case fatality ratio for a novel, emerging infectious disease
Assessing the severity of the novel influenza A/H1N1 pandemic
Early epidemiological assessment of the virulence of emerging infectious diseases: a case study of an influenza pandemic
Real-time estimation of the risk of death from novel coronavirus (COVID-19) infection: inference using exported cases
Incubation period and other epidemiological characteristics of novel coronavirus infections with right truncation: a statistical analysis of publicly available case data
RStan: the R interface to Stan
Using a delay-adjusted case fatality ratio to estimate under-reporting ( ).
CMMID. Risk assessment of novel coronavirus COVID-19 outbreaks outside China
Modelling the impact of COVID-19 in Australia to inform transmission reducing measures and health system preparedness
Expected impact of lockdown in Île-de-France and possible exit strategies
A first study on the impact of current and future control measures on the spread of COVID-19 in Germany
Report : impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
Projecting hospital utilization during the COVID-19 outbreaks in the United States
COVID-19 epidemic risk assessment for Georgia
Temporal dynamics in viral shedding and transmissibility of COVID-19
Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia
Effects of latency and age structure on the dynamics and containment of COVID-19
Appropriate models for the management of infectious diseases
Hungarian data supply questionnaire (MASZK; Magyar adatszolgáltató kérdőív; in Hungarian)
Code basis for COVID modelling in Hungary
Contact tracing assessment of COVID-19 transmission dynamics in Taiwan and risk at different exposure periods before and after symptom onset
Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing
Features of , hospitalised UK patients with COVID-19 using the ISARIC WHO Clinical Characterisation Protocol
Report on COVID-19 in critical care
Incidence, clinical outcomes, and transmission dynamics of severe coronavirus disease in California and Washington: prospective cohort study
Children with COVID-19 in pediatric emergency departments in Italy
Projecting social contact matrices in countries using contact surveys and demographic data
The construction of next-generation matrices for compartmental epidemic models
Preliminary results of the H-UNCOVER study
Estimates of the severity of coronavirus disease : a model-based analysis
Infection fatality rate of SARS-CoV-2 infection in a German community with a super-spreading event
SARS-CoV-2 seroprevalence trends in healthy blood donors during the COVID-19 Milan outbreak
Pattern of early human-to-human transmission of Wuhan
COVID-19 outbreak in Italy: estimation of reproduction numbers over two months toward the phase . medRxiv
Ministry of Health, Welfare and Sport, Netherlands. Children and COVID-19
An analysis of SARS-CoV-2 viral load by patient age
COVID-19 and the consequences of isolating the elderly
Seasonal transmission potential and activity peaks of the new influenza A (H1N1): a Monte Carlo likelihood analysis based on human mobility
The salient targets of commuters
Social distancing strategies for curbing the COVID-19 epidemic
Temperature, humidity, and latitude analysis to estimate potential spread and seasonality of coronavirus disease (COVID-19)
CMMID COVID-19 Working Group. Effects of non-pharmaceutical interventions on COVID-19 cases, deaths, and demand for hospital services in the UK: a modelling study
CMMID COVID-19 Working Group. Effectiveness of isolation, testing, contact tracing, and physical distancing on reducing transmission of SARS-CoV-2 in different settings: a mathematical modelling study

The authors declare no conflict of interest.

key: cord- - jee hx authors: waelde, k. title: how to remove the testing bias in cov-2 statistics date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: jee hx

Background. Public health measures and private behaviour are based on reported numbers of SARS-CoV-2 infections. Some argue that testing influences the confirmed number of infections. Objectives/Methods. Do time series on reported infections and the number of tests allow one to draw conclusions about actual infection numbers? A SIR model is presented where the true numbers of susceptible, infectious and removed individuals are unobserved. Testing is also modelled. Results. Official confirmed infection numbers are likely to be biased and cannot be compared over time. The bias occurs because of different reasons for testing (e.g.
testing by symptoms, representative testing, or testing travellers). The paper illustrates the bias and works out the effect of the number of tests on the number of reported cases. The paper also shows that the positive rate (the ratio of positive tests to the total number of tests) is uninformative in the presence of non-representative testing. Conclusions. A severity index for epidemics is proposed that is comparable over time. This index is based on COVID-19 cases and can be obtained if the reason for testing is known.

Background. Statistics have gained greatly in reputation during the COVID-19 pandemic. Almost everybody on this globe follows the numbers and studies "the curve" of recorded cases, of daily increases, or of incidences of CoV-2 infections. The open question. What do these numbers mean? What does it mean that we talk about "a second wave"? Intuitive interpretations of "the curve" suggest that the higher the number of new infections, say in a country, the more severe the epidemic is in that country. Is this interpretation correct? When the number of infections increases, decision makers start to discuss additional or tougher public health measures. Is this policy approach appropriate? Our message. Reported numbers of CoV-2 infections are probably not comparable over time. When public health authorities report X new cases on some day in October , these X new cases do not have the same meaning as X new cases in April, May or June . The bias results from different testing rules that are applied simultaneously. Private and public decision making should not be based on time series of CoV-2 infections, as the latter do not provide information about the true epidemic dynamics in a country. If the reason for testing were known, an unbiased measure of the severity of an epidemic could be computed easily. Our framework. We present a theoretical framework that allows one to understand the link between testing and the number of reported infections.
We extend the classic SIR model (Kermack and McKendrick, ; Hethcote, ) to allow for asymptomatic cases and for testing. Our fundamental assumption states that the true numbers of susceptible, infectious and removed individuals are not observed. Results. The reason for the intertemporal bias consists in relative changes of test regimes. If a society always employed only one rule for when tests are taken, e.g. "test for SARS-CoV-2 in the presence of a certain set of symptoms", then infection numbers would be comparable over time. If tests are undertaken simultaneously for several reasons, e.g. "test in the presence of symptoms" but also "test travellers without symptoms", and the relative frequency of these tests changes, a comparison of the number of reported infections over time bears no meaning. The paper illustrates the bias by a "second wave" in reported cases which, by true epidemiological dynamics, is not a second wave. Understanding this bias also provides an answer to one of the most frequently asked questions when it comes to understanding reported infection numbers: what is the role of testing? Do we observe a lot of reported infections only because we test a lot? Should we believe claims such as "if we test half as much, we have half as many cases"? This paper provides a precise answer as to what extent the reported number of infections is determined by the number of tests in a causal sense. The answer in a nutshell: if tests are undertaken because of symptoms, there is no causal effect from the number of tests on the number of reported infections. If tests are undertaken for other reasons (travellers, representative testing), the number of reported infections goes up simply because there is more testing. We show that time series on the number of tests and time series on reported infections do not allow one to obtain information about the true state of an epidemic. We also study the positive rate, the ratio of the number of positive tests to the total number of tests.
The positive rate would be informative if we undertook representative testing only. The positive rate is not informative about true epidemiological dynamics when there are several reasons for testing. Understanding the biases also allows us to understand how to correct for them. The paper presents a severity index for an epidemic that is unbiased. One can obtain this index in two ways: record the reason why a test was undertaken, or count only the COVID-19 cases. Such an index should be used when thinking about relaxing or reimposing public health measures. Testing is important for detecting infectious individuals; counting COVID-19 cases is important for private and public decision making. Structure of the paper. The next section presents the model. The following section shows biased and unbiased measures of the true but unobserved dynamics of an epidemic. It also studies the (lack of) informational content of time series on reported infections and time series on the number of tests, and the properties of the positive rate. It finally presents an unbiased severity index. The conclusion summarizes.

The basic assumption of our extension of the susceptible-infectious-removed (SIR) model consists of the belief that true infection dynamics are not observable. Simultaneous testing of an entire population or weekly representative testing is not feasible, at least given current technological, administrative and political constraints. This section therefore first describes the true but unobserved infection dynamics, then introduces tests into this framework, and finally computes the number of reported infections within this framework.

The classic SIR model. We study a population of fixed size P. Individuals can be in three states, as in a standard SIR model. The number of individuals that are susceptible to infection is denoted by S̃(t); this number is unobservable to the public and to health authorities. The numbers of infectious and removed (i.e. recovered or deceased) individuals are denoted by Ĩ(t) and R̃(t), respectively. We assume that individuals are immune and non-infectious after being removed. The number of susceptible individuals falls according to

dS̃(t)/dt = -c(t) S̃(t),

where r is a constant and c(t) ≡ r Ĩ(t) can be called the individual infection rate. It captures the idea that the risk of becoming infected is the greater, the higher the number of infectious individuals. Merging the individual recovery rate and death into one constant γ, the number of infectious individuals changes according to

dĨ(t)/dt = c(t) S̃(t) - γ Ĩ(t) = r Ĩ(t) S̃(t) - γ Ĩ(t).

Finally, as a residual, the number of removed individuals rises over time according to dR̃(t)/dt = γ Ĩ(t). We illustrate the dynamics in the following figure, employed also later on.

(We focus on the bias due to tests but discuss perceptions briefly below. We write expected numbers: the ordinary differential equations in SIR models could or should be understood as Kolmogorov backward equations describing means of continuous-time Markov chains; see Karlin and Taylor ( ) or Ross ( ) for an introduction. In the tradition of Diamond-Mortensen-Pissarides search and matching models in economics (Diamond, ; Mortensen, ; Pissarides, ), this individual infection rate can be expressed, capturing similar ideas, as a matching function: it should not only increase in the number of infectious individuals but also fall in the number of susceptible individuals; the latter reduces the probability that a random contact is infectious. See Donsimoni et al. ( a) for an implementation.)

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND International license. This version was posted October , .
Figure: the true infection dynamics (a simple SIR model).

We can easily integrate asymptomatic cases into this framework. We split the true number of infectious individuals described in ( ) into symptomatic and asymptomatic cases,

Ĩ(t) = Ĩ_symp(t) + Ĩ_asymp(t).

This allows us to capture the infection process in ( ) by two distinct differential equations,

dĨ_symp(t)/dt = c_symp(t) S̃(t) - γ Ĩ_symp(t),   dĨ_asymp(t)/dt = c_asymp(t) S̃(t) - γ Ĩ_asymp(t).

When these hold, ( ) holds as well. Individual infection rates are now defined as

c_symp(t) ≡ s r Ĩ(t),   c_asymp(t) ≡ (1 - s) r Ĩ(t).

The epidemiological idea behind these equations is simple. The rate with which one individual becomes infected is the same for everybody and given by r Ĩ(t): the higher the number of infectious individuals in society, Ĩ(t), the higher the rate with which one individual gets infected. It then depends on various, at this point partially unknown, physiological conditions of the infected individual whether they develop symptoms or not. We denote the share of individuals that develop symptoms by s; we assume this share is constant.

Epidemiological dynamics. This completes the description of the model. Let us now describe how we can understand the (unobserved) epidemiological dynamics. We start with some initial condition for S̃(t). A good candidate would be S̃(0) = P, i.e. the entire population of size P is susceptible to being infected and becoming infectious. Initially, there are very few infectious individuals; say there are two, Ĩ_symp(0) = Ĩ_asymp(0) = 1. Given the infection rates ( a) and ( b) and parameters, the number of infectious symptomatic and asymptomatic cases evolves according to ( ) and ( ). (The model neglects the effect of quarantine. If infectious individuals know about their status and therefore stay in quarantine, they should be removed from Ĩ(t) or at least get a lower weight in ( ).) Infectious individuals are removed from being infectious at a rate γ. The number of susceptible individuals follows ( ).
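As a concrete sketch, the dynamics just described can be integrated numerically. The following is a minimal Euler scheme for the SIR model with the symptomatic/asymptomatic split; all parameter values (population size P, contact parameter r, removal rate g, symptomatic share s, step size and horizon) are illustrative assumptions, not estimates from the paper.

```python
# Minimal Euler-scheme sketch of the unobserved SIR dynamics with a
# symptomatic share s. All parameter values (P, r, g, s, dt, steps) are
# illustrative assumptions, not estimates from the paper.

def simulate_sir(P=10000.0, r=2.5e-5, g=0.1, s=0.5, dt=0.1, steps=2000):
    """Integrate dS/dt = -c*S, dI_symp/dt = s*c*S - g*I_symp,
    dI_asymp/dt = (1-s)*c*S - g*I_asymp, dR/dt = g*I, with c = r*I."""
    S = P - 2.0
    I_symp, I_asymp, R = 1.0, 1.0, 0.0   # start with two infectious cases
    path = []
    for _ in range(steps):
        I = I_symp + I_asymp             # total (unobserved) infectious
        c = r * I                        # individual infection rate c(t)
        dS = -c * S
        dIs = s * c * S - g * I_symp     # share s develops symptoms
        dIa = (1.0 - s) * c * S - g * I_asymp
        dR = g * I
        S += dS * dt
        I_symp += dIs * dt
        I_asymp += dIa * dt
        R += dR * dt
        path.append((S, I_symp, I_asymp, R))
    return path
```

With these assumed numbers the implied basic reproduction number is r·P/g = 2.5, so the sketch produces a single epidemic wave that has essentially burnt out by the end of the simulated horizon.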
The epidemic is over with herd immunity, i.e. S̃ = 0 at some point (far in the future), or when recovery is sufficiently fast relative to inflows such that Ĩ = 0. The epidemic is heading towards an end when dĨ_symp(t)/dt < 0 and dĨ_asymp(t)/dt < 0, i.e. the number of infectious individuals falls. We abstract from public health measures and their effects (as studied, e.g., by Dehning et al. ( ) or Donsimoni et al. ( a)). If we wanted to include them, we could allow public health measures to affect r in the individual infection rate in ( ). To understand the effects of tests, we now introduce testing into our SIR model. The following figure displays all unobserved quantities in the model by dashed lines. The red circles represent the standard SIR model illustrated in the figure above. Testing can take place for a variety of reasons, as described in the test strategies adopted by various countries. The reasons for tests we take into account at this point are testing due to the presence of typical symptoms, representative testing, and testing travellers. While testing by symptoms and representative testing are well-defined, testing travellers is really only an example of a larger type of test. This example covers all tests that are applied to a group defined by certain characteristics which, however, are not representative of the population as a whole. Other examples of this non-representative testing include testing of soccer players, testing in retirement homes or of their visitors, testing in hot spots, or testing contact persons of infected individuals.

Figure: the SIR model with testing.

(It would be straightforward to assume, e.g., γ_asymp > γ_symp. This would capture the idea that asymptomatic cases recover faster than symptomatic cases. We ignore this extension, as the distinction would not affect our main argument. For analytical solutions of the classic SIR model, showing this aspect most clearly, see Harko et al. ( ) or Toda ( ). This condition is related to the widely discussed reproduction number.)
(Current applications of the SIR model also badly neglect the non-exponential distribution of time spent in the various states. It is well-known (e.g. Linton et al., , or Lauer et al., ) that incubation time is approximately lognormally distributed. It is now also understood that the reporting delay, both per se and added to incubation time, is non-exponentially distributed (Mitze et al., , App. A). The "chain trick" (Hurtado and Kirosingh, ) would allow one to implement this numerically. Meyer-Hermann ( ) employed a related structure but did not focus on densities of duration explicitly.)

Testing by symptoms. Individuals can catch many diseases (or, maybe better, sets of symptoms) indexed by i = 1, ..., N. For simplicity, the figure displays only two diseases (1 and 2) and COVID-19. The number of individuals that have a disease i and go to a doctor on day t is D_i(t). An individual becomes sick with an arrival rate and recovers from this specific sickness i with a rate specific to disease i. For clarity, we add a symptomatic SARS-CoV-2 infection to this list of diseases. The individual is infected and develops symptoms with rate c_symp(t), which we know from ( a), and is removed with rate γ_c. The number of symptomatic SARS-CoV-2 individuals is Ĩ_symp(t) from ( ). There is a certain probability p_i that a doctor performs a test, given a set of symptoms i. This probability reflects the subjective evaluation of the general practitioner (GP) of whether certain symptoms are likely to be related to SARS-CoV-2.
The probability of getting tested with a symptomatic SARS-CoV-2 infection (which the GP of course cannot diagnose without a test) is denoted by p_c. Hence, the (average or expected) number of tests that are performed at time t due to consulting a doctor is given by

T_d(t) = Σ_i T_d,i(t) + T_d,c(t) = Σ_i p_i D_i(t) + p_c Ĩ_symp(t).

The second equality replaces the number of tests by the number of sick individuals per disease times the probability that such an individual is tested. Note that, apart from population size P, the number of tests taken because of the presence of symptoms, T_d(t), is the first variable that is observed. If health authorities collected information on why a test was performed (the set of symptoms that can be observed by a GP), we would observe T_d,i(t) and T_d,c(t); if not, we observe T_d(t) only.

Tests can be performed for a variety of reasons. One consists in testing travellers, another in tests for scientific reasons, and so on. These tests are not related to symptoms. Taking the example of representative tests, the tests are applied to the population as a whole. The number of tests is chosen by public authorities, scientists, available funds, capacity considerations and other factors. In any case, it is independent of the infection characteristics of the population. Concerning representative testing, we denote the number of tests of this type undertaken at t by T_r(t). When it comes to travellers, we denote the number of tests by T_t(t). Summarizing, the total number of tests being undertaken in our model is given by the sum of tests due to symptoms, representative tests, and tests of travellers,

T(t) = T_d(t) + T_r(t) + T_t(t) = Σ_i p_i D_i(t) + p_c Ĩ_symp(t) + T_r(t) + T_t(t).

The second equality employs the number of tests by symptoms from ( ). The equation thereby re-emphasizes the endogeneity of the number of tests by symptoms: T_d(t) is determined by the number of symptoms occurring in a country or region. The other reasons for testing, T_r(t) and T_t(t), are exogenous; the latter are not determined by symptoms.
The number of reported infections at time t is given by the sum of reported infections split by the reasons for testing introduced above,

I(t) = I_d(t) + I_r(t) + I_t(t).

(In a broader interpretation, one could understand p_i and p_c as the probabilities that an individual gets tested and that they go to the doctor. No test is ever performed if individuals with symptoms stay at home.)

Testing by symptoms. As we are perfectly informed in our theoretical world about the (expected) number of CoV-2 infections and other diseases, we know that the number of positive CoV-2 tests is zero for all other diseases, I_d,i(t) = 0. Individuals have COVID-19-related symptoms because they caught a cold, have the flu, or for other reasons. The probability that a CoV-2-infected individual has a positive test is set equal to one (ignoring false negative tests). The number of positive tests for individuals that are infected with CoV-2 is therefore identical to the number of tests,

I_d,c(t) = T_d,c(t) = p_c Ĩ_symp(t).

Testing for other reasons. The probability that a representative test is positive is denoted by p_r(t). This probability is a function of the true underlying and unobserved infection dynamics. If the sample chosen is truly representative, then the probability of a positive test is given by

p_r(t) = Ĩ(t) / P.

Hence, representative tests make the true number Ĩ(t) of infectious individuals visible for the moment at which the tests are undertaken. This true number includes symptomatic and asymptomatic cases as in ( ). The probability that a test of travellers is positive depends on a multitude of determinants, among which are the region travelled to and the behaviour of the traveller. We denote the probability that such a test is positive by p_t(t).
We consider this probability to be exogenous to our analysis. A first step towards the total number of reported infections starts from ( ) and takes ( ) and ( ) into account,

I(t) = T_d,c(t) + p_r(t) T_r(t) + p_t(t) T_t(t).

This is also the expression displayed in the figure between 'CoV-2 tests' and 'confirmed infections'. Reported infections come from testing CoV-2 individuals with symptoms, from representative testing, and from other sources such as travellers. Employing T_d,c(t) = p_c Ĩ_symp(t) from ( ) and p_r(t) from ( ), the number of reported infections can be written as

I(t) = p_c Ĩ_symp(t) + (Ĩ(t)/P) T_r(t) + p_t(t) T_t(t).

Unbiased and biased reporting.
when only representative testing is undertaken, the number of reported infections (with t d c = t t (t) = ) from ( ) amounts to here, the number of reported infections, i (t) ; does rise in the number of tests, t r (t). the more we test, the higher the number of cases. yet, representative testing is (of course) the gold standard of testing. the ratio of positive cases to the number of tests yields the share of infections in the population, this share is driven byĨ (t) which shows that (i) representative testing provides a snapshot at this point in time t of the current epidemic dynamics and that (ii) representative testing provides a measure of overall infections, i.e. symptomatic and asymptomatic ones. we have seen two examples of unbiased reporting, one for symptomatic infections, one for all infections. they show that the question whether the number of reported cases rises in the number of tests is not as important as the question whether the type of testing provides useful information. in the …rst example, the claim that more tests increase the number of reported infections is meaningless as the number of tests is not chosen. in the second example, the number of positive cases rises in the number of tests but the ratio of these two quantities is highly informative. illustrating a bias now imagine several types of testing are undertaken simultaneously. the number of reported infections at t is then given by the full expression in ( ). consider …rst the case of symptomatic and representative testing. the number of reported cases (with t t (t) = ) is then this ratio is an example of the 'positive rate'. we will study it in more detail below. . cc-by-nc-nd . international license it is made available under a perpetuity. is the author/funder, who has granted medrxiv a license to display the preprint in (which was not certified by peer review) preprint the copyright holder for this this version posted october , . . 
Imagine someone (the government, researchers, others) decides to undertake more representative testing, i.e. T_r(t) goes up. This means that I(t) increases even though there is no change in the true number Ĩ_symp(t) of symptomatic cases. There is also no change in the true number Ĩ(t) of symptomatic and asymptomatic cases. Whoever perceives the reported number I(t) is led to believe that something fundamental has changed within the epidemiological dynamics. But this is of course not true: the reported number goes up simply because more tests were undertaken. Can we gain some information from this expression if we divide it by the number of tests T_r(t), as turned out to be very useful in the case of exclusive representative testing in ( )? We would obtain

I(t)/T_r(t) = p_c Ĩ_symp(t)/T_r(t) + Ĩ(t)/P,

which does contain the informative infection share Ĩ(t)/P as the second term on the right-hand side. But the first term does not have a meaningful interpretation, and neither does the expression as a whole.

Let us illustrate the potential bias by looking at the third type of testing considered here: testing travellers. The number of reported infections according to ( ) in the case of testing by symptoms and testing travellers reads

I(t) = p_c Ĩ_symp(t) + p_t(t) T_t(t).

We assume that no testing of travellers took place at the beginning of the pandemic. At some later point (as of t = in the figure below), the number of tests per day, T_t(t), increases linearly in time. To make this example as close as possible to public and common displays of infection dynamics, we look at "the curve" represented in the figure by the number of infected individuals, taking recovery into account. Looking at ( ) shows that a further source of bias, briefly mentioned earlier, can easily be identified. Imagine the general perception of GPs changes over time. A GP might initially be sceptical, i.e. p_c is low, then become more aware of the health risks implied by CoV-2, so that p_c goes up, and then maybe during some other period become more reluctant again.
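The traveller-testing construction just described can be sketched as follows: the true number of symptomatic infections decays while traveller tests ramp up linearly from some date, so reported cases rise again although the epidemic keeps waning. The decay rate, ramp slope and the probabilities p_c and p_t below are assumptions chosen for illustration only.

```python
# Sketch of the "second wave" artefact: the true number of symptomatic
# infections declines, but traveller tests ramp up linearly from some date,
# so reported cases I(t) = p_c*I_symp(t) + p_t*T_t(t) rise again. The decay
# rate, ramp slope and probabilities are assumptions chosen for illustration.
import math

def reported_series(days=200, p_c=0.8, p_t=0.02, ramp_start=100):
    reports, true_symp = [], []
    for t in range(days):
        I_symp = 1000.0 * math.exp(-0.03 * t)    # waning first wave
        T_t = 50.0 * max(0, t - ramp_start)      # traveller tests ramp up
        reports.append(p_c * I_symp + p_t * T_t)
        true_symp.append(I_symp)
    return reports, true_symp

reports, true_symp = reported_series()
# reports first fall with the epidemic and then rise again once the ramp
# starts, while true_symp decreases throughout.
```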
If these changes in individual perceptions are not entirely idiosyncratic but driven by the overall attention society pays to an epidemic, the number of reported infections would change independently of the true number of infections, Ĩ_symp(t) or Ĩ(t). (One could draw similar figures with new infections per day or with the number of individuals ever infected; the basic argument would remain the same.)

Looking at the figure, we first focus on the blue dashed curve for Ĩ_symp(t), the true number of symptomatic SARS-CoV-2 infections. We chose parameters such that the epidemic comes to a halt after around units of time (plotted on the horizontal axis). The green curve plots the number of reported infections I_d(t) from ( ), where testing takes place only in the presence of symptoms. Finally, the red curve is an example of a bias in the reported number of infections. It occurs as positive tests from testing travellers are added to tests by symptoms as in ( ). We see that this example displays what looks like a "second wave": reported numbers of infections go up again as of t = . By construction, however, this second wave is caused by a misinterpretation of the reported number of infections. Let us stress that we do not claim that the second wave is a statistical artefact due to testing strategies; it could be a statistical artefact, however. The conclusion shows how to obtain a severity index for an epidemic that is not prone to causing artificial results, and which data are needed to compute such an index.

A non-application to Germany. Consider the case of Germany. The figure below shows the number of tests per week and the number of reported infections.
When we look at the time series for all tests in this figure, it corresponds to T(t) from ( ). When we consider the reported number of infections per week in Germany, displayed in the right panel of the figure, this time series corresponds to I(t) from ( ). Can we conclude anything from these two time series about the true dynamics of the epidemic, i.e., can we draw conclusions about Ĩ_symp(t) or Ĩ(t)? Technically speaking, we have two equations, ( ) and ( ), reproduced here for convenience, in which the public has access to two variables, I(t) and T(t). It seems obvious that, unless we want to make a lot of untested assumptions, official statistics do not allow us to draw any conclusion about the severity of the epidemic. (One might be tempted to argue that data on the positive rate in ( ) should also be useful. As the positive rate is simply I(t) divided by T(t), it does not provide additional information.) The right-hand side contains at least three unknowns (e.g., tests classified by reason of testing: T_d(t), T_r(t), T_t(t)), and two equations with three unknowns usually do not have a unique solution. Hence, from currently available data, the true epidemic dynamics, Ĩ_symp(t) or Ĩ(t), cannot be understood. The positive rate. The positive rate is the ratio of confirmed infections to the number of tests, s_pos(t) ≡ I(t)/T(t). This statistic is often discussed in the media and elsewhere (see e.g. Our World in Data). In our model, ( ) and ( ) imply an expression for this rate. What does this positive rate tell us? Some argue that a rising positive rate is a sign of the epidemic 'getting worse'.
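The identification problem described above, two published series but at least three unknown test categories, can be illustrated with a toy example. All test counts and positive rates below are invented:

```python
# Toy illustration of the identification problem: only total tests
# T(t) and reported positives I(t) are published, while tests split
# into three unobserved categories (symptoms, representative,
# travellers). All numbers below are hypothetical.

P_D, P_R, P_T = 0.5, 0.25, 0.125  # assumed positive rates by test reason

def published(t_d, t_r, t_t):
    """Return the two observable numbers for one day: (T, I)."""
    total_tests = t_d + t_r + t_t
    positives = P_D * t_d + P_R * t_r + P_T * t_t
    return total_tests, positives

# Two different underlying situations (note the different amounts of
# symptom-driven testing, 600 vs. 620) ...
world_a = published(t_d=600, t_r=200, t_t=200)
world_b = published(t_d=620, t_r=140, t_t=240)

# ... yield identical published numbers: the observer cannot tell the
# two worlds apart from official statistics alone.
print(world_a, world_b)
```

Two equations with three unknowns: infinitely many decompositions are consistent with the same published pair.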
If we understand the latter as a rise in the number of unobserved infections, Ĩ(t), or in the number of infections with symptoms, Ĩ_symp(t), this statement is true if we undertake representative testing only, T_d(t) = T_t(t) = 0 as in ( ). In this case, when the observed positive rate s_pos(t) rises, this clearly indicates that the number of unobserved infections Ĩ(t) is higher. And so would be Ĩ_symp(t), whether individuals with symptoms go to a doctor or not, given the constant share s of symptomatic cases in ( a). Does this conclusion hold more generally, i.e., for the full expression ( ) when tests are undertaken for many reasons? Let us assume we only undertake tests due to symptoms and due to travelling, T_r(t) = 0. Then the positive rate ( ) takes a simple form. When we increase the number of tests for travellers, we find (see appendix) a condition for the sign of the effect. This result is easy to understand technically and has the usual structure: when we increase a summand (T_t(t) here) that appears in a fraction in both numerator and denominator, the sign of the derivative depends on the other summands (Σ_{i=1}^n p_i D_i(t) and p_c Ĩ_symp(t) in this case). As the summand is multiplied by p_t(t) in the numerator, this probability appears in the condition as well. In terms of epidemiological content, the derivative says that the positive rate can rise or fall when we increase the number of tests for travellers (or the related reasons mentioned below the figure). Testing increases the positive rate if the number of tests undertaken due to symptoms that are not CoV-2 related, Σ_{i=1}^n p_i D_i(t), exceeds the number of tests undertaken because of symptoms related to CoV-2, p_c Ĩ_symp(t), corrected for the probability that a traveller test is positive. (This paper is about conceptual issues related to finding an unbiased estimator for an unobserved time series. We ignore practical data problems. The latter include the fact that the number of tests displayed in the figure does not come from the same sample of tests that yields the number of infections in this figure. This would have to be taken into account in any application.) (The appendix shows that the positive rate is also informative and identical to Ĩ(t)p̃ if travellers, or visitors of retirement homes, or contact persons of a positively tested individual, or visitors of public events, etc., are representative. This assumption is questionable, however. Quantitatively speaking, representative testing is probably very small relative to other reasons for testing.) While an intuitive interpretation of this condition seems to be a challenge, the condition nevertheless conveys a clear message: it contradicts the claim that a rising positive rate implies a 'worse' epidemic state. We see that when T_t(t) goes up and the positive rate goes up, this does not mean anything regarding the dynamics of Ĩ_symp(t) or Ĩ(t). The same is true, of course, when T_t(t) goes up and the positive rate goes down. The positive rate is not informative. This finding also applies to a somewhat more precise statement of the above conjecture. Some claim that a rising positive rate in the presence of more tests does show that infections must go up. When we increase T_t(t), the number of tests goes up. When ( ) holds, the positive rate goes up. However, we do not learn anything about infections with or without symptoms: tests go up and the positive rate goes up simply because we test more. We now propose an index for the severity of an epidemic which is comparable over time. The model illustrated in the figure tells us what is needed: the index should be closely related to the number of symptoms in society. As the tests that capture these symptoms are those undertaken because of symptoms, the index is simply I_d(t) as in ( ). An alternative would consist in representative testing.
While the number of reported cases depends causally on the number of tests, the ratio of reported cases to the number of tests is an unbiased estimator of the true epidemic dynamics, as shown in ( ). As regular representative testing, say with a weekly frequency, is not feasible, the only realistic severity index is I_d(t) from ( ). Very simply speaking: if we desire a severity index for an epidemic that is comparable over time, we should test for CoV-2 but count COVID-19 cases. This should be done at all levels, starting from the GP, through hospital admissions and patients in intensive care, and finally counting deaths associated with COVID-19. What do these findings mean in practice? Data which are currently available to the public (see e.g. RKI for Germany or Our World in Data for many other countries in the world) do break down the total numbers of tests by origin (GP, hospital, and other) and by region. Unfortunately, this classification does not relate to the reason for testing, and the latter is required to infer the true infection dynamics. What should be done to quantify the relevance of the bias? Local health authorities in Germany collect the names of individuals with confirmed CoV-2 infections. If additional information on symptoms, which is already being collected (the reporting form allows for ticks on fever, coughing and the like), were made available to the public or to scientists, the bias could be computed easily. We currently know COVID-19 cases for intensive care in hospitals, but these data are not yet easily accessible (see https://www.intensivregister.de). While only a fraction of COVID-19 cases ends up in intensive care, this number might be more informative than CoV-2 infections. The number of deaths associated with COVID-19 is a further measure, as would be excess mortality.
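The proposed principle, test for CoV-2 but count COVID-19 cases, can be sketched in a few lines. All positive rates and test counts are invented; the point is only that the index I_d(t) does not react when non-symptom testing is scaled up:

```python
# Sketch of the proposed severity index I_d(t): count only positives
# from symptom-triggered tests, not from all tests. All rates and
# test counts below are hypothetical.

P_SYMPT = 0.25     # assumed positive rate of symptom-triggered tests
P_TRAVEL = 0.0625  # assumed positive rate of traveller tests

def raw_positive_count(t_sympt, t_travel):
    # What is currently reported: positives from *all* tests.
    return P_SYMPT * t_sympt + P_TRAVEL * t_travel

def severity_index(t_sympt, t_travel):
    # I_d(t): positives from symptom-triggered tests only
    # (the traveller-test argument is deliberately ignored).
    return P_SYMPT * t_sympt

# Identical epidemic state on both days (same symptom-driven testing),
# but traveller testing is scaled up massively on day 2.
day1 = (500, 100)
day2 = (500, 2000)
print(raw_positive_count(*day1), raw_positive_count(*day2))  # jumps up
print(severity_index(*day1), severity_index(*day2))          # unchanged
```

The raw count jumps between the two days although nothing epidemiological changed; the index stays constant, which is what "comparable over time" means here.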
While these are only partial measures of COVID-19 dynamics, COVID-19 measures (positive CoV-2 tests with symptoms, the number of all COVID-19 patients in hospitals, not just in intensive care) would provide a better basis for regional and local decision makers than CoV-2 infection measures. Until we one day know how strong the quantitative bias is, we can for now only hope that the bias is small. In that case, the guidance given to society by the focus on CoV-2 infections would have been correct. But even with a small bias, the focus on CoV-2 infections should stop. We know that it is not the perfect measure; it rather biases the expectation building (and the emotional reactions) of individuals. Hence, as soon as better COVID-19 measures are available, the CoV-2 measure should be replaced. The candidates are estimates of informative positive rates and (regional) time series on COVID-19 cases (and not CoV-2 infections). This would allow local politicians to base their decisions on intertemporally informative data, i.e., on local COVID-19 cases. True epidemic dynamics are unobserved. No country, no health authority and no scientist knows the true number of CoV-2 infections with or without symptoms for a given country. This is why testing is undertaken: testing is a means to measure true but unobserved epidemic dynamics. The counted number of CoV-2 infections is not relevant for decision making; what matters is the true number of CoV-2 infections. The infection and the corresponding disease spread when the true number of infections is high, not when the counted number of infections is high. We extend the classic SIR model to take symptomatic and asymptomatic cases into account. More importantly, we treat CoV-2 infections as unobserved in the SIR model and model testing explicitly. We allow for various reasons for testing and focus on testing due to symptoms, representative testing and testing travellers.
Testing travellers is an example of non-representative and non-symptom-related testing and includes the testing of sports professionals, testing in retirement homes or of their visitors, testing in hot spots, or testing contact persons of infected individuals. We show that the presence of various reasons for testing biases the number of confirmed CoV-2 infections over time. The number of CoV-2 infections cannot be compared intertemporally: we might observe more CoV-2 infections today simply because we test more, while the true number of infections stays constant or even falls. We do not claim, in any sense, that our findings have empirical relevance. We simply do not know, at least given the data that are easily accessible to the public and given the data everybody observes (the number of tests and the number of reported infections) and on which all public health decisions are based, what the true epidemic dynamics are. We all look at a watch and we know that it is wrong. But we do not know by how much it is wrong: it may be seconds, but it can also be hours. What are the positive lessons from this analysis? We propose an index which is unbiased over time. It is deceptively simple: count the number of COVID-19 cases, not the number of CoV-2 infections. If we knew the number of COVID-19 cases, i.e., CoV-2 infections with severe acute respiratory symptoms (SARS), then we would know at least one part of the epidemic dynamics (Ĩ_symp(t) in our model). Let us stress that our findings are not an argument against testing. Testing is important for identifying infectious individuals, who need to stay in quarantine in order to prevent the further spread of CoV-2 infections; this helps to reduce COVID-19 cases. Testing is important, but adding up confirmed infections from all sorts of tests is misleading. As long as the public focuses on all sources of positive cases, decisions by private individuals, firms, journalists, scientists and politicians are badly informed. Emotions, decisions and behaviour are misguided.
This cannot be good for public health. Decisions must be based on the number of COVID-19 cases.
Appendix. This section contains derivations for the main text. When we ignore testing by symptoms, the positive rate ( ) depends on the representative tests $T_r(t)$ and the traveller tests $T_t(t)$. Imagine travellers were representative; then $p_t(t) = \tilde{I}(t)\,\tilde{p}$ and the positive rate would read $s_{pos}(t) = \tilde{I}(t)\,\tilde{p}$, as in ( ) for representative testing. Under the assumption that travellers (or visitors of retirement homes, or contact persons of a positively tested individual, or visitors of public events) are representative, the positive rate would reflect the true epidemic dynamics as measured by $\tilde{I}(t)\,\tilde{p}$.
The derivative of the positive rate in ( ). We take only tests due to symptoms and due to travelling into account, $T_r(t) = 0$. Then the positive rate ( ) reads
$$ s_{pos}(t) = \frac{p_c \tilde{I}_{symp}(t) + p_t(t)\, T_t(t)}{\sum_{i=1}^{n} p_i D_i(t) + p_c \tilde{I}_{symp}(t) + T_t(t)} \equiv \frac{a + p_t T_t}{b + T_t}, $$
where the second equality defines $a$ and $b$ (for this appendix only) and suppresses time arguments to simplify notation. We compute
$$ \frac{d s_{pos}}{d T_t} = \frac{p_t (b + T_t) - (a + p_t T_t)}{(b + T_t)^2} > 0 \iff p_t (b + T_t) > a + p_t T_t \iff p_t b > a. $$
When we employ the definitions of $a$ and $b$ and add time arguments, we obtain the condition in the main text.
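The comparative static derived in the appendix can be checked numerically: with $s_{pos}(T_t) = (a + p_t T_t)/(b + T_t)$, the positive rate rises in the number of traveller tests exactly when $p_t b > a$. The parameter values below are invented for illustration:

```python
# Numerical check of the appendix condition: s_pos(T_t) is increasing
# in the number of traveller tests T_t exactly when p_t * b > a.
# Parameter values are hypothetical.

def s_pos(a, b, p_t, t_t):
    return (a + p_t * t_t) / (b + t_t)

def rises_with_traveller_tests(a, b, p_t):
    # Evaluate the rate at increasing traveller-test volumes and check
    # monotonicity (the derivative has a constant sign here).
    values = [s_pos(a, b, p_t, t_t) for t_t in (0, 100, 200, 300)]
    return all(x < y for x, y in zip(values, values[1:]))

# Case 1: p_t * b = 0.3 * 400 = 120 > a = 50  -> the rate rises.
case_up = rises_with_traveller_tests(a=50.0, b=400.0, p_t=0.3)
# Case 2: p_t * b = 0.05 * 400 = 20 < a = 50  -> the rate falls.
case_down = rises_with_traveller_tests(a=50.0, b=400.0, p_t=0.05)
print(case_up, case_down)
```

Both directions occur for reasonable-looking parameters, which is the point of the main text: the sign of the positive-rate movement carries no information about the epidemic itself.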
References:
- Effects of non-pharmaceutical interventions on COVID-19 cases, deaths, and demand for hospital services in the UK: a modelling study
- Inferring COVID-19 spreading rates and potential change points for case number forecasts
- Aggregate demand management in search equilibrium
- Projecting the spread of COVID-19 for Germany
- Should contact bans have been lifted more in Germany? A quantitative prediction of its effects
- Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
- Exact analytical solutions of the susceptible-infected-recovered (SIR) epidemic model and of the SIR model with equal death and birth rates
- The mathematics of infectious diseases
- Generalizations of the linear chain trick: incorporating more flexible dwell time distributions into mean field ODE models
- An introduction to stochastic modeling
- Proceedings of the Royal Society of London, Series A, Containing Papers of a Mathematical and Physical Character
- The incubation period of coronavirus disease (COVID-19) from publicly reported confirmed cases: estimation and application
- Incubation period and other epidemiological characteristics of novel coronavirus infections with right truncation: a statistical analysis of publicly available case data
- Uses and abuses of mathematics in biology
- Coronavirus (COVID-19) testing
- Mathematical models to guide pandemic response
- Estimation of the cancer risk induced by therapies targeting stem cell replication and treatment recommendations
- Face masks considerably reduce COVID-19 cases in Germany: a synthetic control method approach
- Property rights and efficiency in mating, racing, and related games
- Estimating unobserved SARS-CoV-2 infections in the United States
- Short-run equilibrium dynamics of unemployment, vacancies, and real wages
- Laborbasierte Surveillance SARS-CoV-2
- Stochastic processes
- Susceptible-infected-recovered (SIR) dynamics of COVID-19 and economic impact
Competing interests: there are no competing interests. Data and materials availability: all data and software code
key: cord- -tv nhojk authors: Eltoukhy, Abdelrahman E. E.; Shaban, Ibrahim Abdelfadeel; Chan, Felix T. S.; Abdel-Aal, Mohammad A. M. title: Data analytics for predicting COVID-19 cases in top affected countries: observations and recommendations date: - - journal: Int J Environ Res Public Health doi: . /ijerph sha: doc_id: cord_uid: tv nhojk
The outbreak of the novel coronavirus disease (COVID-19) has adversely affected many countries in the world. The unexpectedly large number of COVID-19 cases has disrupted the healthcare system in many countries and resulted in a shortage of bed spaces in hospitals. Consequently, predicting the number of COVID-19 cases is imperative for governments to take appropriate actions. The number of COVID-19 cases can be accurately predicted by considering historical data of reported cases alongside the external factors that affect the spread of the virus. In the literature, most of the existing prediction methods focus only on the historical data and overlook most of the external factors; hence, the number of COVID-19 cases is inaccurately predicted. Therefore, the main objective of this study is to simultaneously consider historical data and the external factors. This can be accomplished by adopting data analytics, which includes developing a nonlinear autoregressive exogenous input (NARX) neural network-based algorithm. The viability and superiority of the developed algorithm are demonstrated by conducting experiments using data collected for the top five affected countries in each continent. The results show an improved accuracy when compared with existing methods.
Moreover, the experiments are extended to make future predictions of the number of patients afflicted with COVID-19 during the period from August until September . By using such predictions, both the governments and the people in the affected countries can take appropriate measures to resume pre-epidemic activities. By January , the COVID-19 outbreak that originated in China had spread globally, with the number of infected persons rising to , , and a death toll of about , persons (https://www.worldometers.info/coronavirus, last accessed June ). With the global spread of this infectious disease, the World Health Organization (WHO) designated it as a pandemic. Besides the personal tragedies and casualties brought by this pandemic, its economic implications are significant. Most of the affected countries locked their borders and ordered the closure of factories, restaurants, big malls, and clubs. Consequently, the world is suffering from an economic recession, as the global economic losses are estimated to approach USD trillion (The Economist, "Covid carnage," March ). This dire situation motivates researchers to conduct research on COVID-19 focusing on two main areas: medicine and engineering. of the NARX neural network-based algorithm is described in Section . Sections and present the results of the experiments and the conclusions of the study, respectively. Before investigating the literature on COVID-19, we conducted a brief bibliographic search about COVID-19 for two purposes: firstly, to find out the number of research works published on COVID-19, and secondly, to identify the different research areas focusing on COVID-19. For those purposes, we used keywords like COVID-19, novel coronavirus, and Hubei pneumonia. It was found that more than research documents have been published on this topic. The figure shows the different types of research documents published on COVID-19.
By looking at the figure, it can be observed that the vast majority of published works are in the form of journal articles, whereas a small number of research works have appeared as conference papers. This is because most conferences were canceled due to the outbreak of the COVID-19 pandemic [ ]. The bibliographic search was then continued to identify the different research areas focusing on COVID-19. The findings are presented as a pie chart in the figure. As COVID-19 is a novel disease, it is noticed that the majority of these research works (around %) have focused on medicine, whereas the rest are distributed among different areas, like biochemistry, social sciences, and engineering. These research areas are discussed in the next section.
In this section, we discuss the different research areas that have considered COVID-19, including medicine and engineering. In the field of medicine, most of the early research on COVID-19 focused on understanding the symptoms of the disease [ ], characterizing it [ ], and finally estimating its incubation periods [ ]. In addition, Wang et al. [ ] have reported that the elderly are more likely to die from COVID-19 because of their underlying comorbidities [ ]. However, the virus attacks not only elderly people but also children [ ]; this means that everybody can be infected by COVID-19. Moreover, Zhuang et al. [ ] have shown that people infected with COVID-19 are asymptomatic in many cases. COVID-19 has a long incubation period of to days, and in many cases patients are asymptomatic [ ]; thus, it has a high infection rate. Therefore, it is of great importance to predict and estimate the number of people affected by COVID-19. This motivates researchers to focus on the engineering aspect of this disease, that is, the prediction of COVID-19 cases. Usually, prediction can be conducted using traditional statistical methods. For example, Remuzzi and Remuzzi [ ] and Tuite et al. [ ] have utilized statistical methods to predict the number of COVID-19 cases in Italy.
Similarly, the number of COVID-19 cases has been predicted in different countries/territories, such as Iran [ , ], Spain, and France [ ]. Besides the statistical methods mentioned above, mathematical modeling and simulation, including logistic growth and susceptible-infected-recovered (SIR) models, have been utilized to predict new COVID-19 cases in China [ ] and Saudi Arabia [ ]. Moreover, Papastefanopoulos et al. [ ] have investigated and compared the accuracy of six time-series forecasting approaches, namely ARIMA, the Holt-Winters additive model (HWAAS), TBAT, Facebook's Prophet, DeepAR, and N-BEATS, in predicting the progression of COVID-19. In a similar work, Hernandez-Matamoros et al. [ ] have developed an ARIMA model to predict the spread of the virus. The developed model consists of ARIMA parameters, including the population of the country, the number of infected cases, and polynomial functions. Ivorra et al. [ ] have proposed a new mathematical model for predicting the spread of the COVID-19 outbreak in China. The proposed model, the θ-SEIHRD model, considers a fraction θ of detected cases over the realized total infected cases. There are other studies that focus on collecting and analyzing posts related to COVID-19 from social media sites. This is because keyword search trends related to COVID-19 on search engines proved tremendously helpful in predicting and monitoring the spread of the virus outbreak. Qin et al. [ ] developed a prediction technique based on the lagged series of social media search indexes to forecast the number of new suspected COVID-19 cases. The considered social media search indexes include common COVID-19 symptoms such as dry cough, fever, pneumonia, etc. In another study by Li et al.
[ ], the daily trend data related to specific keyword searches, such as "coronavirus" and "pneumonia", have been acquired from the Google Trends, Baidu Index, and Sina Weibo Index search engines to investigate and monitor new COVID-19 cases. Li et al. [ ] have collected data on posts related to COVID-19 that were posted by Chinese users on Weibo, using an automated Python programming script. The collected data have been analyzed quantitatively and qualitatively in order to recognize trends and characterize key themes. Other applications using social media to predict COVID-19 cases have been reported by Shen et al. [ ] and Ayyoubzadeh et al. [ ]. The major drawback of statistical methods and mathematical modeling is their inability to consider massive amounts of data, which leads to poor prediction of the number of COVID-19 cases. This drawback can be avoided by using data analytics, which is explained in the next section. Data analytics is one of the efficient tools for discovering the relationships, trends, and other useful information existing in a body of data. The number of data analytics tools is large. Among these tools, the neural network is one of the most efficient in uncovering the relationship between an output (i.e., a response) and multiple inputs (i.e., indicators) [ , ]. This efficiency has been demonstrated in handling different applications, including stock price forecasting in the financial industry [ ], flight delay prediction in the aviation industry [ ] [ ] [ ], organ prediction in the healthcare sector [ ], and demand forecasting in the railway industry [ ]. These previous studies reveal the importance of data analytics for prediction purposes and motivate researchers to adopt data analytics in the domain of COVID-19. For example, Chen et al. [ ] utilized data analytics to predict the number of COVID-19 cases to avoid overwhelming hospital capacity in Taiwan.
The pitfall of this research work is that it has only focused on historical data of the number of COVID-19 cases while considering a limited number of factors, like travel and occupation. Another research work, by Zhou et al. [ ], coupled a geographic information system (GIS) and data analytics to identify the infection network of COVID-19. Additionally, machine learning and artificial intelligence tools have been utilized by many studies to develop COVID-19 prediction approaches. Wieczorek et al. [ ] have developed a forecasting model for new COVID-19 cases based on a deep neural network architecture trained with the Nadam model. However, the pitfall of this study is its focus on one dataset, the total number of confirmed COVID-19 cases, while overlooking many other factors. Magesh et al. [ ] have proposed an AI-based algorithm for predicting COVID-19 cases using a hybrid recurrent neural network (RNN) with a long short-term memory (LSTM) model. The authors have conducted their experiments while considering some demographic factors like sex, age, and temperature; indeed, many other social factors were not considered in their model. Pinter et al. [ ] have developed a hybrid machine learning approach to forecast COVID-19 cases in Hungary. The proposed hybrid approach encompasses an adaptive network-based fuzzy inference system and a multi-layered perceptron-imperialist competitive algorithm. A machine learning-based approach for predicting new COVID-19 cases has been proposed in the study by Tuli et al. [ ], who have used iterative weighting to fit a generalized inverse Weibull distribution. For an extensive study and more details about forecasting approaches for COVID-19, the interested reader is referred to the work by Bragazzi et al. [ ], who have reviewed the potential of applying artificial intelligence and big data based approaches in predicting and managing the COVID-19 pandemic outbreak.
These previous studies show successful applications of data analytics in multiple areas; therefore, it is reasonable to use data analytics in this study. From the above, it is clear that most data analytics studies have focused on historical data of confirmed COVID-19 cases, while some studies have considered a few factors like temperature and patient sex. Indeed, many other important external factors that affect the spread of the disease have been completely ignored. These important factors include population, median age index, public and private healthcare expenditure, air quality as a CO trend, seasonality as the month of data collection, the number of arrivals in the country/territory, and the education index. This results in a poor prediction of the number of COVID-19 cases. A thorough examination of the literature reveals some observations, which can be outlined as follows. Firstly, there is no previous study that simultaneously considers the historical data of the number of COVID-19 cases and most of the external factors that affect the spread of the virus. Secondly, there is no research work that provides future predictions of the number of COVID-19 cases using data analytics techniques. Therefore, efforts of governments to improve the healthcare systems in the affected countries are greatly hampered. Consequently, in this research work, we have tried to fill this gap by proposing a data analytics algorithm in which all the aforementioned features can be simultaneously considered. This paper has the following contributions. Firstly, in contrast to the existing approaches [ , ], which only focus on the historical data of persons infected with COVID-19, we propose a more robust approach. Our approach simultaneously considers the historical data of COVID-19 cases alongside most of the external factors that affect the spread of the disease.
These external factors include population, median age index, public and private healthcare expenditure, air quality as a CO trend, seasonality as the month of data collection, number of arrivals in the country/territory, and education index. To handle this large number of factors, we develop a nonlinear autoregressive exogenous input (NARX) neural network-based algorithm. This algorithm is chosen because it is well suited to handling time-dependent factors, such as the number of COVID-19 cases. Moreover, NARX algorithms have been successfully applied in different research areas, as shown in Section . Second, instead of predicting the number of COVID-19 cases in one or two countries [ ] [ ] [ ], we use our algorithm to predict the number of COVID-19 cases in multiple countries, including the top five affected countries in each continent. This is fruitful as it gives broad information about the spread of COVID-19 in different parts of the world. Lastly, it has been observed in the literature that most research papers have not provided future predictions of the number of COVID-19 cases. In contrast to these previous papers, we use the trained network produced by our algorithm to make future predictions of the number of COVID-19 cases. With such predictions, both the governments and the people of the affected countries can take appropriate measures to resume pre-epidemic activities. In this section, we present how data analytics can be used to predict the new daily cases of COVID-19. Instead of using traditional approaches, which either focus on historical data or assume a normal distribution for the number of daily cases, we use a data analytics approach. In particular, this approach can consider a massive amount of data, including historical data of daily cases as well as other external factors. The proposed methodology includes a nonlinear autoregressive exogenous input (NARX) neural network-based algorithm.
The main steps of this algorithm are as follows.

Step : Collecting the data. The data have been collected from online websites, including "Worldometers" [ ], "Our World in Data" [ ], "World Bank Open Data" [ ], and the official website of the World Health Organization (WHO). In addition, human development reports have been used to obtain other kinds of information, such as the median age and education index [ ]. The scope of this study includes collecting data for about countries/territories, focusing on two types of data: main data and external factors. The main data include the number of confirmed coronavirus disease cases per day, the number of deaths due to coronavirus disease per day, and the total number of confirmed cases [ , ]. The external factors, on the other hand, comprise the factors that affect the spread of the coronavirus disease. Note that the data have been collected for about days, from December until August , which naturally sets the size of the data.

Step : Preprocessing the data. While collecting the data, it was observed that data were not always available for all countries/territories. To alleviate this situation, a refinement was performed by ruling out any countries/territories suffering from data unavailability. This resulted in removing around countries/territories, so that only countries/territories were considered. Our preliminary goal was to predict the new cases for all countries/territories. However, this is not practical for two reasons. First, it is not possible to present all the results in a single study due to page limitations. Second, it is computationally expensive to run the algorithm for all countries/territories. For these reasons, we have limited our scope to the most affected countries in each continent: the top five affected countries/territories from each continent have been considered. More details are presented in Section .
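The refinement described in the preprocessing step can be sketched as follows. This is a minimal illustration in Python (the paper's experiments were coded in MATLAB); the country names, field names, and values are hypothetical.

```python
def drop_incomplete(records, required_fields):
    """Keep only the countries/territories that have a value for every
    required field; any country with a missing indicator is ruled out,
    mirroring the refinement of the preprocessing step."""
    return {
        country: data
        for country, data in records.items()
        if all(data.get(field) is not None for field in required_fields)
    }

# Hypothetical raw records: "B" lacks a population value and "C" lacks an
# education index, so both would be ruled out.
raw = {
    "A": {"daily_cases": [5, 8], "population": 1_000_000, "education_index": 0.7},
    "B": {"daily_cases": [2, 3], "population": None, "education_index": 0.6},
    "C": {"daily_cases": [9, 4], "population": 3_000_000},
}
kept = drop_incomplete(raw, ["daily_cases", "population", "education_index"])
```

Only country "A" survives the refinement in this toy example.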
Step : Identifying the input sets. These sets contain historical data alongside the external factors, and can be outlined as follows: (i) the main set, which includes two main pieces of information, namely the number of deaths due to coronavirus disease per day and the total number of confirmed cases; (ii) the external factor set, which comprises the factors that affect the spread of the coronavirus, including population [ ], median age index [ ], public and private healthcare expenditure [ ], air quality as a CO trend [ ], number of arrivals in the countries/territories [ ], and education index [ ]. There is another factor that should be considered, called seasonality. Before incorporating this factor into the model, it should be clarified that, in most countries, we can find cities with different seasons. For example, Iran has four seasons across its different cities [ ]. Other examples include the USA, China, Saudi Arabia, and Egypt. This observation indicates that, to consider seasonality using the seasons themselves as a factor, cities would have to be the scope of the study. Since the scope of this study is countries rather than cities, the seasonality factor cannot be represented by the seasons themselves in our algorithm. As a compromise, the month of data collection is used to capture seasonality in the proposed algorithm. It should be noted that the main data on daily COVID-19 cases have been collected from the website "Our World in Data" and corrected against the website "Worldometers". Next, the data have been double-checked and refined using the data from the official website of the WHO. In addition, because the considered predictors are diverse and not available in one database, their data have been collected from several websites.
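One way to picture how the main set, the external factor set, and the month-as-seasonality proxy combine into a single input sample is the sketch below. This is an illustrative Python fragment, not the paper's MATLAB pipeline; all factor names and values are hypothetical.

```python
from datetime import date

# Hypothetical static external factors kept per country after refinement.
FACTORS = ["population", "median_age", "health_expenditure",
           "arrivals", "education_index"]

def input_vector(country_factors, deaths_per_day, total_confirmed, day):
    """Assemble one input sample: external factors, then the main data
    (deaths/day and total confirmed cases), with the month of data
    collection standing in for seasonality."""
    features = [country_factors[name] for name in FACTORS]
    features += [deaths_per_day, total_confirmed]
    features.append(day.month)  # month of collection as the seasonality proxy
    return features

sample = input_vector(
    {"population": 10_000_000, "median_age": 32.0, "health_expenditure": 5.5,
     "arrivals": 2_000_000, "education_index": 0.71},
    deaths_per_day=12, total_confirmed=3456, day=date(2020, 7, 15),
)
```

The last entry of the vector carries the month (here 7), so samples collected in different months are distinguishable to the network.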
In further detail, the data collected from the website "World Bank Open Data" are the median age, number of arrivals, and health expenditure as a percentage of GDP [ ], while the education index has been collected from the United Nations Development Programme [ ].

Step : Testing hypotheses using regression analysis. Since the goal of our study is to accurately predict the number of COVID-19 cases, we should focus on the most influential external factors. To do so, a hypothesis test using regression analysis is conducted for each external factor. After formulating the hypotheses, the regression analysis is conducted and the p-value is calculated. If the p-value is below . , the significance level in this study, we reject the null hypothesis H0 in favor of the alternative hypothesis H1; if the p-value is greater than or equal to . , we cannot reject the null hypothesis H0. In this way, the significant factors have been selected, namely all of the previous external factors except public health expenditure and air quality as a CO trend. More details about the hypothesis tests using regression analysis are given in Section .

Step : Designing the neural network structure. We have utilized the feedforward time-delay neural network, as this structure has been commonly used in the literature due to its efficiency [ ]. This network is composed of three main layers: input, hidden, and output. Regarding the activation function, the sigmoid function has been selected because it efficiently captures the nonlinear relationships among multiple factors.

Step : Training the neural network. To achieve this goal, the supervised learning method has been adopted. In this method, % of the data have been used for training, while the rest have been reserved for validation and testing.

Step : Predicting new cases of COVID-19.
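The significance test behind this factor-selection step can be sketched as a slope test for a simple linear regression. The paper uses Minitab's regression p-values; the fragment below is a self-contained Python stand-in that rejects H0 when the slope's t statistic exceeds the large-sample 5% critical value (about 1.96), and all data are toy values.

```python
import math

def slope_t_statistic(x, y):
    """t statistic for the slope of a simple linear regression y = a + b*x.
    Under H0 (slope b = 0) and a large sample, |t| > ~1.96 corresponds to a
    p-value below the 5% significance level."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    ssr = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(ssr / (n - 2) / sxx)  # standard error of the slope
    return b / se_b

def select_significant(factors, cases, t_crit=1.96):
    """Keep the external factors whose regression slope against the case
    counts is significant, mirroring the H0-rejection rule."""
    return [name for name, values in factors.items()
            if abs(slope_t_statistic(values, cases)) > t_crit]

# Toy data: one factor drives the case counts, the other is unrelated noise.
x = list(range(1, 11))
cases = [2 * i + (0.1 if i % 2 == 0 else -0.1) for i in x]
significant = select_significant({"population": x, "parity": [i % 2 for i in x]}, cases)
```

Here only the linearly related factor survives the test, just as public health expenditure and the CO trend were filtered out in the paper.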
The trained network has then been used to predict the new cases of COVID-19 in the period from August until September . Figure represents the structure of the neural network diagrammatically. It should be noted that the neural network is a common artificial intelligence technique. It borrows the idea of information flow between brain neurons, represented as a network of arrows and nodes: arrows carry the input details and the output information, whereas nodes stand for the neurons. The nodes, or neurons, receive the input data and then process them to give suitable outputs. This straightforward movement of data from several input points is the simplest way to obtain an output; such a network structure is called a feedforward neural network (FF), which is the one used in our algorithm. A feedforward neural network is either a single-layer feedforward neural network, as shown on the left-hand side of Figure , or a multiple-layer feedforward neural network, as shown on the right-hand side of Figure . In a multiple-layer feedforward neural network, the input layer is indirectly connected to the output layer by means of hidden layers (i.e., each layer in the network is connected to the next layer). In particular, the input is connected to the first hidden layer, this layer is connected to the next hidden layer, and the connections continue forward in this sequence until reaching the output layer. As mentioned earlier, the neural network structure adopted in this research is a feedforward neural network with multiple layers, and its type is the NARX neural network. The analysis of this network is based on time-series modeling [ ], meaning that it uses data obtained at successive times in the past in order to predict data in the future.
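The layer-to-layer flow described above can be made concrete with a minimal forward pass. This is a generic sketch of a multiple-layer feedforward network with sigmoid activations (as Step on network design specifies), written in Python rather than the paper's MATLAB; all weights are illustrative, not trained values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer with sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def feedforward(inputs, layers):
    """Pass the input through each layer in sequence: the input feeds the
    first hidden layer, each hidden layer feeds the next, and the final
    layer produces the output."""
    for weights, biases in layers:
        inputs = dense(inputs, weights, biases)
    return inputs

# Illustrative topology: 3 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                              # output layer
]
out = feedforward([1.0, 2.0, 3.0], layers)
```

The sigmoid keeps every activation in (0, 1), which is why the network's outputs are bounded regardless of the input scale.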
Therefore, it is commonly used as a predicting tool in different fields, such as predicting daily solar radiation [ ], predicting day-ahead electricity prices [ ], and predicting bearing life [ ]. As in any neural network, input data are processed in the NARX neural network through the nodes using the following function:

a(t) = f(a(t-1), a(t-2), ..., a(t-n_a), b(t-1), b(t-2), ..., b(t-n_b)),

where a(t) is the output of the NARX neural network at time t; the values a(t-1), a(t-2), ..., a(t-n_a) are the past outputs of the network, with n_a the number of delays in the output; and the values b(t-1), b(t-2), ..., b(t-n_b) are the inputs of the NARX neural network, with n_b the number of delays in the inputs. From this equation, it is clear that, to obtain an output a(t) at time t, not only the input data but also the past output data are used. For example, to predict tomorrow's number of COVID-19 cases a(t), the input data as well as the predicted data of today and of the past few days, a(t-1), a(t-2), ..., a(t-n_a), are used.

After presenting the algorithm, some questions might be asked. One of these questions is: "Is seasonality a factor that might have an impact on the results?"
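The closed-loop behavior of the NARX recursion, where tomorrow's prediction feeds on today's prediction, can be sketched as below. The trained mapping f is replaced here by a hand-written stand-in function so the fragment stays self-contained; the delays n_a and n_b and all numbers are illustrative, not the paper's fitted values.

```python
def narx_predict(model, a_history, b_series, steps, n_a, n_b):
    """Closed-loop NARX prediction: each new output a(t) is computed from
    the last n_a outputs (which, beyond the history, are the model's own
    earlier predictions) and the last n_b exogenous inputs b."""
    a = list(a_history)
    for _ in range(steps):
        lag_a = a[-n_a:]                       # a(t-1) ... a(t-n_a)
        lag_b = b_series[len(a) - n_b:len(a)]  # b(t-1) ... b(t-n_b)
        a.append(model(lag_a, lag_b))
    return a[len(a_history):]

# Stand-in for the trained network f: a weighted average of the two most
# recent case counts plus a small exogenous effect (purely illustrative).
model = lambda lag_a, lag_b: 0.6 * lag_a[-1] + 0.4 * lag_a[-2] + 0.01 * lag_b[-1]

future = narx_predict(model, a_history=[100, 110, 121],
                      b_series=[50] * 10, steps=3, n_a=2, n_b=1)
```

Note that after the first step, `lag_a` already contains a predicted value, which is exactly the feedback property the equation above expresses.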
Before answering this question, it should be clarified that, in most countries, we can find cities with different seasons. For example, Iran has four seasons across its different cities [ ]. Other examples include the USA, China, Saudi Arabia, and Egypt. This observation indicates that, to consider seasonality using the seasons as a factor, cities would have to be the scope of the study. Since the scope of this study is countries rather than cities, we have used the month of data collection as a measure of seasonality to overcome this situation. The proposed algorithm deals with variable population sizes, meaning that countries with larger populations influence the algorithm more than countries with smaller populations, which can induce high uncertainty in the predictions.
The question here is: "How is this fluctuation accounted for in the algorithm?" To avoid high fluctuation in the predictions, the best parameter setting for the algorithm should be used when adopting the proposed algorithm for prediction [ ]. For this purpose, the Taguchi method has been adopted, as shown in Section .

After presenting the NARX neural network-based algorithm that helps in predicting new cases of COVID-19, it is necessary to demonstrate the effectiveness of the algorithm. For this purpose, experiments are conducted on the top five affected countries from each continent, as shown in Table . Note that the experiments of this case study have been performed on a laptop with an Intel i CPU at . GHz clock speed and GB of RAM, running Windows; the algorithm is coded in MATLAB a. The results of the experiments are presented in the following subsections.

Before conducting the experiments of this study, we collected the external factors that seem to affect the spread of the coronavirus. These factors include population, median age index, public and private healthcare expenditure, air quality as a CO trend, seasonality as the month of data collection, number of arrivals in the countries/territories, and education index. Since the goal of our study is to accurately predict the number of COVID-19 cases, only the most influential external factors should be considered. Toward this goal, hypothesis tests using regression analysis have been conducted in the Minitab software [ , ], in which the number of COVID-19 cases and the related external factors have been collected for about countries/territories. Note that the regression analysis has been conducted with a significance level of % [ ]. The results of the hypothesis tests are summarized in Table .
Looking at the results presented in Table , it is noticed that the null hypothesis H0 related to hypotheses # , , , , , and is rejected, and the alternative hypothesis H1 is accepted. This means that external factors such as population, median age index, private healthcare expenditure, number of arrivals, education index, and month of data collection have a significant effect on the number of COVID-19 cases, because the p-values of these factors, shown in boldface, are lower than the significance level of % used in this study. In contrast, the null hypothesis H0 related to hypotheses # and cannot be rejected, meaning that the alternative hypothesis H1 is rejected. This indicates that external factors such as public healthcare expenditure and the CO trend do not have a significant effect on the number of COVID-19 cases. Based on these hypothesis tests, our experiments are conducted considering only the significant external factors, i.e., all the factors except public healthcare expenditure and the CO trend. After selecting the most influential factors, the prediction experiments can be conducted. Before doing so, however, the best parameter setting of the NARX neural network-based algorithm should be determined. To this end, the most influential parameters are selected and their corresponding levels are determined [ , , ], as shown in Table . To select the best parameter settings, the Taguchi method has been utilized, as it is one of the most effective tools for determining the best parameter settings by applying an orthogonal array and signal-to-noise (S/N) ratios [ , [ ] [ ] [ ]. The orthogonal array approach can be defined as an economical approach commonly adopted with the objective of minimizing the number of conducted experiments. The S/N ratio is a performance indicator of the quality of each conducted experiment.
Since our Taguchi experiment includes four parameters with three levels each, the orthogonal array L has been selected; the experiments have been conducted using the Minitab software. Figure illustrates the average S/N ratio of each selected parameter at each level when using our proposed algorithm. Since our algorithm aims at predicting COVID-19 cases, the objective in this study is to minimize the error between the predicted and real values. Accordingly, the parameter levels are selected based on the smaller-the-better criterion, i.e., for each parameter, the level whose average S/N ratio corresponds to the smaller error is selected. Applying this criterion to Figure , the best levels for the four parameters are obtained; these levels appear in boldface in Table .

In this section, we report the performance of the NARX neural network-based algorithm. The performance of the algorithm has been evaluated using a commonly used performance indicator, the root mean square error (RMSE) [ ]. The RMSE reflects the error between the real and predicted values of COVID-19 cases. In addition, the correlation has been calculated to indicate the closeness of the predicted data to the observed data. To select a suitable correlation test, the normality of the observed and predicted COVID-19 cases was checked; it was found that neither the observed nor the predicted COVID-19 cases are normally distributed, which naturally leads to using the Spearman correlation test [ ] [ ] [ ]. To measure the model uncertainty, the error standard deviation has been calculated [ ]. Details of the results are presented in Table . Looking at Table , it is noticed that the value of the RMSE is low in most African and Asian countries, because the predicted and real case counts there are low compared with other countries.
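The level-selection logic of the Taguchi step can be sketched as follows. In the standard smaller-the-better formulation, S/N = -10 log10(mean(y^2)) over the error responses y, so smaller errors yield a higher (less negative) S/N, and the preferred level is the one whose S/N corresponds to the smallest errors. The fragment below is an illustrative Python sketch under that standard convention, with toy error values, not the paper's Minitab experiment.

```python
import math

def sn_smaller_is_better(responses):
    """Taguchi signal-to-noise ratio for the smaller-the-better criterion:
    S/N = -10 * log10(mean(y^2)). Smaller errors give a higher S/N."""
    return -10.0 * math.log10(sum(y * y for y in responses) / len(responses))

def best_level(level_responses):
    """Pick, for one parameter of the orthogonal-array experiment, the level
    whose average S/N ratio corresponds to the smallest errors."""
    sn = {level: sn_smaller_is_better(errs)
          for level, errs in level_responses.items()}
    return max(sn, key=sn.get)

# Hypothetical prediction errors observed at three levels of one parameter.
best = best_level({1: [4.0, 5.0], 2: [1.0, 2.0], 3: [3.0, 3.5]})
```

Level 2 wins here because its responses (the prediction errors) are the smallest, which is exactly what the smaller-the-better criterion rewards.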
It is also observed that the value of the RMSE is rather large in some countries, such as the USA, Spain, and China. At first glance, such an RMSE value can give the impression of a large difference between the predicted and real values, implying poor performance of the proposed algorithm. However, the RMSE is not that large in relative terms, because the predicted and real case counts themselves are very large in those countries; the performance of the algorithm therefore remains reasonable when handling large numbers of COVID-19 cases. To summarize, the proposed algorithm produces a large RMSE when the real and predicted values are large, and vice versa, which indicates the consistency and robustness of the proposed algorithm. Regarding the correlation, it is observed that the correlation coefficient is larger than . in all countries, with a p-value of zero. This means a strong, significant positive correlation between the observed and predicted data, which indicates closeness of the predicted data to the observed data.
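The two evaluation metrics used here, RMSE and the Spearman rank correlation, can be sketched in a few lines. This is a self-contained Python illustration (the Spearman coefficient is computed as the Pearson correlation of the average ranks, handling ties); the observed/predicted values are toy numbers, not the paper's results.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between real and predicted case counts."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx)
                    * sum((b - my) ** 2 for b in ry))
    return num / den

observed = [120, 150, 90, 200, 170]
predicted = [115, 160, 95, 190, 175]
error = rmse(observed, predicted)
rho = spearman(observed, predicted)
```

In this toy example the predictions preserve the ordering of the observations, so rho is exactly 1 even though the RMSE is nonzero, which is why the paper reports both metrics.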
This reflects the high accuracy of the proposed algorithm. The error standard deviation indicates low error variability in countries characterized by low numbers of COVID-19 cases, and vice versa, which confirms the stability and reliability of the proposed algorithm. After presenting the performance of the proposed algorithm, a question might be asked: "What is the advantage of the proposed algorithm over the existing traditional method in the literature?" To answer this question, our experiments have been extended to compare our proposed algorithm with the traditional method represented by the study of Chen et al. [ ]. Both studies have the same objective, namely predicting the number of COVID-19 cases, but they differ in the factors they consider. The study by Chen et al. [ ] focuses only on historical data of the number of COVID-19 cases and considers a limited number of factors, such as travel and occupation. In contrast, our study has the same focus as the study of Chen et al. [ ] but additionally considers many external factors overlooked in their study. These factors include population, median age index, public healthcare expenditure, private healthcare expenditure, air quality as a CO trend, education index, and seasonality as the month of data collection. The experimental results obtained from both approaches are summarized in Table . The results in Table show that the NARX neural network-based algorithm is more accurate than the traditional method. This outperformance is due to considering more of the factors that affect the spread of COVID-19, such as the external factors of population, health expenditure, and others, which results in accurate predictions for the proposed algorithm. In contrast, the traditional method focuses only on the historical data and neglects many external factors.
Hence, some important factors that affect the spread of the virus are neglected, finally leading to a poor prediction of the number of COVID-19 cases. This section establishes that the proposed algorithm gives improved results compared with the traditional method, which further affirms the value of using this algorithm in real practice. So far, we have presented the performance of the proposed algorithm and its advantage over existing methods. Still, some questions have not been answered, such as "What next in the future?", "How can we benefit from the algorithm in predicting future cases of COVID-19?", and "When will COVID-19 end?". Answering these questions necessitates extending our experiments, in which we use the trained network to predict the number of future cases of COVID-19. In the experiments of the previous sections, we observed that, in the countries that controlled the spread of COVID-19, the peak in the number of daily cases appeared after - months. This observation has been taken as a reference to predict future COVID-19 cases in the countries where the disease is yet to peak. The future prediction covers about two months, the period from August until September . The results of these experiments are presented in Figures - , which represent the future predictions in Europe, North and South America, Asia, and Africa, respectively. After presenting the future prediction of COVID-19 cases in European countries, we have some observations, which are outlined as follows: • In most European countries, like Italy, the UK, and Russia, the number of cases had already reached its peak before our future prediction. Based on this observation, we predict that the number of future COVID-19 cases will decrease gradually during the period from August until September . It is worth mentioning that our predicted reduction in the number of COVID-19 cases appears during August .
The abovementioned reduction is due to strict compliance with the precaution guidelines established by the WHO. • In contrast to most European countries, the situations in Spain and France are quite similar to each other, as the number of cases has risen recently and formed another peak. In Spain, the second peak has already formed; therefore, our algorithm predicts a gradual decrease during the period from August until September . In France, the algorithm predicts a slight increase followed by a gradual decrease in the number of cases during the same period. Looking at Figure , some observations can be summarized as follows: • In the case of the USA, the number of COVID-19 cases had already formed its second peak by the middle of July . Therefore, we have predicted a slow reduction in the number of future COVID-19 cases; this reduction appears in the USA during the first half of August . It should be noted that this slow reduction is due to the dysfunction experienced in the healthcare system of the USA [ ]. • In the case of Brazil, we observe that the peak has been reached. We then predict that the number of future COVID-19 cases will experience a wavy reduction during the period from August until September . It is important to mention that the predicted slow reduction in future COVID-19 cases agrees with the actual reduction realized during August . This wavy reduction is due to most Brazilian residents overlooking the social distancing instructions [ ]. • In the case of Canada, the number of COVID-19 cases has been past its peak since the beginning of May . Therefore, it is reasonable to predict a gradual decrease in the number of cases during the period from August until September . • In the case of Peru, we observe an increase in the number of COVID-19 cases by July . We then predict that, during the period from August until September , this increase will continue for a while before the number of future COVID-19 cases declines.
This increase is due to the bad behavior of the people, so that the situation became even worse during those days [ ]. • In the case of Ecuador, we predict that the number of future COVID-19 cases will keep its wavy motion, meaning that the number will tend to zero and then increase again. This wave appears because there is no transparency in the reported number of COVID-19 cases, meaning that the government has not disclosed the real number of COVID-19 cases [ ].
After presenting the results for the Asian countries, we have the following observations: • China is one of the few cases that has fully controlled the situation; this is apparent as the number of future COVID-19 cases is almost zero. Thanks to the Chinese government and medical system, strict quarantine measures have been implemented, finally leading to overcoming this hard time. • The situation in Turkey, Iran, and Saudi Arabia is like that of the European countries that have reached the peak: our algorithm predicts a gradual reduction in the number of future COVID-19 cases, which was realized during August . • The situation in India is completely different from the rest of the Asian countries, because India is yet to reach its peak. Based on this observation, we predict that the increase in the number of COVID-19 cases will continue during the period from August until September . Looking at Figure , we can draw the following observations: • Most African countries are quite similar, except Morocco, as the peak of COVID-19 cases has already appeared. It is predicted that the future number of COVID-19 cases will decrease gradually during August and September ; so far, the trend of our predicted graph has been realized in these African countries. • In contrast to the above-mentioned African countries, in Morocco the peak of COVID-19 cases has not yet appeared. Therefore, the future number of COVID-19 cases will continue to increase during August and September ; this increase was realized during the first half of August . Based on the above observations, we outline some recommendations, which are as follows: • It is recommended that people living in the USA, Brazil, Ecuador, Peru, and India strictly follow the precaution instructions recommended by the WHO.
This includes quarantining infected people, whereas healthy people should stay home to avoid COVID-19 infection and, when they go out, follow the rules of social distancing.
• It is recommended for the governments and healthcare systems of countries like the USA and Brazil to raise their private and public health expenditures to control the number of future COVID-19 cases. In addition, penalties may be applied to people who violate the instructions recommended by the WHO.
• It is recommended for the government of Ecuador to release the correct number of COVID-19 cases so that the people can understand the severity of the situation and obey the health guidelines released by the WHO.
• In the countries that have fully or partially controlled COVID-19, like China, it is recommended for the people to keep following the medical instructions. Otherwise, COVID-19 may come back in a mutated form, causing another global pandemic.

This study investigates how new COVID-19 cases can be predicted while considering the historical data of COVID-19 cases alongside the external factors that affect the spread of the virus. To do so, data analytics was adopted by developing a nonlinear autoregressive exogenous input (NARX) neural network-based algorithm. The effectiveness and superiority of the developed algorithm are demonstrated by conducting experiments using data collected for the top five affected countries in each continent. The results show an improved accuracy compared with the existing methods. Moreover, the experiments are extended to make future predictions of COVID-19 cases during the period from August until September 2020. The predicted COVID-19 cases help in providing some recommendations for both the governments and the people of the affected countries. This study provides a novel way of predicting the number of COVID-19 cases. However, there are some avenues that might be suitable for future directions. For example, predicting the number of deaths could be one direction.
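The NARX set-up described above, predicting today's cases from lagged case counts plus lagged external factors, can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the data are synthetic, the lag length and horizon are arbitrary choices, and ordinary least squares replaces the neural network so the example stays dependency-light; the recursive feedback of predictions is the part that mirrors the NARX structure.

```python
import numpy as np

def make_narx_dataset(y, x, lags):
    """Supervised pairs for a NARX model: predict y[t] from the
    previous `lags` case counts and `lags` exogenous values."""
    feats, targets = [], []
    for t in range(lags, len(y)):
        feats.append(np.concatenate(([1.0], y[t - lags:t], x[t - lags:t])))
        targets.append(y[t])
    return np.array(feats), np.array(targets)

rng = np.random.default_rng(0)
y = np.cumsum(rng.poisson(5.0, 120)).astype(float)  # synthetic cumulative cases
x = rng.normal(size=120)                            # one synthetic external factor

lags = 7
X, T = make_narx_dataset(y, x, lags)
w, *_ = np.linalg.lstsq(X, T, rcond=None)  # linear stand-in for the neural net

# Recursive multi-step forecast: each prediction is fed back as an
# autoregressive input for the next step, as in NARX forecasting.
y_hist, x_hist = list(y[-lags:]), list(x[-lags:])
future_x = np.zeros(14)  # assumed future values of the external factor
forecast = []
for h in range(14):
    f = np.concatenate(([1.0], y_hist[-lags:], x_hist[-lags:]))
    forecast.append(float(f @ w))
    y_hist.append(forecast[-1])
    x_hist.append(future_x[h])
```

Feeding predictions back as inputs is what lets a one-step model produce the multi-week forecasts reported above, at the cost of compounding error over the horizon.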
Another direction might be predicting the number of recovered people. One of the fruitful ideas is predicting the number of COVID-19 cases in the top affected cities while considering the seasonality factor.

References (titles as extracted):

• Clinical characteristics of hospitalized patients with novel coronavirus-infected pneumonia in
• Diagnosis, treatment, and prevention of novel coronavirus infection in children: experts' consensus statement
• The incubation period of coronavirus disease (COVID-19) from publicly reported confirmed cases: estimation and application
• Incubation period of novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China
• COVID-19 and Italy: what next?
• Estimation of COVID-19 outbreak size in Italy
• Estimation of coronavirus disease (COVID-19) burden and potential for international dissemination of infection from Iran
• Preliminary estimation of the novel coronavirus disease (COVID-19) cases in Iran: a modelling analysis based on overseas cases and air travel data
• Estimation of COVID-19 prevalence in Italy, Spain, and France
• Estimation of the final size of the COVID-19 epidemic
• Predicting the epidemiological outbreak of the coronavirus disease (COVID-19) in Saudi Arabia
• COVID-19: a comparison of time series methods to forecast percentage of active cases per
• Forecasting of COVID per regions using ARIMA models and polynomial functions
• Prediction of number of cases of novel coronavirus (COVID-19) using social media search index
• Retrospective analysis of the possibility of predicting the COVID-19 outbreak from internet searches and social media data
• Data mining and content analysis of the Chinese social media platform Weibo during the early COVID-19 outbreak: retrospective observational infoveillance study
• Big data integration and analytics to prevent a potential hospital outbreak of COVID-19 in Taiwan
• COVID-19: challenges to GIS with big data
• Neural network powered COVID-19 spread forecasting model
• COVID-19 pandemic prediction for Hungary
• How big data and artificial intelligence can help better
manage the COVID-19 pandemic
• "COVID- " or "COVID- " or "Hubei pneumonia" or
• Updated understanding of the outbreak of novel coronavirus (2019-nCoV) in Wuhan
• Novel coronavirus (COVID-19) epidemic: what are the risks for older patients?
• Novel coronavirus infection in hospitalized infants under 1 year of age in China
• Mathematical modeling of the spread of the coronavirus disease (COVID-19) taking into account the undetected infections: the case of China
• Using reports of symptoms and diagnoses on social media to predict COVID-19 case counts in mainland China: observational infoveillance study
• Predicting COVID-19 incidence through analysis of Google Trends data in Iran: data mining and deep learning pilot study
• Neuro-adaptive cooperative tracking control with prescribed performance of unknown higher-order nonlinear multi-agent systems
• Adaptive synchronisation of unknown nonlinear networked systems with prescribed performance
• A deep increasing-decreasing-linear neural network for financial time series prediction
• Data analytics in managing aircraft routing and maintenance staffing with price competition by a Stackelberg-Nash game model
• Robust aircraft maintenance routing problem using a turn-around time reduction approach
• Cascading delay risk of airline workforce deployments with crew pairing and schedule optimization
• A healthcare analytic methodology of data envelopment analysis and artificial neural networks for the prediction of organ recipient functional status
• Neural network based temporal feature models for short-term railway passenger demand forecasting
• Pervasive computing in the context of COVID-19 prediction with AI-based algorithms
• Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing
• COVID-19 coronavirus pandemic
• Our World in Data:
the COVID-19 pandemic slide deck
• World Bank Open Data
• United Nations Development Programme
• Country of four seasons: Iran, a world inside a country
• Long-term time series prediction with the NARX network: an empirical evaluation
• Time series prediction based on NARX neural networks: an advanced approach
• A nonlinear autoregressive exogenous (NARX) neural network model for the prediction of the daily direct solar radiation
• On the importance of the long-term seasonal component in day-ahead electricity price forecasting with NARX neural networks
• The use of MD-CUMSUM and NARX neural network for anticipating the remaining useful life of bearings
• Joint optimization using a leader-follower Stackelberg game for coordinated configuration of stochastic operational aircraft maintenance routing and maintenance staffing
• A regression-based approach for testing significance of "just-about-right" variable penalties
• Robust design of multilayer feedforward neural networks: an experimental approach
• Optimizing the parameters of multilayered feedforward neural networks through Taguchi design of experiments
• Modelling three-echelon warm-water fish supply chain: a bi-level optimization approach under Nash-Cournot equilibrium
• Temperature and precipitation associate with COVID-19 new daily cases: a correlation study between weather and COVID-19 pandemic in Oslo
• Impact of weather on COVID-19 pandemic in Turkey
• Correlation between weather and COVID-19 pandemic in Jakarta
• A model with a solution algorithm for the operational aircraft maintenance routing problem
• Heuristic approaches for operational aircraft maintenance routing problem with maximum flying hours and man-power availability considerations
• Why the US has struggled to tackle a growing crisis. The Guardian
• Coronavirus: Brazil records third-highest COVID-19 infection level
• Peru's coronavirus response was 'right on time', so why isn't it working? The Guardian
• COVID-19 numbers are bad in Ecuador.
The president says the real story is even worse.

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. The authors wish to acknowledge the support of King Fahd University of Petroleum and Minerals. The authors declare no conflict of interest. Int. J. Environ. Res. Public Health.

key: cord- -le t l g; authors: nan; title: Pathological Society of Great Britain and Ireland, meeting, July; journal: J Pathol; doi: . /path.; cord_uid: le t l g

Claire M Allen, D M Hansell, Mary N Sheppard. Departments of Diagnostic Radiology and Lung Pathology, Royal Brompton National Heart and Lung Hospital, Sydney Street, London. Percutaneous fine needle biopsy is an established diagnostic technique for lung lesions. A firm diagnosis of benign versus malignant is often achieved, but histological interpretation of small fragments or groups of cells is difficult. Manual cutting (Tru-Cut) needles provide a superior histological specimen but are associated with a high complication rate and have been mainly used for pleural lesions. This is the first prospective study to assess the feasibility of obtaining histological samples from lung lesions using a powered cutting needle (Biopty gun). We have biopsied patients using the Biopty gun; there were no major complications. Histological diagnoses were obtained in patients ( malignant, benign). The malignant lesions identified included non-small cell carcinomas, adenocarcinomas, squamous cell carcinomas, bronchioloalveolar cell carcinomas, B-cell lymphomas, small cell carcinoma, atypical carcinoid and metastatic breast carcinoma. The benign lesions included sarcoidosis, cryptogenic organising pneumonia, Wegener's granulomatosis, and resolving pneumonias with chronic inflammation. The radiologist's assessment of the amount of tissue obtained correlated with good histology.
Of the three false negative specimens obtained, the radiologist noted the inadequacy of the samples in two cases, and the third was a geographic miss. Percutaneous biopsy using the Biopty gun is a simple and effective means of obtaining good quality histological material from lung parenchyma, with a high degree of diagnostic accuracy for both benign and malignant lesions.

Inorganic particulate matter in "normal" lung: a study using light microscopy (LM), scanning electron microscopy (SEM), and energy dispersive X-ray analysis (EDXA). Department of Pathology, South Street, London, Ontario, Canada. The possible association between inhaled inorganic matter and some cases of usual interstitial fibrosis (UIP) has been of longstanding interest to pathologists and clinicians. Unfortunately, most analytical techniques are impracticable in the context of a pathology service laboratory. In an attempt to find a practicable solution to this problem, and to establish a baseline for the inorganic particle load in a "normal" population, a study was undertaken using techniques which would be available in most large pathology laboratories. Nine cases were selected; sections were taken from each case (central and peripheral portions of the lower, middle and upper lobes) and examined with the SEM and EDXA, and compared with the LM appearance, to determine the particle type and distribution in a "normal" population. A wide range of inorganic matter was identified, corresponding to silica, aluminium and magnesium silicates, rutile and alumina-"like" particles varying from < µm to µm in size; in addition, trace elements including zinc and cadmium were detected. An increase in the number of particles was also noted in areas of fibrosis, which were present in two cases (old inflammatory disease), presumably related to problems in particle clearance. The findings of this pilot study suggest that although the SEM and EDXA will likely prove useful tools in the evaluation of lung biopsy specimens,
the finding of inorganic material in cases of UIP must be interpreted with caution. The cases came from the autopsy service, had no known exposure to inorganic dust, and sections were taken from the right lung of each.

Pulmonary adenomatosis has been described as a distinctive pathological change seen in the lungs of experimental animals, often in association with exposure to inhaled carcinogens. Morphologically, the lesion is a circumscribed area of replacement of normal alveolar lining cells by a taller, more glandular type of epithelium, usually without significant cytological atypia. We describe cases in which a similar change was seen as an incidental finding in resection specimens for primary pulmonary adenocarcinoma. The lesions (usually multiple and each mm or less in diameter) were identified in lung parenchyma at a distance from the tumour and consisted of thickened alveolar walls lined by prominent, distinctly atypical cells, morphologically similar to type I pneumocytes and cytologically different from the associated tumour. Reactive changes in lung involved by obstructive pneumonitis were not included in this series. All of the associated tumours were peripheral adenocarcinomas and all showed a pattern of alveolar wall spread at the tumour periphery. Clinically, of the patients were female and all were smokers or ex-smokers. The significance of this lesion in the histogenesis of primary pulmonary adenocarcinoma is, as yet, unclear.

Department of Histopathology, St Richard's Hospital, Chichester, West Sussex. A series of consecutive personally conducted autopsies in patients dying suddenly outside hospital, where the death was reported to HM Coroner, is presented. Cot deaths were excluded. In cases the death had been reported because the attending doctor was unwilling to issue a death certificate; in the other cases, death was not due to natural causes and followed suicide or an accident. Of the former group, there were unsuspected major findings in patients ( . %),
that is to say, findings which were either the cause of death or would have led to admission to hospital for assessment and possible treatment had they been discovered in life. The largest group were cardiovascular ( ), but there were cases of unsuspected malignant disease. There were only five major unsuspected findings in the group of unnatural deaths ( %), and these were in older subjects (mean age years). These results highlight the loss of teaching material in the coroner's system, at a time when hospital postmortem rates are in universal decline. This material would be of value to medical students and to pathologists in training, and largely comes from cases of little or no medico-legal significance.

A total of postmortem specimens referred to the Royal Victoria Hospital electron microscopy unit during the years – have been reviewed. This comprised % of the total number of service-related cases referred to the unit. Of the cases referred, were examined and recorded in detail; the remainder did not undergo electron microscopic examination for various reasons, such as a conclusive diagnosis being reached by light microscopy alone, or semi-thin sections showing severe tissue autolysis. The most common tissues referred for examination were lung, kidney, liver, brain and heart. The range of EM studies carried out included transmission electron microscopy (TEM), scanning electron microscopy (SEM) and X-ray microanalysis on SEM. The electron micrographs were reviewed with respect to tissue preservation, and this was correlated with the time interval between death and autopsy. Electron microscopy was considered, on review,
to have been diagnostically useful in % of the cases in which it was deployed.

C S Herrington, A K Graham, K Cooper, J McGee. It was shown previously that the discrimination of human papillomavirus (HPV) types and by NISH in archival biopsy material requires different conditions from those predicted by conventional solution kinetic analysis. The parameter Tm' (tissue Tm) was defined in order to describe these differences. In this study, these principles were extended to the discrimination of HPV types , and in cases of CIN. The results of NISH analysis were compared with both immunohistochemistry for viral capsid protein and PCR typing. These data demonstrate that cross-hybridisation of high risk viral types occurs in clinical lesions under conventional hybridisation and stringency washing conditions. This cross-hybridisation is not due to the presence of viral capsid protein and is more likely to be a reflection of the endpoint used in NISH, i.e. the presence of visible signal. Practically, multiple NISH signals due to closely related probes in archival material are not indicative of multiple HPV infection unless they are present either in morphologically discrete areas of the biopsy or their presence has been confirmed by another molecular technique. More generally, the presence of a signal in NISH using a particular probe does not imply that the identity of the target nucleic acid is that of the probe.

A survey of medical students in the autopsy room shows that % thought faces should be covered during autopsy, and % thought that genitals should be covered. About half the group were indifferent to both proposals. % thought the patient's identity should be concealed from observers. Three students thought there should be no conversation at all; % thought that conversation should be limited to procedures and findings, and % thought it should be limited to professional matters; conversely, % thought there should be no limits on the topics discussed. Nearly 50% thought students should be encouraged to assist but not pressured into doing so;
% thought that all should be compelled to assist, but % thought that students should only observe and not be allowed to assist. % of students thought that they should only observe autopsies on patients they had clerked, whereas % thought they should only be present at autopsies on patients they had not clerked. Finally, % thought that relations should give specific permission before students observed autopsies, whereas % thought not.

A quantitative study of the effects of fibroblast growth factor on wound strength and cellularity. Fibroblast growth factor (FGF) has potent angiogenic and fibrogenic effects and is implicated in the formation of granulation tissue and in healing. Few attempts have been made to quantify these effects in vivo. We have studied a rat skin wound model using red cell ghosts as an FGF vehicle. Tensiometry of the wounds showed a maximal effect of the FGF after seven days, when the wound strength was % above that in controls (p < ). This effect had disappeared by fourteen days. Computerized image analysis using a Joyce-Loebl Mini Magiscan measured the total nuclear content of areas in the wounds, permitting a topographic analysis of cellularity versus distance from the wound centre. The cellularity effects showed a different time course from wound strength: a % increase at four days and a % decrease at seven days, relative to controls (both p < . ). At twelve days the cellularity effect was still significant, at a % decrease, but by twenty-one days it had disappeared. The results suggest that FGF causes an early transient increase in cellularity and a more rapid increase in wound strength; most of these cells are macrophages and fibroblasts, suggesting a connection between these two observations.

The adhesion molecules β-integrin (CD ), β-integrin (CD ), and intercellular adhesion molecule-1 (ICAM-1, CD ) are essential to the intimate co-operation of antigen presenting cells (APCs), T cells and keratinocytes. Cyclosporin,
which is an effective treatment for psoriasis, may cause immunosuppression by altering antigen presentation. We have performed a quantitative immunohistochemical assessment of the effect of low dose cyclosporin on the expression of β-integrin, β-integrin and ICAM-1 in the epidermis in chronic plaque psoriasis. Staining levels were compared with clinical response as assessed by the Psoriasis Area Severity Index (PASI score). β-integrin and ICAM-1 expression on keratinocytes were not altered by therapy, but there was a significant decrease in the mean levels of β-positive large dendritic cells (APCs) within the epidermis. β-integrin was not expressed by keratinocytes. There was a strong correlation between β expression and PASI score after three months on cyclosporin and one month off therapy. These results indicate that β staining on large dendritic epidermal cells correlates with the clinical response to cyclosporin.

Previous studies enumerating silver stained nucleolar organiser regions in problematic cutaneous melanocytic lesions have yielded inconsistent, but generally favourable, results. It seems probable that such inconsistencies arise largely from differences in fixation, staining and counting strategies. Our group, having devised improved methods of AgNOR staining and counting, is now able to re-examine the potential role of AgNORs in borderline lesions. Pilot work demonstrated potentially significant differences in AgNOR dispersal between benign and malignant lesions, and in this study both AgNOR numbers and dispersal patterns have been evaluated. A range of melanocytic lesions, including banal naevi, dysplastic naevi, typical melanomas, Spitz naevi, atypical Spitz lesions and minimal deviation melanomas, was collected. Virtually all of the unusual lesions had been diagnosed by specialists in dermatopathology. Using a combined assessment of AgNOR numbers and dispersal, it proved possible to discriminate borderline lesions from banal naevi and typical melanomas.
Benign borderline lesions, such as Spitz naevi, possess numerous NORs but display tight clustering, in contrast to malignant melanomas, where NOR dispersal is a prominent feature. Discriminating one type of borderline lesion from another could not be achieved; however, in practice this distinction is probably less important than assigning an arbitrary specimen to a benign or malignant group. A larger prospective trial of AgNORs in melanocytic lesions is currently in progress.

The patients had a mean total exposure of pack-years. Although p was expressed more commonly in adenocarcinoma ( of ) and squamous carcinoma ( of ) than in small cell tumours ( % of ), this could be accounted for by the smoking history, since patients with non-small cell carcinoma smoked more (a mean of pack-years) than those with small cell lesions (a mean of pack-years). There was no relationship between p expression and survival.

S A Sun, J R Gosney. Department of Pathology, University of Liverpool, PO Box , Liverpool. Activation of the c-myc oncogene, with overexpression of its oncoprotein product, occurs in bronchial malignancies of all types, but has been most extensively studied in small cell carcinoma, where its overexpression in cultured lines has been associated with the development of fast-growing variants which lose much of their endocrine phenotype and alter their morphology. It has been suggested that these variant lines might be the equivalent of the large cell bronchial endocrine carcinomas sometimes seen in vivo, but this is not proven. We have used the Myc -E monoclonal antibody and the avidin-biotin technique to study the pattern of expression of the p oncoprotein product of the c-myc gene in tumour deposits of twelve subjects coming to necropsy with disseminated small cell carcinoma, in an attempt to relate it to morphological variability and metastatic site. Although expression bore no relationship to morphological variation, it often differed markedly from site to site.
Whereas parts of the primary tumour strongly overexpressed the protein in all but one subject, there was considerable variability between secondary deposits, possibly indicating a relationship between c-myc expression and propensity for metastasis to certain locations.

Tumour growth rate is a key parameter of neoplastic aggression, and is determined by the balance of cell gain and loss. Apoptosis is a major mode of tumour cell loss, but little is known of its regulation. Differences in tumour growth conferred by HPV types and were studied in a rat fibroblast model system. Immortalised cells were transfected with the two HPV expression vectors, either alone or with activated c-Ha-ras1. Monoclonal cell lines were established, and their vector DNA content was confirmed by PCR. The tumour cell lines differed in their growth properties in vivo and in vitro. HPV -containing tumours were larger and showed less apoptosis than those containing HPV , although both showed more apoptosis than the nodules formed by the parent fibroblasts alone. In all tumours the presence of ras greatly reduced apoptosis and increased the growth rate. Very similar properties were observed in culture, and apoptotic rates showed a strong inverse correlation with rates of net cell growth. HPVs appeared to stimulate tumour cell apoptosis, but this was suppressed by ras. The lower apoptosis associated with HPV compared with HPV may partly explain the more aggressive phenotype of cervical cancers containing HPV .

It has become apparent that a number of molecules may be expressed by resting or quiescent cells and are lost with the transition into the cell cycle. In addition to being of biological interest, such molecules may prove useful as operational markers of quiescent cell populations in histological material and allow the further characterisation of cellular subpopulations. One such molecule is statin, a kD protein previously reported to be expressed only by cells in G0.
We have shown by laser confocal fluorescence microscopy that the statin antigen is associated with the nuclear envelope, with a distribution similar to that of the nuclear lamins. Using a monoclonal antibody that recognises statin, we have defined the tissue distribution of statin immunoreactivity in a range of continuously renewing, conditionally renewing and non-renewing tissues. The distribution of immunoreactivity is essentially as would be expected of a marker of quiescent cells. In contrast, in established pancreatic carcinoma and other epithelial cell lines, we have found statin immunoreactivity in cycling cells using biotinylated Ki /statin double labelling techniques. In conclusion, statin immunoreactivity in normal tissues correlates with quiescence, but this relationship is lost, at least in vitro.

Loss of cell-cell and cell-substratum adhesion are important factors during tumour progression. Tumour promoters are compounds which, although not carcinogenic themselves, increase the frequency of tumour development in animals previously exposed to carcinogens. We have used the tumour promoter TPA on cultured human renal epithelial cells to mimic neoplastic transformation. Following TPA treatment we have examined the distribution of vinculin, β integrin and actin within the treated cells by fluorescence microscopy. Treatment of the renal epithelium with TPA causes a rounding up of the cells and a loss of adhesion to either laminin or fibronectin substrata. Fluorescent microscopic examination reveals a progressive loss of reactivity for vinculin and β integrin within focal contacts. These changes are accompanied by a redistribution of the actin microfilaments from orientated bundles of stress fibres to a circumferential arrangement. These changes, a reduction in focal contact components and a disorganised actin cytoskeleton, mimic the changes we have previously described in renal carcinoma.
Glutathione S-transferases (GSTs) are a diverse group of enzymes with an important role in the metabolism of foreign compounds, including some carcinogens and cancer chemotherapeutic agents. Increased expression of GST pi is seen in many animal and human cancers and is associated with resistance to carboplatin and cisplatin in human lung cancer cell lines. GSTs are involved in steroid hormone transport and metabolism and have a role in the complex metabolic relationships between Sertoli cells and germ cells. We have studied GST isoenzyme expression by an immunohistochemical method in stage I teratomata, stage II teratomata, stage I seminoma, cases of intratubular germ cell neoplasia (ITGCN) and a group of cryptorchid and normal testes. The stage II teratomata had been treated with cisplatin based therapy, and both primary and post-therapy metastatic tumour tissue were studied. GST alpha expression correlated with morphological evidence of epithelial differentiation in teratomata. There was no difference in GST expression between stage I and stage II primary testicular tumours, nor between primary testicular and post-therapy metastatic tumours. GST expression did not correlate with survival. GST pi was strongly expressed in the neoplastic germ cells of ITGCN but was weak or negative in normal germ cells. This may be significant in view of the potential for later contralateral tumour development in patients treated by cisplatin based therapy. In summary, GST expression in testicular germ cell tumours reflected their differentiation and appeared to be unrelated to therapy and subsequent survival.

A case illustrating the usefulness of electron-microscopic examination of fine needle aspirates is described. Fine needle aspiration was performed on a subcutaneous nodule in the chest wall of an elderly man who was suspected to be suffering from bronchogenic carcinoma on clinical and radiological findings. The smears were not diagnostic,
but in view of the history of asbestos exposure, the needle washings were submitted for electron-microscopic examination, which showed mesothelial differentiation, and a diagnosis of metastatic malignant mesothelioma was suggested. An autopsy performed eight months later confirmed the diagnosis of malignant mesothelioma.

T Dorman, A T Ismail, B Curran, M Leader. Pathology Department, Royal College of Surgeons in Ireland, Dublin. Some soft tissue lesions show many histologically malignant features, such as hypercellularity and high mitotic counts, but clinically follow a benign course. Desmoid tumours and fibromatoses can be densely cellular but are usually quite bland histologically, and are associated with infiltrative margins and repeated local recurrences. In addition, irradiated tissue often contains many cells with cytological features of malignancy. This study examines the ploidy of such lesions by both of the currently available techniques, using disaggregated formalin fixed paraffin embedded tissue. Sixteen cases were studied: NF, DFSPs, DTs, fibromatoses (including palmar, plantar, retroperitoneal and soft tissue) and miscellaneous cases including leiomyoblastoma, inflammatory pseudosarcoma and plexiform histiocytoma. One malignant fibrous histiocytoma and one paraganglioma were included as possible positive controls. Cases containing numerous irradiation fibroblasts were also analysed. All 'pseudosarcomas' were euploid by both image analysis and flow cytometry. The malignant fibrous histiocytoma was aneuploid by image analysis and flow cytometry, and the paraganglioma was aneuploid by image analysis and tetraploid by flow cytometry. The conclusions of this study are that pseudosarcomas are euploid and that aneuploidy would appear to be confined to malignant tumours of mesenchymal origin.

Tumours from a range of sites were collected to study their morphology and antigenic profile after staining for the epithelial markers CAM 5.2 (low molecular weight keratins) and LP34 (high molecular weight keratins), the general melanoma and neural markers S100, neurone specific enolase (NSE),
protein gene product 9.5 (PGP 9.5), the melanoma specific marker HMB45, the neuroepithelial markers Leu7 and glial fibrillary acidic protein (GFAP), and the intermediate filaments vimentin (Vim) and neurofilament (NF). The following tumour types emerged: pure spindle cell type (n = ); pure polygonal cell tumours (n = ), composed of pleomorphic cells (n = ) or uniform cells with an alveolar (n = ) or sheet (n = ) arrangement; and mixed pleomorphic spindle and polygonal cell tumours (n = ). Two tumours proved to be lymphomas. An intraepithelial melanocytic component could be established in cases within adjacent respiratory or squamous metaplastic epithelium. In cases, nerve trunks associated with tumour contained increased numbers of atypical Schwann cells. Consistently expressed was S100.

Colon specific antigen (CSA) and colon ovarian tumour antigen (COTA): CSA is a heat stable mucin associated antigen present in normal colonic epithelial cells and expressed in greater quantities in colonic adenocarcinomas. COTA is a heat stable antigen present in colonic neoplasia and mucinous ovarian tumours but not in normal colonic epithelium. Sections from primary adenocarcinomas ( ovarian, colo-rectal, gastric, breast, oesophageal, prostatic, pancreatic, endometrial and one gallbladder) and two adenocarcinomas metastatic to the liver were incubated with anti-CSA and anti-COTA antibodies using the PAP technique, with positive and negative controls. While anti-CSA positivity was seen in of the colonic adenocarcinomas ( weak and strong), it was also seen in of the ovarian ( weak and strong), of the gastric ( weak and strong), of the breast ( weak and strong), of the oesophageal ( weak and strong), of the prostatic (all weak), of the pancreatic ( weak and strong), of the endometrial (all weak) and the gallbladder adenocarcinoma. Both liver metastatic adenocarcinomas were negative. In addition, while anti-COTA staining was positive in of the colonic adenocarcinomas ( weak and strong) and of the ovarian adenocarcinomas, it was also positive in of the gastric ( weak and 2 strong),
breast (weak and strong), oesophageal (all weak), prostatic (all weak), pancreatic (weak and strong), endometrial (all weak) and the case of gallbladder adenocarcinoma (strong). Both liver metastatic adenocarcinomas were negative. The conclusion of this study is that anti-CSA and anti-COTA are not adequately specific in the identification of a colonic or ovarian origin of an adenocarcinoma and cannot reliably be applied to the identification of a metastatic adenocarcinoma of unknown primary site.

The early intra-epithelial changes of adenocarcinoma of the nose and paranasal sinuses were sought in the histological sections from high-grade adenocarcinomas and cylindric cell carcinomas from this region, none of which had had previous radiotherapy. Of the cases, several came from wood-workers, one from a polisher and one from a plasterer. Attention was directed to the non-neoplastic epithelium of the surface and of the sero-mucinous glands. In cases of adenocarcinoma and of cylindric cell carcinoma, surface changes were detected. These took the form of hyperplasia of goblet cells of the respiratory epithelium accompanied by dysplastic changes of other epithelial cells; in the latter, the cells became enlarged and irregular in shape with large hyperchromatic nuclei and prominent nucleoli. In the majority of lesions these changes were isolated in the epithelium. Changes of these types were never seen in seromucinous gland epithelium. In addition, similar changes were not noted in the covering respiratory epithelium in cases of nasal polyps, even when severely inflamed.

Photodynamic therapy with haematoporphyrin derivative-sensitised tumour tissue is undergoing evaluation in widespread and resistant papillary tumours and in cases of widespread severe dysplasia/carcinoma in situ. A review of patients (female and male, pre- and post-therapy) has revealed no change in histological grade and stage in the majority, progression to invasion in one case and a reduction to mild urothelial atypia alone in one case.
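The specificity argument in the CSA/COTA study above reduces to simple proportions: a marker assigns a colonic origin reliably only if non-colonic tumours rarely stain. A minimal sketch of that arithmetic, using hypothetical counts since the abstract's figures were lost in extraction:

```python
# Hedged illustration of why a marker can be "not adequately specific":
# specificity for a colonic origin is the fraction of non-colonic
# adenocarcinomas that do NOT stain. All counts below are invented
# placeholders, not data from the abstract.

def specificity(non_target_negative, non_target_total):
    """Fraction of non-target tumours that are marker-negative."""
    return non_target_negative / non_target_total

# e.g. if 14 of 40 non-colonic adenocarcinomas stained with anti-CSA:
print(specificity(40 - 14, 40))  # 0.65
```

A specificity this far below 1.0 is why positive staining alone cannot identify the primary site of a metastatic adenocarcinoma.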
Local tissue changes following treatment were limited to oedema and an acute inflammatory reaction. No metaplastic, stromal fibroblast or nerve changes were seen. Variations in bladder size, both increased and diminished, were encountered. The histological grading of TCC following PDT shows little improvement, but studies are in progress to improve light delivery in the bladder, and hence improve treatment outcome.

A. Armour, A. S. Jack, Department of Pathology, University of Leeds, Leeds LS2 9JT. Follicular lymphoma is a disease characterised by widespread lymph node involvement, usually present at the time of presentation. Properties of lymphocyte homing and circulation appear to be mediated by a variety of cell adhesion molecules. The aim of this study was to compare normal nodal lymphocytes with the neoplastic population in follicular lymphoma. In all cases all of the neoplastic follicles expressed ICAM-1, but this varied from only a few cells to % of cells within a follicle. This positivity included dendritic reticulum cells, germinal centre cells and neoplastic lymphocytes. Leu-8 (lymphocyte homing receptor) showed an inverse pattern of expression in all but a few cases. The ICAM-1-positive/Leu-8-negative phenotype is a feature of germinal centres and scattered paracortical blasts in reactive nodes, although there is a uniformity of ICAM-1 expression in germinal centres which is not apparent in any follicular lymphoma case. This study showed a loss of ICAM-1 and increased Leu-8 expression by neoplastic lymphocytes within follicles; this may relate to the propensity of this disease to spread widely throughout the lymphatic system.

Activation leads to an altered function and a concomitant alteration in the chemistry of the cell surface, which is a site rich in carbohydrate. Rat lymph node lymphocytes, either unactivated or activated for days in a mixed lymphocyte reaction, were treated with biotinylated lectins from a large panel
chosen to probe surface glycans. Lectins were revealed with avidin-phycoerythrin, and cell populations were analysed in a fluorescence-activated cell sorter; double-staining with FITC-monoclonal antibodies defined the functional lymphocyte subsets. UEA-1 and LTA bound to no cells of any type. Unactivated B-cells bound all the remaining lectins, save MPA, to a greater extent than did unactivated T-cells, and B-cell activation produced no change in glycan expression. Unactivated and activated T-cells all expressed α2,6-linked sialyl residues, but α2,3-linked sialyl expression was heterogeneous, did not correspond to any subsets, and was unchanged on activation. A large group of lectins showed low or no binding to unactivated T-cells, but bound on activation exactly in parallel to IL-2 receptor expression (Tac antigen). A single Gal/GalNAc/GlcNAc-containing structure could account for their binding.

The presence of neoplastic (light chain restricted) B-cell follicles in low-grade B-cell gastrointestinal (GI) lymphoma of MALT: although MALT is not present in normal human gastric mucosa, lymphoid tissue is acquired in response to colonisation of the mucosa by Helicobacter pylori. We have investigated the possibility that this acquired lymphoid tissue is of MALT type, which may provide the background in which lymphoma can arise. We examined gastric biopsies from cases of Helicobacter-associated gastritis, and biopsy and resection specimens from cases of gastric B-cell lymphoma of MALT. In cases of Helicobacter gastritis prominent lymphoid follicles were identified, and in these B-cell clusters were identified within the gastric epithelium, reminiscent of the features seen in the dome epithelium of the small intestinal Peyer's patch. This B cell/epithelium association was not associated with the epithelial cell changes or the glandular destruction seen in the lymphoepithelial lesions of MALT lymphomas.
In / cases of gastric MALT lymphoma Helicobacter could be identified; the remaining cases were gastrectomy specimens in which specimen washing may have contributed to the negative findings. We suggest that gastric MALT is acquired in response to local immunological stimulation as a result of mucosal colonisation by Helicobacter pylori, and that the development of MALT lymphoma is a subsequent event.

It has been suggested that low-grade B-cell lymphomas of the thyroid be included in the category of mucosa-associated lymphoid tissue (MALT) lymphoma, but the frequent presence of a follicular pattern in these tumours has contributed to a persistence of the view that they are of follicle centre cell (FCC) origin. We have investigated the phenotype and genotype of primary low-grade B-cell lymphomas of the thyroid, which also demonstrated features of MALT lymphoma including neoplastic centrocyte-like (CCL) cells and lymphoepithelial lesions. The appearances and immunohistology of the follicles were those of the follicular colonisation described in GI MALT lymphoma rather than of FCC follicular lymphoma. The predominant pattern of follicular colonisation conformed to that designated as colonisation by CCL cells showing a strikingly high proliferation rate. No evidence of the t(14;18) translocation was found in any case, on DNA extracted from fresh (n = ) or paraffin-embedded (n = ) tissue. These findings argue against an FCC lineage for primary thyroid lymphomas and support their inclusion in the MALT category.

We have used a panel of antibodies to demonstrate stages of granulocyte maturation by immunohistochemistry in decalcified, wax-embedded bone marrow trephine biopsies. Antibodies reactive with muramidase, α1-antitrypsin, neutrophil elastase and CD react with early granulocyte precursors; CD and calgranulin identify later granulocytes.
Individual antibodies differ in the populations of cells identified. The antibodies also react with cells of monocyte lineage and provide information concerning the organisation of monopoiesis, about which little is known. In normal marrow, granulopoiesis is zonal, with maturation occurring radially around trabeculae and blood vessels. This pattern is exaggerated in reactive hyperplasia and chronic granulocytic leukaemia (CGL). There is marked disruption of this zonal organisation in myeloproliferative and myelodysplastic states, with considerable overlap of patterns between these conditions. In chronic myelomonocytic leukaemia (CMML) there is complete absence of zonal arrangement of granulopoiesis, possibly due to monocytic proliferation obscuring the underlying granulopoietic pattern. We hypothesise that monopoiesis normally occurs in a randomly dispersed fashion within the marrow spaces and that, by analogy with CGL, CMML represents an exaggeration of this normal pattern.

As previously shown by us, in animals untreated with cyclosporin A the cell birth rate fell from an initial pre-immunisation value of cells/1000 cells/hour to cells/1000 cells/hour on day , followed by a rise to cells/1000 cells/hour on day . However, in cyclosporin A-treated animals the cell birth rate (cells/1000 cells/hour) was significantly depressed below the control pre-immunisation level and remained suppressed up to day , followed by an abrupt rise. These results are consistent with the hypothesis that T lymphocytes or their products not only drive ...

The morphological appearances comprised lymphoepithelial lesions (in one case), numerous lymphoid follicles and a diffuse infiltrate of monocytoid cells (centrocyte-like cells). There was striking plasma cell differentiation and colonisation of lymphoid follicles by monocytoid cells and neoplastic plasma cells.
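The cell birth rates quoted above, in cells/1000 cells/hour, are typically derived stathmokinetically: arrested metaphases accumulate roughly linearly after a mitotic blocker, so the birth rate is the slope of metaphase count against time. A sketch of that arithmetic with invented counts (the abstract's own figures did not survive extraction):

```python
# Illustrative stathmokinetic calculation: least-squares slope of
# accumulated metaphases (per 1000 cells) against time in hours.
# The counts and timings below are hypothetical, not from the abstract.

def birth_rate(metaphases_per_1000, hours):
    """Least-squares slope: metaphases per 1000 cells per hour."""
    n = len(hours)
    mean_t = sum(hours) / n
    mean_m = sum(metaphases_per_1000) / n
    num = sum((t - mean_t) * (m - mean_m)
              for t, m in zip(hours, metaphases_per_1000))
    den = sum((t - mean_t) ** 2 for t in hours)
    return num / den

# Counts made 1, 2 and 3 hours after metaphase arrest (hypothetical):
print(round(birth_rate([5.1, 10.0, 14.9], [1, 2, 3]), 2))  # 4.9
```

The result is directly comparable between control and cyclosporin A-treated groups, which is how a suppression below the pre-immunisation level would be demonstrated.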
Immunohistochemistry convincingly demonstrated heavy and light chain restriction in all the biopsies from both cases. The bladder is developmentally related to the hindgut and this is manifest in the variety of metaplastic epithelium seen at this site.

Circulating cells, encountering the endothelial surface, make contact with an environment rich in glycan. A large panel of biotinylated lectins was used to probe for variations in the glycans expressed on endothelia of arteries, veins, arterioles, venules, capillaries, high endothelial vessels and lymphatics in a range of normal and pathological human tissues. Formalin-fixed, paraffin-embedded specimens from the files of Manchester Royal Infirmary and Manchester Royal Eye Hospital were used; lectins were revealed with an avidin-peroxidase system. No differences were found between arterial, arteriolar, venular, venous or lymphatic endothelia: all expressed abundant complex-type N-linked glycans of several subtypes. Capillaries were highly variable and showed heterogeneity in their expression of 1) outer chain sequences in N-linked glycans and 2) mucin-type sequences, both between different normal organs and within an organ, implying that the surrounding tissue probably had a regulatory effect. Where endothelium was reactive, additional alterations were seen, and actively growing endothelium in granulation tissue expressed high-mannose structures. High endothelial vessels showed a much lower density and narrower range of glycan expression than did adjacent normal capillaries, despite their known very high rate of glycan synthesis and secretion. The lymphocyte homing receptor(s) would be a component of this restricted portfolio of glycan.

Monocyte margination in atherosclerosis associated with immunological injury. N. J. Combs, P. J. Gallagher, P. S. Bass. Clinical and experimental evidence indicates that immunological injury is associated with accelerated atherosclerosis.
Allograft recipients may develop accelerated atherosclerosis, and animals given serum sickness and a high-fat diet develop more extensive atherosclerosis than controls fed the diet alone. We have tested the hypothesis that atherosclerosis in these animals is associated with increased adhesion of monocytes to the aortic endothelium. Chronic serum sickness was induced in genetically hyperlipidaemic rabbits with native anionic bovine serum albumin (nBSA) or highly cationised protein (cBSA). Rabbits given nBSA showed a spectrum of glomerular endocapillary proliferative change; those given cBSA developed early membranous glomerulopathy. In controls the number of monocytes adherent to the endothelium ranged from /sq mm, and in animals given serum sickness /sq mm (nBSA) or /sq mm (cBSA). As there were no significant intergroup differences, these results do not support the hypothesis that immunological injury increases monocyte adhesion to the aortic endothelium. Department of Pathology, Southampton University Hospitals, Southampton.

Amyloid can be identified in up to % of elderly hearts, especially in the myocardium of the atrial appendages. In a small proportion of these cases amyloid is also present in the cardiac valves, but isolated valvular deposits are rare. A year-old male presented with bilateral leg weakness and an intradural tumour. He died days after spinal surgery, and deposits of a glioblastoma multiforme were identified in both the cerebrum and the cord. There was no history of cardiac disease, but all four valves showed translucent verrucous thickenings. These had a uniform eosinophilic histological appearance, stained with Congo red and had ultrastructural features of amyloid. Amyloid P protein was identified immunohistochemically in these deposits, but negative results were obtained with antibodies to AA, AL and other amyloid fibril proteins. There was no evidence of cerebrovascular or myocardial amyloidosis. The involvement of all four cardiac valves,
the striking absence of amyloid in other organs and the association with a widespread glioblastoma are unusual but unexplained features of this case.

The IMA is used as a bypass of narrowed coronary arteries; it is said to be less prone than vein grafts to develop subsequent occlusive disease. This study of pairs of IMAs from subjects of various ages was designed to see if the histological structure of the vessel might explain immunity to graft disease. Sixteen IMAs were fixed in distension by formalin at cm of water pressure and sections taken at the level of each rib (first to sixth). All arteries underwent similar changes along their length, with no significant difference between left and right. Intimal changes were minor, consisting of fibro-elastic muscular thickening in all age groups and in those who had died from vascular disease. The internal elastic lamella was well defined at all levels, but the external lamella was clearly defined only at the level of the fifth and sixth ribs (i.e. in the distal artery). Medial thickness decreased along the length of the arteries and in all cases changed from an elastic to a muscular structure, generally at the level of the fourth or fifth rib. The ratio of medial thickness to number of lamellae increases along the artery, notably at its point of change. Muscle fibre orientation changes from intermixed circular and longitudinal in the elastic part to predominantly circumferential in the muscular part. The pronounced structural differences of the IMA compared to the similarly sized epicardial coronary arteries, which are muscular, may be of relevance in explaining their markedly different incidence of atheroma.

Naomi Carter, S. Variend. Fatty change of the heart is a poorly defined pathological entity which in the adult heart can be caused by severe hypoxia, nutritional disorders,
poisoning by selected drugs and catecholamine release. It is most commonly seen in association with coronary artery disease. Little data exist with regard to the paediatric heart. In paediatric deaths coming to post mortem over a year period, there were cases of myocardial fatty change of varying severity and distribution detected in Oil Red O-stained sections. Infection and congenital disorders were implicated in a proportion of all cases of fatty change and in % of all deaths; seven of these cases had a combined infectious and congenital aetiology. Other causes of death included tumour, trauma and complications of birth. % of cases of sudden infant death syndrome (SIDS) had a fatty heart. Only one case of the total showed definite histological evidence of ischaemic myocardial damage. In some instances the degree of fatty change may be related to the duration and severity of the underlying condition. Some of these children may have an occult nutritional or enzyme disorder which is expressed at a cellular level in the form of fatty change and which contributes to their early death.

We have compared the reporting of temporal artery biopsies between two periods ( cases and cases respectively). The overall incidence of positive biopsies was % and % in each period. The number of patients with clear clinical evidence of cranial arteritis was %, but in the first period % and in the second % of these had positive temporal artery biopsies. When the histology was reviewed, approximately % of biopsies in each period had been erroneously reported as healed or atypical arteritis. In contrast, a true histological diagnosis of arteritis was missed in only a single patient. Approximately % of all patients with a clinical diagnosis of giant cell arteritis developed an additional symptom or pathological change associated with steroid treatment. Frequent final clinical diagnoses in patients with negative temporal artery biopsies were transient ischaemic attacks, cerebrovascular accidents, unexplained headache or migraine,
polyarteritis or polymyalgia rheumatica. These results confirm that one third of patients with definite clinical evidence of cranial arteritis will have negative biopsies, and that pathologists continue to misinterpret normal arterial ageing changes as evidence of healed or atypical arteritis.

Department of Histopathology and Department of Respiratory Medicine, St Bartholomew's Hospital, London EC1A 7BE. Adhesion of epithelium to extracellular matrices is mediated partly by a family of heterodimeric molecules known as integrins. We have examined the expression of the alpha- to alpha- integrin subunits in cultured human bronchial epithelial cells, and in bronchial biopsies from normal subjects and atopic asthmatics. We have also studied the expression of intercellular adhesion molecule-1 (ICAM-1, CD54, rhinovirus receptor). Bronchial epithelial cells from surgical specimens were grown as explant cultures on glass coverslips; bronchial biopsies were taken from the right upper and middle lobe carinas. Immunostaining was performed on acetone-fixed cells and frozen tissue sections using alkaline phosphatase and immunoperoxidase techniques. All biopsies showed strong positive staining of epithelium for alpha- , alpha- and alpha- integrins; staining for alpha- was weak or negative, and epithelium was negative for alpha- and alpha- except in two asthmatics where it was weakly alpha- positive. In contrast, cultured bronchial epithelial cells were positive for all these integrins except alpha- . Epithelium was positive for ICAM-1 in asthmatics but negative in all other biopsies; cultured cells were strongly positive for this molecule. It is concluded that expression of some adhesion molecules by bronchial epithelium may vary in relation to the cellular environment and that this may be important in disease. Department of Pathology, University of Birmingham, and Departments of Immunology and Histopathology, East ...
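The "one third" figure in the temporal artery biopsy audit above is a simple ratio: the fraction of patients with definite clinical cranial arteritis whose biopsy is nonetheless negative. A minimal sketch, with invented counts since the audit's numbers were lost:

```python
# Hypothetical illustration of the audit arithmetic: false-negative
# fraction of temporal artery biopsy among clinically definite cases.
# The counts are placeholders, not data from the abstract.

def false_negative_fraction(clinically_definite, biopsy_positive):
    """Fraction of clinically definite cases missed by biopsy."""
    return (clinically_definite - biopsy_positive) / clinically_definite

# e.g. 30 clinically definite cases, 20 with a positive biopsy:
print(round(false_negative_fraction(30, 20), 2))  # 0.33
```

A false-negative fraction of this size is why a negative biopsy cannot exclude giant cell arteritis in a patient with definite clinical evidence.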
Blood eosinophils are in a relatively inactive state; with migration into tissues eosinophils become more activated, and these activated cells are hypodense compared to most blood eosinophils. Low-affinity receptors for both IgE (FcεRII, CD23) and IgG (FcγRIII, CD16) have been documented on activated, hypodense eosinophils. This study assessed the expression of these proteins on tissue eosinophils derived from nasal polyps. The blood and nasal polyps of seven patients undergoing nasal polypectomy were studied. Normodense and hypodense eosinophils were isolated from venous blood by centrifugation on a discontinuous Percoll gradient; Percoll was similarly used to obtain an eosinophil-rich preparation from cells extracted out of the nasal polyps. Cytospin preparations were made of these samples, and frozen sections of each polyp were also prepared. Immunostaining using an alkaline phosphatase/anti-alkaline phosphatase detection system demonstrated that neither blood nor nasal polyp eosinophils expressed detectable CD23 or CD16. It is possible that eosinophils of nasal polyps are similar to blood eosinophils and are in a relatively inactive state.

In lavage and biopsy studies, the relationships between mucosal inflammation, bronchospasm and bronchial hyperreactivity are unclear. Since bronchial smooth muscle has an essential role in the pathophysiology of asthma, we have examined the extent to which it is involved in allergic inflammation. Biopsies including smooth muscle from asthmatics who did not use steroids and from controls were embedded in Araldite and stained for mast cells (monoclonal antibody AA1) and eosinophils (monoclonal antibody EG2). Mast cell numbers in the lamina propria and smooth muscle were similar for both asthmatic and control subjects (mean values: asthma, lamina propria /mm²
and smooth muscle /mm²; controls /mm² and /mm² respectively). Eosinophil numbers in the lamina propria were significantly increased in the asthmatics, but there was no significant increase in the number of eosinophils in the bronchial smooth muscle. Eosinophil numbers in the asthmatics correlated positively with FEV1. We conclude that the role of mucosal inflammation in the pathophysiology of asthma has yet to be determined.

We present cases of extra-pulmonary pneumocystosis diagnosed on routine surgical specimens (two biopsies of liver, and one each of gastric mucosa, small intestine and a peri-anal mass). In each case the histological features were similar to those seen in the lung and, as in other material from cases of AIDS, multiple pathology was often found. Extra-pulmonary pneumocystosis is now being reported from a wide range of clinical specialties. One reason for this increase may be that improved patient survival with 'topical' inhalational pentamidine therapy allows visceral foci of infection to become clinically apparent. This hypothesis is supported by the finding that a number of our cases were taking nebulised pentamidine.

We compared the B:T lymphocyte ratio in these hypersensitivity states to that observed in bronchiectasis (a chronic suppurative lung condition not thought to involve a hypersensitivity aetiology) and in smokers and non-smokers with no evidence of active pulmonary inflammation. Our results have shown that the B:T lymphocyte ratio is no different in EAA and sarcoid from that seen in bronchiectasis and in normal lungs. We believe that this is further evidence to suggest that BAL is an unrepresentative technique for the study of interstitial lung diseases and that more consideration should be given to the possibility of humoral immune components in the pathogenesis of EAA.

Monocrotaline was given to rats to determine its effects upon the number of pulmonary neuroendocrine cells and their peptides. In one experiment the concentration of noradrenaline in the lung was estimated by chromatography, and that of the peptides bombesin,
neurotensin and calcitonin gene-related peptide (CGRP) by radioimmunoassay. There was significantly less noradrenaline and bombesin in the lungs of test rats than in controls, but the levels of neurotensin and CGRP were unchanged. In a second experiment, pulmonary neuroendocrine cells in histological sections were labelled with antisera to bombesin, calcitonin, CGRP and protein gene product 9.5 (PGP 9.5) and counted. There was no change in the number of labelled neuroendocrine cells, expressed per unit area of lung or per unit length of airway, between test and control rats for calcitonin, CGRP or PGP 9.5; bombesin-containing cells could not be identified in either group. An increase in pulmonary neuroendocrine cells immunoreactive for bombesin and calcitonin occurs in the early stages of plexogenic pulmonary arteriopathy in man. The absence of such a change in monocrotaline pulmonary hypertension in the rat suggests that this is a poor model for the human disease.

This preliminary study was carried out to assess the quantitative expression of neuroendocrine and mast cells in adult human lungs in cases of asbestos-related disease. Ten cases each of asbestosis, pleural plaques, carcinoma and mesothelioma were studied in comparison with ten normals with no history of exposure to asbestos. The lung sections were stained for the neuroendocrine markers neurone-specific enolase (NSE) and chromogranin, and with chloroacetate esterase and toluidine blue for mast cells. There was a notable variation in the number of neuroendocrine and mast cells between the control and asbestos-related disease groups; variation was also seen between the various asbestos-related diseases. Though not statistically significant, the trend of the variation indicated that the individual diseases follow a particular pattern in the expression of these two cell populations.

We have studied fibrous tumours of the pleura using morphology,
immunohistochemistry and electron microscopy. The findings were compared and contrasted with reactive pleural fibrosis and desmoplastic mesothelioma. The fibrous tumours had a range of histological appearances and % were malignant in nature. The immunophenotype was uniform and consistent, with positive staining for vimentin and alpha smooth muscle actin. This was in sharp contrast to the findings in reactive pleural fibrosis and desmoplastic mesothelioma. Ultrastructural appearances of the fibrous tumours of the pleura were supportive of a myofibroblastic origin. We propose that fibrous tumours of the pleura arise from the submesothelial myofibroblast. The malignant fibrous tumours have a distinct immunohistochemical profile and electron microscopic features to differentiate them from malignant mesenchymal mesothelioma.

A study was undertaken to evaluate the use of immunoperoxidase stains on paraffin-embedded tissue to define the cell type in routine lung cancer preparations, and in particular to identify a subgroup of tumours showing neuroendocrine differentiation. Forty-four consecutive thoracotomy cases were selected. Following a pilot study of cases to assess digestion times and potentially useful antibodies, the remaining cases were processed using a battery of monoclonal antibodies: cytokeratins (AE1/AE3), neuron-specific enolase (NSE), chromogranin and beta-2 microglobulin. In addition to the carcinoid tumours and oat cell carcinoma in the study, large cell carcinomas and adenocarcinomas demonstrated positive neuroendocrine markers. Ultrastructurally, dense core granules could be demonstrated in only / of the large cell carcinomas and / of the adenocarcinomas. The discrepancy between the immunoperoxidase staining and the electron microscopic features likely reflects the heterogeneity of these tumours. In this study none of the tumours co-expressed neuroendocrine markers and beta-2 microglobulin.
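The discrepancy noted above between immunoperoxidase staining and electron microscopy can be summarised as a confirmation rate: the fraction of immunohistochemically positive tumours in which EM actually demonstrates dense core granules. A hedged sketch, with hypothetical counts standing in for the figures lost from the abstract:

```python
# Illustrative concordance arithmetic between two detection methods.
# Counts are invented placeholders, not data from the abstract.

def confirmation_rate(em_positive, ihc_positive):
    """Fraction of IHC-positive tumours confirmed by electron microscopy."""
    return em_positive / ihc_positive

# e.g. granules demonstrated in 2 of 6 IHC-positive large cell carcinomas:
print(round(confirmation_rate(2, 6), 2))  # 0.33
```

A rate well below 1.0 is consistent with the authors' reading that these tumours are heterogeneous, with only a minority of cells (or deposits) carrying ultrastructurally detectable granules.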
However, the staining pattern was inconsistent in the remaining cases. High molecular weight cytokeratin was strongly positive in all cases of squamous cell carcinoma and negative in everything else. In summary, monoclonal NSE and chromogranin appeared to provide sufficient information to identify neuroendocrine differentiation in the cases examined in this study. High molecular weight cytokeratin was found to be a useful discriminator for squamous cell carcinoma. Beta-2 microglobulin was negative in all the tumours showing neuroendocrine differentiation, but the inconsistency of staining in non-neuroendocrine tumours made it less helpful for routine laboratory use.

Although endocrine differentiation is the essence of small cell carcinoma of the bronchus, its occurrence in other morphological (non-small cell) types of bronchial tumour (large cell, squamous and adenocarcinoma) is well described. However, its prevalence in such tumours is uncertain, estimates differing from study to study and according to how it is sought. We have examined, by immunolabelling, expression of five endocrine marker proteins (neuron-specific enolase (NSE), protein gene product 9.5 (PGP 9.5), the BB isoenzyme of creatine kinase (CK-BB), synaptophysin and S-100 protein) in bronchoscopic tissue biopsies of non-small cell carcinoma, and assessed variability within and between tumour deposits in subjects coming to necropsy with disseminated disease. Exactly half of the tissue biopsy specimens immunolabelled for one or more markers: one for four, four for three, twenty for two and five for one, possibly indicating an element of endocrine differentiation inapparent from their morphology. Expression was even more prevalent amongst the extensively sampled tumours at necropsy.

Since the introduction of the new general practitioner contract in April, there has been a significant increase in the numbers of skin biopsies received in histopathology departments; in our department there has been a threefold increase in numbers of general practitioner skin biopsies.
The aims of this study were to critically appraise these biopsies and compare them with similarly sized skin biopsies received from hospital in-patients via general and plastic surgeons for the six months before and after 1st April. Data collected included numbers received, range of pathological diagnoses, quality of information supplied on the request card, accuracy of clinical diagnoses, adequacy of excision, and age, sex and sites of lesions. The results showed a similar range of pathological diagnoses. The quality of clinical information supplied was comparable in the two groups, as were the age and sex of patients. General practitioner biopsies were less common from the face. Clinical recognition of lesions was somewhat less accurate amongst general practitioners than amongst hospital surgeons, and inadequate excision was more common in general practitioner cases. % of general practitioner lesions were found unexpectedly to be premalignant or malignant (eight cases) and all of these were inadequately excised. Important implications emerging from this study are discussed. An audit of skin biopsy specimens from general practitioners in Grampian region: changes in requesting practice and specimen type.

Assessment of resection margins of surgical specimens is becoming more important in many fields of pathology. We were interested in developing a means of assessing surgical margins in dermatopathology, in both conventionally removed skin ellipses and in skin ellipses removed during the Mohs chemosurgery technique. Many of the conventionally used markers, such as Indian ink, alcian blue and Tipp-Ex correction fluid, are difficult to use in that they are messy to apply, slow to dry and show insufficient contrast with one another, both in the gross specimen and microscopically. We used 'Superman' paint, a paint used for resin and plaster models, which comes in a wide range of colours. The paint was easy to apply,
did not run and dried quickly. Because of the variety of colours available we were able to apply contrasting colours to the vertical and horizontal axes of excision of skin ellipses removed by Mohs chemosurgery. The paint provided a good marker grossly and was not affected by freezing the tissue. Model paint provides another marker for surgical excision margins and is particularly useful for Mohs chemosurgery, where horizontal and vertical axes are marked in order to assess the adequacy of the excision. The model paint may also be useful in other branches of surgical pathology where resection margins are important.

In psoriasis there is altered epidermal differentiation and increased epidermal turnover, both of which involve changes in intercellular adhesion. A quantitative immunohistochemical comparison of the expression of the integrin β and α subunits was made; staining was disclosed using an avidin-biotin peroxidase technique. Large epidermal dendritic cells (antigen-presenting cells) expressed β- . The psoriatic skin showed increased α- , α- and β- expression, but α- and β- showed no significant difference from normal. The increased integrin expression by keratinocytes seems to be a reflection of altered epidermal differentiation rather than of increased keratinocyte turnover. The increase in β- positive dendritic cells could be a reflection of altered antigen handling in psoriatic skin.

Interleukin-2 (IL-2), used in the treatment of patients with metastatic disease failing to respond to conventional treatment, can induce regression of tumour bulk in certain patients. However, the systemic administration of IL-2 is associated with a number of toxic effects, including dermatological complications, which have been poorly documented. We have prospectively studied the dermatological reactions of patients treated with IL-2 for metastatic colorectal carcinoma. Pre- and post-treatment biopsies were obtained where possible, and sections stained with H&E,
Giemsa and PAS; fresh tissues were subjected to immunophenotyping. There were female and male patients, with treatment courses ranging in number from to . Only one patient had a past medical history of any skin complaint (eczema). Four patients suffered a diffuse erythematous reaction with mild desquamation and dryness; the other patient developed generalised erythroderma and an additional photosensitivity-type reaction. Histology, after the initial course, revealed patchy spongiosis, exocytosis and basal layer epidermal damage with a mild perivascular chronic inflammatory cell infiltrate. With subsequent treatment there was thickening of the epidermis, pigment incontinence, dermal oedema and a more marked chronic perivascular cell infiltrate. Immunohistochemistry revealed marked changes in the expression of CD , HLA-DR, CAM and CD in the dermis. These changes were greatly heightened with subsequent treatments, with additional changes in other T cell markers. Clearly IL- enhances the parenchymal expression of antibody-dependent and antigen-independent accessory molecules which are important in focusing the immune response.

Claire M. Thornton, Maureen Y. Walsh. We examined all primary cutaneous malignant melanomas seen in this department over a five year period from to . The total number of tumours was . The number of cases of malignant melanoma, both invasive and in situ, increased from cases in to cases in . Superficial spreading melanoma was the most common type of melanoma, accounting for % of the total cases; this was followed by nodular melanoma, which accounted for %, with lentigo maligna melanoma and acral lentiginous melanoma being the least common types. There was an increasing number of tumours presenting with a Breslow's depth of less than mm,
the figure rising from % of cases in to % of cases in . Most cases were still of Clark's level IV, although increasing numbers of tumours presented at Clark's levels I and II, with a corresponding reduction in cases presenting at Clark's level V. More lesions presented with a flat cross-sectional profile in the later years, the figures increasing from % in to % in . The number of lesions showing surface ulceration at presentation decreased from % in to % in . The mitotic activity, the degree of pigmentation, the intensity of the inflammatory cell infiltrate, the predominant cell type and the incidence of vascular invasion showed no change over the study period.

Classification of benign vascular tumours is notoriously difficult and clinicopathological correlation is often imprecise. This almost certainly reflects the tendency of pathologists to lump together different lesions under the broad heading 'haemangioma', sometimes with capillary/cavernous subtyping. Twelve cases of a distinctive subset of cavernous haemangiomas, to be known as sinusoidal haemangioma, are presented. These presented in adults ( female, male; mean age years, range to ). Five arose in the upper limb and five on the trunk (of which two developed in mammary subcutis). All were solitary and presented as a bluish cutaneous swelling, up to cm in diameter, of variable duration. One case was associated with ipsilateral gynaecomastia. Average follow-up of years has revealed no tendency for local recurrence or metastasis. Histologically these were subcutaneous/deep dermal lesions with a lobular, sieve-like appearance and focally ill-defined margins. They were composed of dilated, thin-walled, intercommunicating vascular channels with a pseudopapillary architecture. Thrombi were common and two cases showed central infarction. Vascular spaces were lined by monolayered endothelium which was often plump and hyperchromatic but not mitotic.
Distinction from conventional cavernous haemangioma and angiosarcoma (particularly in breast lesions) is discussed.

Current methods for the identification of herpes simplex virus (HSV) may fail to identify the presence of the virus in biopsy or autopsy material. We have investigated autopsy cases and neurosurgical biopsy cases of clinically suspected herpes simplex encephalitis by a nested polymerase chain reaction (PCR). DNA was extracted from routinely processed and paraffin embedded material by proteinase K incubation, phenol chloroform extraction and ethanol precipitation. Nested PCR was performed using known oligonucleotide primers designed from the HSV type glycoprotein D gene, from an area with the lowest homology with HSV type . Two of the autopsy cases and of the neurosurgical biopsy cases were PCR positive for HSV. The third neurosurgical biopsy case was not confirmed by PCR as being due to HSV. Such primers to HSV allow the rapid retrospective diagnosis of herpes simplex encephalitis and should be of value to neuropathologists.

Tissue embedded at low temperature in Lowicryl K M resin has been shown to be suitable for immunogold labelling of cellular antigens, and to be capable of withstanding the processing required for hybridization of nucleic acid probes. The aim of this study was to establish basic conditions suitable for hybridization of digoxigenin labelled DNA probes to Lowicryl embedded material, both at the light and electron microscope levels. Cultured haemopoietic cells were embedded after brief aldehyde fixation. Digoxigenin labelled whole human DNA or plasmid (negative control) probes were applied to thin and semi-thin sections after proteolytic digestion and/or denaturation by heat or alkali. Hybrids were detected in semi-thin sections by standard colourimetric methods, and in thin sections by immunogold techniques. Elaborate blocking procedures and prolonged washes were found to be unnecessary.
Specific nuclear signal was seen at both the light and EM levels with the whole DNA probe, revealing details of nuclear DNA distribution not evident in paraffin sections or cytospun preparations. Non-specific binding and background were minimal. Signal was greatly reduced if the denaturing step was omitted, and was slightly increased by proteolytic digestion, though at the expense of cytoplasmic morphological integrity. While the sensitivity of this system is limited by the fact that hybridization occurs only at the surface of the section, it is a rapid and specific means of nucleic acid detection, and offers the possibility of accurate localization of intracellular human and viral nucleic acid sequences at a fundamental level.

Sections of prostatic carcinoma, mouse liver, kidney and gut were used. ACP was demonstrated using an azo-dye coupling method. The ACP was unaffected by the addition of mM tartrate, although the ACP was known to be 'tartrate sensitive'. The addition of mM tartrate or mM sodium fluoride weakened but did not eliminate the reaction. Mouse tissues, prostatic carcinoma and leiomyosarcoma tissues were processed using various fixatives and embedding procedures. These were tested for ACP and TRAP. The ACP in the multinucleated giant cells of the leiomyosarcoma survived standard formalin fixation and paraffin wax processing and was tartrate resistant. As expected, mouse liver, kidney and gut ACP and prostatic ACP did not survive most fixatives and embedding procedures. However, ACP could be demonstrated in those tissues processed by the 'AMeX' method, i.e. fixed in acetone at - °C, processed through methyl benzoate and xylene to paraffin wax. The ACP thus localised in paraffin blocks was eliminated by the addition of mM tartrate to the incubating medium. If needed, the AMeX procedure preserves some tartrate sensitive ACP in paraffin blocks. ACP that survives standard fixation and embedding procedures is likely to be tartrate resistant.

showed ACP only in osteoclasts.
Various elements of tissue processing procedures were examined to find the conditions necessary to achieve maximum ACP localisation. Tissue blocks were processed into wax using standard embedding procedures. Three fixatives were used: % neutral buffered formalin, formal calcium and ethanol, all at + °C for hours. Two decalcifying fluids were employed: % ethylenediamine-tetraacetic acid (EDTA) pH at + °C, °C and °C, and formic acid/sodium citrate at + °C for hours. Formalin fixed, EDTA treated tissue at + °C produced maximum ACP activity. ACP was shown in osteoclasts, some osteocytes, chondrocytes, cement lines, tide mark, periosteal cells and large macrophage-like cells in bone marrow. Formalin fixed tissues decalcified in EDTA at °C and °C were negative for ACP, as were all tissues fixed in ethanol. In formal calcium fixed tissue and formic acid/sodium citrate decalcified tissue the ACP reaction was weaker, with some elements, e.g. chondrocytic ACP, missing. All the ACP preserved through paraffin processing was tartrate resistant.

It is well recognised that morphology and optical resolution are vastly improved with resin as opposed to paraffin embedding of tissue. However, difficulties in producing consistent immunoperoxidase reactions on resin sections have caused many departments to abandon the technique. Although antigenicity is preserved, the low optical density of diaminobenzidine (DAB) means that the reaction product is barely visible in thin resin sections. The aim of this study was to develop a method whereby antibodies commonly used on paraffin sections could be successfully applied to µm resin sections. Tissues fixed for - hrs in % formol saline were partially dehydrated and infiltrated with LR White resin at °C, followed by polymerisation at °C using a catalytic method. µm sections were reacted with polyclonal and monoclonal antibodies using a standard indirect immunoperoxidase technique. Visualisation was with a silver amplification system for DAB (Amersham) applied as the final stage.
Excellent results have been obtained with a range of antibodies including S , von Willebrand factor, immunoglobulins, UCHL1 and L (Dako Ltd). Using this technique it is now possible to combine high resolution light microscopy with a precise immunocytochemical reaction. The advantages are obvious and are particularly relevant in the field of lymph node pathology.

However, at and Gy the number in muscle dropped significantly and hours following treatment, but had recovered to control levels by five days. In the lamina propria, AgNOR numbers increased initially after the and Gy treatments but returned to control values by five days. With Gy the AgNOR numbers showed a significant fall hours after irradiation; this decline continued up to five days. It is evident that AgNOR numbers within the small intestine are affected following irradiation. The variation in counts is dependent on dose, cell type and time since irradiation.

Karen M. Britten, W. R. Roche, Department of Pathology, Southampton University General Hospital, Southampton SO XY. In response to ever-increasing demands for immunophenotyping in inflammatory disorders, we have developed alternative fixation and embedding techniques for small biopsy specimens. Bronchial biopsies were fixed in buffered formalin and processed for embedding in Araldite, or were fixed in acetone containing protease inhibitors and embedded in the water-soluble resin glycol methacrylate (GMA). GMA allowed for the investigation of a full phenotypic profile akin to that which may be performed in frozen section, while yielding far superior morphology and greater numbers of sections from small biopsies. The phenotypic markers included those for T-cells (CD , CD , CD , CD and LFA1), macrophages (CD11c, CD ), mast cells ( B B and AA ) and eosinophils (MBP, EG1 and EG2). We have also demonstrated neutrophil elastase, cytokines and the cell adhesion molecules ICAM-1, ELAM and VCAM.
Similar high quality sections were obtained with Araldite, but the repertoire of antibodies was restricted to those antibodies which can normally be applied in paraffin. We suggest that for small biopsies which require detailed immunohistochemistry, such as in the areas of transplantation and mucosal immunology, fixation in acetone at - °C with the inclusion of protease inhibitors and processing into glycol methacrylate with careful temperature control gives optimum results.

These included % formalin at various temperatures, microwave treatments, Bouin's fluid and a commercially available product, "Rapid Fix". The tissues were routinely processed, embedded in paraffin wax and stained with the haematoxylin and eosin method. An evaluation of the fixation methods was also carried out for immunocytochemical stains. From microscopic examination of the results it was evident that for the haematoxylin and eosin stains, % formalin at °C was consistently the optimum method of choice. For immunocytochemistry all methods resulted in a poor performance using a standard trypsinisation time; however, an acceptable result was achieved from % formalin at °C and a microwave fixative method by reducing the trypsinisation time. This, however, required strict control. Thus it is possible to fix tissue within one hour by a method which is cost effective and which can be used, within limitations, for immunocytochemistry.

We set out to study the extent and time dependency of storage related artefacts in Cytospin fluid (Shandon). We assessed the effect of storage in Cytospin fluid on the nuclear morphology of breast FNAC. FNA specimens were collected on day and slides were made from each specimen on day , day and day . These were fixed with a spray fixative. Between each sample preparation, the Cytospin collection fluid was kept at °C. The slides were Feulgen stained and nuclear morphological parameters (area, perimeter, form AR, form PE, and convexity-concavity) were measured using a Seescan Solitaire Plus image analysis system.
There was a significant change in all the parameters between day and day , and between day and day (p < . ). The measurements were repeatable on different occasions without significant difference in the results. These results indicate that storage in Cytospin fluid significantly alters nuclear morphology: the nuclei become progressively larger and more irregular in shape. It is therefore important to standardize the storage conditions of FNA specimens if accurate objective comparisons are required.

The identification of macromolecular components of hydrated pathological tissues revealed by low temperature scanning electron microscopy (LTSEM) is an emerging field of enquiry. In order that the technical details for LTSEM labelling may be established, an experimental protocol has been drawn up in which label particles of high atomic number are visualized. The experiment comprises a model system in which bovine serum albumen, as a known antigen, is dissolved in phosphate buffered saline and adsorbed onto nitrocellulose membrane. The antigen on the membrane is subsequently reacted with rabbit anti-cow antibody and then either protein A-gold or goat anti-rabbit-gold. After treatment with a developer to add a layer of silver to the gold particles, the preparations are attached to stubs, rapidly frozen in nitrogen slush at - °C, coated with aluminium and observed on the LTSEM stage at approximately - °C. Backscattered electron imaging at high accelerating voltages ( kV) is used to detect silver-enhanced gold particles; these are clearly visualized. The frozen hydrated preparations are stable under the described conditions of LTSEM operation. In contrast, dry, conventional SEM preparations are beam sensitive; their initially observed delicate ultrastructure quickly degrades on exposure to even the moderate kV electron beams employed in secondary electron imaging.
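The nuclear morphometry in the cytospin study above is based on shape parameters derived from area and perimeter. The abstract does not give the exact definitions used by the Seescan system, but a widely used dimensionless form factor is the circularity 4πA/P², which equals 1 for a perfect circle and falls toward 0 as an outline (e.g. a swollen, irregular nucleus) becomes less round; a minimal illustrative sketch, not the instrument's actual algorithm:

```python
import math

def circularity(area: float, perimeter: float) -> float:
    """Dimensionless form factor 4*pi*A / P**2.

    Returns 1.0 for a perfect circle and progressively smaller
    values as the measured outline becomes more irregular.
    """
    return 4.0 * math.pi * area / perimeter ** 2

# A circle of radius r: area = pi*r**2, perimeter = 2*pi*r -> circularity 1.0.
r = 5.0
print(circularity(math.pi * r ** 2, 2 * math.pi * r))  # 1.0

# A unit square (area 1, perimeter 4) is less "round": pi/4 ~ 0.785.
print(circularity(1.0, 4.0))
```

Tracking such a ratio alongside raw area would separate the two effects reported above: nuclei becoming larger (area increases) versus more irregular (circularity decreases).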
A new approach to systematic storage of pathological specimens for low temperature scanning electron microscopy. As greater numbers of pathological specimens are stored for convenient future imaging by low temperature scanning electron microscopy (LTSEM), it is increasingly important to maximise expensive cryostore capacity. A cassette system has therefore been developed which increases the holding capacity of cryostores by a factor of over conventional storage. Each cassette consists of an aluminium disc mm in diameter with cylindrical wells mm in diameter, evenly spaced and mm apart. Specimens, mounted on JEOL stubs, are retained in the wells by set screws. Five cassettes fit into each x mm glass storage jar, giving a capacity for a canister cryostore (with jars per canister) of stub-mounted specimens. An additional benefit of the cassette system is that each specimen is afforded a mechanically and thermally protected environment. Specimen collection for LTSEM may take place in a distant operating theatre or laboratory. Immobilisation of specimens whilst they are in transit excludes any possibility of their being damaged during normal handling and improves the prospects for survival of accidents. A further benefit is that a particular mounted specimen can now be more speedily identified and removed for LTSEM from its well.

Many pathological conditions are characterised by the presence of cellular degeneration accompanied by cytoskeletal abnormalities. In many such disorders it is not clear whether cytoskeletal abnormalities are a primary process or are part of a secondary response to cellular insult by other agents or mechanisms. We have used a fibroblast cell culture model to study the effects of physical distension on the cellular cytoskeleton. Inert beads of and µm were introduced into cells by allowing endocytosis or by microinjection. The cytoskeletal response to these beads was studied using immunofluorescence microscopy.
Beads introduced by endocytosis migrated to the perinuclear region and over hours became enmeshed in intermediate filament and microtubular aggregates. Actin microfilament organisation was not affected. In contrast, microinjection of beads produced an immediate collapse of microfilaments, with vimentin and tubulin distribution being preserved. The immediate response to microinjection is similar to the collapse of actin filaments seen upon thermal stress. This experimental model has shown that aggregates of cytoskeletal proteins may be produced within cells as a secondary response to intracellular debris, and that microinjection may induce cytoskeletal abnormalities similar to those seen in thermal stress.

The detection of numerical chromosome aberrations in interphase tumour cells by non-isotopic in situ hybridisation has been previously described, but the application of this technique to paraffin embedded material has been complicated by the requirement for tissue sectioning, with the production of partial nuclei. In this study, the analysis of - µm thick paraffin sections of conventionally processed CaSki cells, using both a human papillomavirus (HPV type ) and a chromosome specific alphoid probe, was compared with results obtained using intact cells. The use of sectioned material did not give signal distributions comparable to those obtained using whole cells. This is consistent with a mathematical model derived for the relationship between section thickness, nuclear size and nuclear retention in paraffin sections. A method was therefore developed for the extraction and analysis of nuclei from thick ( µm) paraffin sections and applied to the analysis of squamous cell carcinomas of the cervix (n = ). The number of copies of chromosome varied from to , and this variation was apparent both between lesions and between tumour cells within the same lesion. This is comparable to results obtained with cervical carcinoma derived cell lines. There was, however, no clear relationship between the presence of HPV sequences and chromosome number.
These preliminary results suggest that the postulated loss of host suppression of HPV gene function by deletion of genes on chromosome does not occur through gross chromosomal abnormalities.

Membranous glomerulonephritis (MGN), an immune-mediated disease, is a frequent cause of renal morbidity in man. Cationic bovine serum albumin (cBSA), given to NZW rabbits in a chronic serum sickness-type protocol, is known to induce glomerular changes similar to the human disease. We assessed the effect of a short course of the immunosuppressive drug CyA on the development of early stage, cBSA-induced MGN. Fourteen male NZW rabbits received an IV immunising dose of mg cBSA and µg E. coli endotoxin. One week later they commenced daily IV injections of mg cBSA for consecutive days. Three rabbits were sacrificed at this time. Six of the remaining rabbits were commenced on a short course of oral CyA, whilst continuing to receive daily doses of cBSA; the remaining rabbits were given cBSA only. After consecutive doses of cBSA these animals were sacrificed. All rabbits given doses of cBSA showed early stage MGN, and those given doses of the cationic protein showed a more mature, established disease (thickened glomerular capillary walls with diffuse, global, granular deposition of IgG/C and subepithelial electron dense deposits). Four of the CyA-treated cBSA rabbits showed a marked reduction in glomerular capillary wall C deposition. Three of these rabbits had considerably less severe disease ultrastructurally. These results suggest that CyA may alter the course of cBSA-induced MGN.

Department of Pathology, University of Edinburgh Medical School: Late renal allograft loss is due to arterial intimal proliferation and lumenal narrowing. There are few studies of the phenotypes of the intimal cells. We analysed these lesions by light microscopy and immunocytochemistry using antisera against T-lymphocytes, B-lymphocytes,
macrophages, smooth muscle cells, class II HLA DR molecules and the proliferation antigen PC10. Vessels were studied from graft nephrectomies resected between and months post-transplantation. We identified an arterial endothelialitis, reported in cardiac allografts but not emphasized in renal graft rejection. Patterns of arterial pathology were recognised: ( ) endothelialitis in interlobular arteries without intimal proliferation, ( ) endothelialitis in larger arteries accompanied by intimal proliferation of smooth muscle, ( ) "inactive" lesions with thickened intima (± foam cells) but no endothelialitis, and ( ) "natural" atherosclerosis of larger arteries. The endothelialitis tended to occur in shorter surviving grafts. The predominant cell was the macrophage, with fewer T-lymphocytes. PC10 was expressed in mononuclear cells, smooth muscle cells and endothelial cells, particularly, but not exclusively, in younger grafts. We propose these lesions evolve, variably, from an early endothelialitis to late chronic vascular rejection or graft atherosclerosis. The predominance of the macrophage at all stages suggests it plays a significant role in the evolution of these lesions.

The rat is used to study the response of the renin-angiotensin system in diseases such as hypertension. There are structural differences in the JGA, but there are few comparisons of its response to stimulation between species. We used renin antisera and an immunoperoxidase technique to stain renin-containing cells (RCC) in rat (n = ) and human kidneys ( nephrectomy and autopsy cases). We stimulated the renin-angiotensin system experimentally by clipping one renal artery ( rats) and inducing sodium depletion ( rats), and studied the analogous human diseases: renal artery stenosis ( cases) and Addison's disease ( untreated and treated cases). We counted the RCC and plotted their distribution on scatter diagrams.
There were some differences in distribution between the species, but in both there was a gradient in the distribution of RCC, which predominated in the superficial renal cortex. In sodium depleted rats, recruitment of RCC in the juxtamedullary JGAs abolished this gradient, while in some of the animals with renal artery clip hypertension the normal gradient was reversed, with most RCC in the deep cortex. In both untreated Addison's disease and in renal artery stenosis there was an overall increase (x ) in RCC, but the normal gradient of their distribution within the renal cortex was maintained. These results have implications for the role of the intrarenal renin-angiotensin system in the control of renal haemodynamics.

We have developed a model of adult human prostatic epithelium that allows architectural and cytological features to be maintained. Epithelial organoids produced by enzymic digestion of human benign prostatic hyperplasia tissue were suspended in type collagen gel and subcutaneously xenografted into intact male nude mice. The xenograft is progressively invaded by mouse stromal cells; these surround the epithelial organoids and support the reformation of epithelial structures with a lumen, lined by a mixed epithelial layer. As the lumen forms, tall columnar epithelial cells begin to be seen; these progressively express the prostate specific epithelial markers PSA and PSAP. Further, the xenografts express appropriate cytokeratin markers in both the luminal and basal epithelial cells. Gels placed within a µm Millipore chamber, which do not undergo stromal invasion, lose all epithelial organisation, with disorganised sheets and balls of cells being found. These cells do not express the secretory markers. In the absence of an androgenic stimulus, epithelial structures with a lumen are formed but there are no tall columnar secretory cells and no expression of the secretory markers.
This model has further been investigated to determine the response of human prostatic cells growing in vivo to the antiandrogen flutamide and to a -aza-steroid reductase inhibitor. These observations indicate the essential role of both stromal cells and androgens in dictating functional differentiation. This model will allow the dissection of the regulatory processes involved in prostatic differentiation.

Hepatocyte growth factor (HGF) is the most potent known mitogen for adult rat hepatocytes in primary culture and is thought to have an important role in liver growth and repair. Little is known about the mechanism of action of HGF on hepatocytes. Since the adenylate cyclase system has been implicated in hepatocyte growth control, we examined the role of adenylate cyclase and cyclic AMP (cAMP) in HGF-stimulated DNA synthesis. Human recombinant HGF (hrHGF, ng/ml) had no effect on basal or stimulated adenylate cyclase activity in membranes prepared from freshly isolated rat hepatocytes. Similarly, hrHGF had no effect on intracellular cAMP levels in cultured hepatocytes. Furthermore, agents which increase cAMP inhibited hrHGF-stimulated DNA synthesis in primary hepatocyte cultures: glucagon had an IC50 of 10^- M; forskolin ( µM), IBMX ( µM), -bromo cAMP ( µM) and dibutyryl cAMP ( µM) completely inhibited hrHGF-stimulated DNA synthesis. From this we conclude that adenylate cyclase and cAMP do not have a role in HGF-stimulated DNA synthesis in primary cultures of adult rat hepatocytes. The receptor for HGF has recently been identified, in tissues other than the liver, as c-met, a protooncogene with intrinsic tyrosine kinase activity. Whether or not c-met acts as the receptor for HGF in hepatocytes is currently under investigation.

A novel in vivo model of intestinal differentiation is described. Fourteen-day, undifferentiated fetal rat small intestine, stripped of the major part of its mesenchyme, then suspended in a type I collagen gel and xenografted in a nude mouse,
undergoes small intestinal morphogenesis and cytodifferentiation. All four major epithelial lineages, namely Paneth, goblet, columnar and endocrine, are present. Double labelling in situ hybridization, employing biotinylated and digoxigenin labelled DNA probes to whole rat DNA and whole mouse DNA, reveals an unusual juxtaposition of species specific stroma: the outer longitudinal smooth muscle layer, and the major part of the lamina propria.

p is the most commonly altered gene in human tumours. Mutation leads to the production of an abnormal protein which can be detected by immunohistology. Such abnormalities are seen in a wide variety of tumours, including colon cancer. Abnormal p protein levels have been detected in - % of sporadic colorectal tumours. In order to determine if p mutations occur in carcinomas arising from dysplasia, we have investigated the prevalence of such mutations in colorectal carcinomas from patients with long-standing ulcerative colitis (UC). Immunocytochemical staining was performed on fresh and paraffin embedded UC carcinomas and sporadic carcinoma controls matched for site, stage and grade. Six areas of dysplasia (four associated with UC cancers) and sporadic adenomas were also stained. p protein was detected by immunohistochemistry in the fresh UC cancers and / ( %) of the paraffin embedded UC cancers using the antibodies CM1, PAb , PAb and PAb . Similar results were obtained in the sporadic carcinomas, with fresh cancers positive for p and / ( %) of the paraffin embedded cancers. Two areas of dysplasia which were associated with p positive UC cancer also showed positive p staining, along with an adenoma. Our results indicate that, unlike K-ras mutations, p protein abnormalities occur at a similar frequency in sporadic colorectal carcinomas and carcinomas arising in UC, as well as being present in UC dysplasia. This work suggests that p mutations play a role in the dysplasia-carcinoma sequence.

Three patterns of staining are found in human colon using a technique to demonstrate O-acetylation of sialomucins (mPAS).
% of individuals show uniform mPAS-positivity. The remainder are either entirely mPAS-negative or mPAS-negative with occasional positive crypts. We suggest that this represents polymorphism of an autosomal gene (OSA) controlling O-acetylation of sialic acid, isolated crypt-restricted mPAS-positivity in otherwise negative individuals representing somatic mutation of the OSA gene in crypt stem cells of OSA+/OSA- individuals. To test this we have studied colons from patients with rectal carcinoma, half of whom had received cGy radiation days preoperatively. Radiation did not affect the prevalence of the three phenotypes but increased the frequency of mPAS-positive crypts in a negative background ( x 10^- vs x 10^-, p < . ), largely due to crypts showing sectorial mPAS-positivity ( x 10^- vs x 10^-, p < . ), consistent with incomplete crypt colonisation by a recently mutated phenotype. The prevalence of this OSA+/OSA- phenotype (radiated %, non-radiated %) is very close to the predicted heterozygosity rate ( %, Hardy-Weinberg law). These results suggest that human colonic crypts are monoclonal, with a longer stem cell cycle than the mouse, and that mPAS staining provides a method for measuring human stem cell mutational load.

Pouchitis in ileo-anal reservoirs is a frequent complication of restorative proctocolectomy which is associated with considerable morbidity. Long term complications of pouchitis are unknown; however, patient follow-up with sigmoidoscopic surveillance is mandatory to assess dysplastic or neoplastic changes in residual rectal mucosa. We examined epithelial cell proliferative activity in the ileal mucosa of pouch biopsies using the antibody PC10, which detects nuclear expression of the kD nuclear protein proliferating cell nuclear antigen (PCNA).
Ten patients with functioning ileo-anal pouch reservoirs of at least one year duration were biopsied and assessed histologically for evidence of pouchitis. Formalin fixed, paraffin embedded biopsies from anterior and posterior pouch wall were examined with PC10 using the PAP technique. Ten terminal ileum sections from right hemicolectomy specimens were normal controls. Crypts in pouch biopsies and normal controls had a similar PC10 labelling index, with means of % and % (ranges - and - ) respectively. The sides and tips of villi had a significantly greater PC10 labelling index in the pouch biopsies ( % mean, range - ) compared with normal controls ( % mean, range - ). These results demonstrate an expanded proliferative epithelial compartment in ileal pouch mucosa which, in contrast to normal terminal ileum, involves villous surfaces. Considering the frequency and long term nature of pouchitis, these findings support the need for continued pathological and clinical assessment of the ileal mucosa in the neo-rectum of these patients.

W. Cui, I. C. Talbot, J. M. A. Northover. Significant alterations in structure, function and gene expression of mitochondria have been reported in colorectal tumours, but it is not known if these abnormalities are due to mitochondrial genetic alteration. In this study, total cellular DNA was isolated from rectal carcinomas, adenomas and their adjacent histologically normal mucosa. These DNA samples were digested separately with different restriction endonucleases and then analysed by Southern blotting using a purified mtDNA probe. The restriction fragment pattern of tumour mtDNA was compared to that of corresponding normal mucosal mtDNA. These results showed that there are no large deletions, insertions or rearrangements in tumour mtDNA, and no single base changes in the detectable regions, in spite of some polymorphic variations. Our results suggest that mtDNA changes are unlikely to have a major role in human colorectal tumorigenesis.
Hence, alterations in colorectal tumour mitochondria must be dependent upon other mechanisms. Crescentic colitis: the clinicopathological spectrum of a distinctive endoscopic feature in the sigmoid colon. N. A. Shepherd, S. Gore, S. P. Wilkinson, Departments of Histopathology and Gastroenterology, Gloucestershire Royal Hospital, Great Western Road, Gloucester. Crescentic colitis describes an endoscopic appearance of the sigmoid colon characterised by mucosal swelling, erythema and haemorrhage strictly localised to the crescentic mucosal folds. In a five year period this diagnosis was made in patients, representing % of all fibreoptic endoscopies. There was a male predominance and most patients were middle-aged or elderly. Diverticulosis was present in most ( %), but the abnormalities were confined to the crescentic mucosal folds with sparing of the diverticular orifices. The majority of patients presented with a history of bleeding per anum. Histologically there was a spectrum of changes varying from minor vascular congestion to florid active inflammatory disease with crypt architectural abnormalities mimicking ulcerative colitis. Three patients presenting with crescentic colitis later developed the clinical, endoscopic and histopathological features of distal ulcerative colitis; two other patients with a history of distal ulcerative colitis were found to have the characteristic changes of crescentic colitis only at endoscopy. Three cases showed the histological features of mucosal prolapse. The findings in this study demonstrate that a relatively specific endoscopic feature may exhibit a wide spectrum of pathological changes, whilst luminal mucosal inflammation of the sigmoid colon, usually in association with diverticulosis, may mimic the pathology of chronic inflammatory bowel disease. A small proportion of these cases may represent a strictly localised form of chronic ulcerative colitis.
of genome and antigen in various anatomical sites. The results have implications for the mechanism of entry of MV into neurons and for mechanisms of transsynaptic viral spread. Neuropathology, Institute of Pathology. J. Chow, J. Tobias, K. Colston, T. J. Chambers. Estrogen is generally considered to maintain bone mass through suppression of bone resorption. We have previously demonstrated that administration of pharmacologic doses of estrogen increased bone formation in ovary-intact rats. To assess the effects of physiological concentrations of estrogen on bone formation, estrogen was administered to ovariectomised rats in which bone resorption was suppressed by AHPrBP. Animals receiving exogenous β-estradiol (E2; µg/kg, µg/kg and µg/kg daily for days) showed a dose-dependent increase in trabecular bone volume of %, . % and . % respectively, compared with those rats treated with AHPrBP alone. The increase in bone volume was due to bone formation in E2-treated animals, in which bone resorption had been almost completely suppressed by AHPrBP. Neither ovariectomy, AHPrBP nor E2-treatment had a significant effect on the volume or rate of formation of cortical bone. Thus, the increased bone resorption which is a consequence of estrogen-deficiency entrains increased bone formation, which masks a simultaneous reduction in estrogen-dependent bone formation. It thus appears that estrogen maintains bone volume not only through inhibition of bone resorption, but also through stimulation of bone formation. PGs in bone formation in vitro may represent a pathway common to bone anabolism observed in response to many environmental stimuli; PGF2α was without effect. We found that nodule induction by PGs occurred early in the cultures, before nodules formed. Integrin expression in human bone was examined by immunohistological staining of EDTA-decalcified and undecalcified cryostat sections of fracture- and tumour-associated callus obtained at surgery and neonatal costochondral junctions obtained at autopsy.
Cases were stained with a panel of well-characterised monoclonal antibodies against β1, β3 and αvβ3 integrins using ABC peroxidase and indirect immunofluorescence techniques. Osteoclasts stained for β1, β3, αv and αvβ3 integrins, indicating that they express α β1 (VLA- ) and αvβ3 (the classical vitronectin receptor). Osteoblasts stained for β1 and α chains, and osteocytes stained for β1, indicating that osteoblasts express α β1 (VLA- ) and α5β1 (VLA-5, the classical fibronectin receptor) and that VLA-5 expression is reduced or lost during differentiation to osteocytes. J. Quinn, N. A. Athanasou, Nuffield Department of Pathology and Bacteriology, Level , John Radcliffe Hospital, Oxford OX DU. Osteoclasts are known to effect bone resorption in inflammation and malignancy, but whether other cells of the mononuclear phagocyte system, particularly macrophages and macrophage polykaryons, are similarly capable of pathological bone destruction is uncertain. Macrophages derived from tumours (human lung carcinomas, murine MMTV-associated mammary carcinomas) and inflammatory lesions (murine foreign body granulomas) were cultured on bone slices both in the presence and absence of ST2 stromal cells. These neoplastic and inflammatory lesions contained a heavy macrophage infiltrate but no giant cells and no calcified tissue. There was superficial roughening of the bone surface by macrophages both in the presence and absence of ST2 cells and, after days, scattered areas of lacunar resorption in co-cultures of macrophages and ST2 cells. Normal pulmonary tissue macrophages did not produce resorption lacunae under these conditions.
The results show that macrophages alone are capable of a type of low-grade bone resorption and that a subpopulation of tumour- and inflammation-associated macrophages, following specific interaction with stromal cells, can differentiate into cells capable of the specialised function of high-grade lacunar bone resorption. Macrophages may thus directly contribute to the osteolysis associated with metastatic tumours and inflammatory lesions in bone. Differences in the grade of osteolysis may also account for clinical differences in the degree and rate at which pathological bone resorption occurs. The George Washington University Medical Center, Washington, DC, USA. For decades, new technological advances have been hailed, or condemned, as representing the extinction of classical diagnostic surgical pathology, but so far the reports of the death of our specialty have been grossly exaggerated. Indeed, the new technologies of the s are seen as aids to, rather than replacements for, careful and intelligent gross and microscopic examination and interpretation, which have always remained the "gold standard" against which new techniques are measured. The decade will be marked by an explosion of possibilities for tissue as well as non-tissue diagnosis, balanced by a shrinkage in both specimen sizes (already evident in breast pathology) and healthcare budgets. Thus, the surgical pathologist will have to become, to an even greater extent, the complete physician, in order to be able to choose wisely and economically from the diagnostic "menu" available. The surgical pathologist will also have a greater need than heretofore to be a competent cytopathologist, as many of his or her cases will also have fine needle aspiration material, and nuclear grading will assume greater significance in tumor pathology. Finally, the roles of the surgical pathologist in both research and patient care (including direct patient contact) will require emphasis in order to attract more medical students into our specialty.
women or those with low grade disease. This has led to the view that HPV is unrelated to cervical cancer. A simple PCR protocol was developed using standards containing . - fg of HPV DNA in ng of normal human placental DNA to estimate levels of HPV DNA in smears or biopsies. The quantitative distribution of HPV DNA in normal and abnormal cervical epithelium was mapped throughout loop biopsy and hysterectomy specimens using micro-dissection and histology of alternate millimetre slices. Levels were measured in parallel cervical smears. Preliminary results show that a low level of HPV DNA (equivalent to one copy per cells) was usual in women with normal smears or low grade abnormalities. Replication of HPV DNA to a level of one copy or more per cell was limited to areas of CIN, which may be very small, and to atypical immature metaplasia. The level was reflected in the smear. A switch to a high level of HPV DNA is a biological and potential diagnostic marker for high grade precancer. Cervical biopsies were stained by the Crocker technique to demonstrate AgNORs. For the purposes of the study, biopsies were divided into five groups: normal, koilocytosis, CIN I, CIN II and CIN III. Using an IEMPS/Hasting AMS photon framestore, a customised image analysis system was created. The image is dilated and eroded to close any cavities; an outlining procedure locates the AgNORs and records their mid-points. From this set of mid-points, the mean distance of each AgNOR's three nearest neighbours is calculated. This procedure is carried out for several fields of view. This study attempts to relate the shape of the associated "mean distance" histograms to the histological diagnoses in fifty cases. Primary adenocarcinoma of the cervix: a retrospective clinicopathological study of cases. R. Ananoos, Kamta Nahar, Alison B. Grigg, S. Roberts, Sergin M. Sma. The patient felt healthy, with a tendency to gain weight being her only complaint. Gynaecological referral revealed a week-gestation-sized abdominal mass and bilateral varicose veins.
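The AgNOR measurement described above (locate each AgNOR mid-point, then average the distance to its three nearest neighbours, field by field) can be sketched as follows. This is a minimal sketch assuming the morphological steps (dilation, erosion, outlining) have already yielded mid-point coordinates; the coordinates below are invented for illustration.

```python
import math

def mean_nn_distances(points, k=3):
    """For each AgNOR mid-point, return the mean distance to its k nearest
    neighbouring mid-points (the 'mean distance' statistic of the abstract)."""
    means = []
    for i, p in enumerate(points):
        # Distances from point p to every other mid-point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        means.append(sum(dists[:k]) / k)
    return means

# Invented mid-points for one field of view:
field = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
values = mean_nn_distances(field)
```

A histogram of such values pooled over several fields of view gives the "mean distance" distribution whose shape the study relates to the histological diagnosis.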
Laparotomy, left oophorectomy, total abdominal hysterectomy and right salpingo-oophorectomy were carried out. The left ovarian tumour was cystic, smooth surfaced, unilocular, contained turbid brown non-greasy contents and was cm in diameter. A cm square area of the lining showed papillary tufting, and the remainder was smooth. Histology of these papillary areas revealed a papillomatous proliferative squamous neoplasm with keratinization and mild acute inflammation, without evidence of cytological atypia, stromal invasion or a benign cystic teratoma. The remainder of the cyst showed flat squamous, cuboidal or low-columnar type epithelium with an underlying vaguely endometrial type stroma in minute foci, with occasional small secondary cysts. This unusual squamous neoplasm was diagnosed as a proliferative epidermoid cyst. The glomerular peripolar cell is an epithelial cell situated at the vascular pole of the glomerulus. Recently, the existence of the peripolar cell has been doubted. The aims of this study were, first, to establish whether the peripolar cell is a unique cell type in the mammalian kidney and, second, to compare their numbers and morphology in different species. We used scanning electron microscopy to study - kidneys from each of mammalian species including man. We removed the glomeruli by microdissection and examined a minimum of vascular poles from each kidney. Peripolar cells were largest and most numerous in goat and sheep kidneys ( % and % of glomeruli); they were scantiest and smallest in human and bovine kidneys ( % and %). While aspects of some peripolar cells resembled podocytes and other peripolar cells had features reminiscent of parietal epithelial cells, in each species we were able to distinguish peripolar cells as a specific cell type. We conclude, first, that the glomerular peripolar cell is a unique cell type in the mammalian glomerulus and, second, that their morphology and number vary between species. Cystic renal disease occurs in various forms characterised by dilatation of different parts of the nephron.
The morphological, clinical and genetic features of these diseases are variable, but some animal models have been developed in an attempt to understand the mechanism of cyst formation within the nephron. We describe a polycystic disease of the kidney in the CBA/N mouse with an X-linked recessive immunodeficient syndrome. There is progressive cystic dilatation affecting all parts of the nephron. The cyst lining is composed of a single layer of epithelium with focal nuclear crowding and the formation of micropapillary structures. The cystic epithelial cells show subnuclear vacuolation; focal basement membrane thickening is also a feature. There is no significant inflammatory infiltrate present within these kidneys. Electron microscopic examination reveals that the subnuclear vacuolation is due to loss of the membrane infoldings at the basal pole of the epithelial cell, with fluid accumulation within the extracellular space. The basement membrane thickening is due to expansion of the lamina densa. The finding of a polycystic kidney lesion in these mice offers an opportunity to investigate the relationship between the immune system and renal cyst formation. Erythropoietin, a circulating glycoprotein, is the principal humoral regulator of erythropoiesis. It is produced in the kidney, but the precise cell of origin is controversial. Erythropoietin participates in a classical feedback control system which attempts to restore oxygen delivery to the tissues. Normally erythropoietin is present in the serum at picomolar concentrations, but levels may rise up to -fold during severe hypoxic stress. The mechanism linking renal oxygen sensing with erythropoietin synthesis is poorly understood, but there is evidence that cells of the inner cortex respond to tissue hypoxia by producing erythropoietin. Erythropoietin gene expression in the kidney can be detected by Northern blot analysis within one hour of exposure to hypoxic stimulation.
The cellular location of erythropoietin messenger RNA was detected in a murine model by in situ hybridization employing radiolabelled DNA probes and autoradiography techniques. Tubular cells of the inner renal cortex were identified as the main site of erythropoietin gene transcription. The location of erythropoietin production was confirmed by immunohistochemistry with antisera raised to pure recombinant DNA-derived human erythropoietin. The specific antibodies bound only to tubular cell cytoplasm, confirming the tubular location of erythropoietin-producing cells. Severe reversible acute renal failure and IgA nephropathy. R. Jackson, K. L. C. McLay, University Division of Pathology, Glasgow Royal Infirmary, Castle Street, Glasgow G OSF. Whilst progressive, irreversible renal failure is well known in the evolution of a significant proportion of cases of IgA nephropathy, disease-related, reversible, severe deterioration in renal function is a less well recognized phenomenon. Relatively few cases are documented in the literature, and these have invariably been associated with macroscopic haematuria. Renal failure has been largely attributed to a combination of tubular obstruction by red cell casts and acute tubular necrosis thought to be the result of a direct tubulo-toxic effect of released haemoglobin. In a review of cases of IgA nephropathy we have encountered two cases of this type. The clinical, histological, immunohistochemical and ultrastructural data relating to these cases are presented. On the basis of this evidence and an analysis of the relevant literature, it is concluded that the accepted pathogenetic hypothesis does not provide an entirely satisfactory explanation of the phenomenon. It is further suggested that immunologically mediated injury directed at the glomerular vascular pole and possibly the extraglomerular mesangium could compromise the tubular blood supply and might also lead to haemorrhage into the distal tubule at the macula densa.
Scanning electron microscopy (SEM), when combined with electron microprobe analysis (MPA), is an expeditious and accurate method of identifying substances deposited in tissues. We report a case in which these techniques were used to study crystals in a transplanted kidney. The patient presented at age three with vitamin D resistant rickets and Fanconi syndrome on the basis of cystinosis. He gradually went into renal failure and received a kidney transplant from his mother. It functioned well for a few months but then deteriorated and was removed one year after transplantation. A few crystals were observed by light microscopy in a biopsy taken at two months. A repeat biopsy at four months was examined by SEM and contained the typical hexagonal crystals of cystine. The solitary sulphur peak obtained with MPA confirmed their elemental composition. The hydrophilic nature of the cystine prevented their identification within routine sections for transmission electron microscopy (TEM). The hexagonal outline of crystals was confined to interstitial macrophages rather than tubular or glomerular cells. µm frozen sections mounted on carbon planchets allowed visualisation and analysis of the crystal deposits by SEM. The use of appropriate processing methods may be crucial in identifying crystal deposition within pathological specimens. During the year period - , new cases of dense deposit disease (membranoproliferative glomerulonephritis type II) were diagnosed in Northern Ireland. Two further known patients developed recurrent disease in renal transplants. The mean age of onset was years (range eight- years), with five males and eight females affected. Renal function was impaired in % of patients at presentation, which ranged from nephrotic syndrome ( %), nephritic/nephrotic syndrome ( %) and macroscopic haematuria with mild proteinuria ( %) to mild proteinuria with or without microscopic haematuria ( %). Serum C was universally low. Half of initial biopsies showed crescents, usually associated with rapid onset of renal failure.
% of patients required dialysis at or within two months of presentation. A further % of patients developed renal failure within three years of disease onset. Of the six patients transplanted, half have developed recurrent disease in their grafts. Six patients have not required dialysis at a mean follow up of four years (range one to eight years). Although relatively uncommon, dense deposit disease is an important cause of renal failure in children and adolescents. We have confirmed that this disease has a poor outlook in terms of renal function, with a tendency towards recurrence in grafts. A review of renal biopsies from patients with diabetes mellitus showed with diabetic glomerulonephropathy alone, with other renal disease and with both diabetic glomerulonephropathy and other renal disease. Eight with the histological appearances of diabetic nephropathy without known diabetes were also identified. Clinical details were available on of the patients. These showed that in type (insulin dependent) diabetes, most showed diabetic glomerulonephropathy, showed both diabetic glomerulonephropathy and another diagnosis, and showed only another diagnosis. In type (non insulin dependent) diabetes, showed diabetic glomerulonephropathy, with showing another renal diagnosis without diabetic glomerulonephropathy, and with both diabetic glomerulonephropathy and another diagnosis. Patients investigated with nodular glomerulosclerosis resembling that seen in diabetes, but not known to be diabetic at the time of biopsy, showed abnormal glucose metabolism on investigation and showed a paraprotein. This review shows that in patients with clinical renal disease and diabetes mellitus, the renal disease cannot always be assumed to be diabetic nephropathy, especially in type diabetes where a wide range of other renal disease can occur. Histological findings of diabetic nephropathy may occur in patients before diabetes is diagnosed clinically. A. J. Howie, R. L. Bryan. In a biopsy of a renal transplant,
we noted an arterial abnormality that was unrecorded in standard texts. In an arcuate artery there was the appearance of formation of a new artery inside the old, with layers of muscle and elastic fibres that in places resembled an internal elastic lamina, separated from the original internal elastic lamina by loose connective tissue. In a systematic study of consecutive transplant biopsies and consecutive transplant nephrectomies, another six examples of this abnormality were found. These six and the index specimen showed changes consistent with chronic vascular rejection. In a systematic study of consecutive non-transplant renal biopsies showing interstitial nephritis, another example of this lesion was found, in a kidney showing chronic interstitial nephritis. This change is probably a variant form of muscularisation of the arterial intima, is seen in chronic renal damage, and occurs in transplanted and native kidneys. Localised amyloidosis of the lower genito-urinary tract is a rare disease. Few studies have attempted to characterise the amyloid type using immunohistochemical stains. We report a series of nine cases involving the bladder ( ), lower ureter ( ) and penile urethra ( ). These comprised males and females with an age range of - years. None of the patients had evidence of systemic amyloidosis. Three patients had a past history of lower genito-urinary tract infection and, of these, one had repeated instrumentation. One patient had prostatic carcinoma. The commonest presenting complaint was haematuria, and the most favoured surgical diagnosis was a neoplasm. Four patients had repeat biopsies for persistent or recurrent amyloidosis, with a time interval of up to years from initial presentation. Light microscopy showed amyloid deposition throughout the biopsies, in lamina propria, muscle, adipose tissue and vessel walls, with a variable giant cell reaction and lymphoplasmacytic infiltrate.
An ABC-immunoperoxidase technique was used with antibodies to P component, serum amyloid A protein, prealbumin, and kappa and lambda light chains in an attempt to classify the amyloid type. This was found to be of non-AA, non-prealbumin type. A negative or equivocal reaction was seen for kappa and lambda light chains; however, such antibodies may not necessarily be immunoreactive with light chains or fragments of light chains in amyloid deposits. Features of bile reflux type gastritis: glandular distortion (branching, "corkscrewing"), nuclear regeneration, intramucosal smooth muscle fibres and paucity of inflammation. Helicobacter pylori organisms were noted. Bile acid concentration in gastric aspirates collected by endoscopy at the time of biopsy was estimated using an optical density method. Of cases showing glandular distortion, had raised gastric bile acid concentration; of cases with marked nuclear regeneration had raised bile acids; and of cases with intramucosal smooth muscle fibres had raised bile acids. Nine of cases with raised bile acids showed a chronic inflammatory response. The histological features of bile reflux type gastritis which correlate best with estimation of bile in gastric juice are glandular branching, "corkscrewing" and intramucosal smooth muscle fibres, whereas paucity of chronic inflammation does not correlate. H. pylori was identified in of cases with raised bile acid. Aims: to compare the bacterial flora of normal and inflamed appendices, and to correlate this with various histological features; to establish the incidence of Yersinia infection in acute appendicitis in Southampton. Methods: resected appendices were sent fresh to histopathology; a cm portion was removed and cultured for aerobic and anaerobic organisms, and the remainder was fixed in % buffered formalin and processed for routine histology. Results: appendices showed acute appendicitis and were normal. Histology: there were statistically significant differences between the two groups for the presence of faecoliths (p < . ), fibrosis (p < .
) and prominent follicles (p < . ): all were more common in the normal appendices. Microbiology: no Yersinia was cultured in either group. There were statistically significant differences in the number in each group which grew anaerobes and streptococci: anaerobes were more common in the normal group (p < . ), and streptococci more common in inflamed appendices (p < . ). Conclusions: there is an altered bacterial flora in acute appendicitis; Yersinia does not contribute to acute appendicitis in Southampton; faecoliths, fibrosis and prominent follicles are significantly more common in normal appendices. Le antigen (using CAT- ) and type H antigen were studied using a specific monoclonal antiserum. We found a widespread distribution of type structures in both benign and malignant epithelium of the extrahepatic biliary tract and ampulla of Vater. Type H antigen expression was seen in benign epithelium and in non-papillary tumours and ampullary carcinomas. Immunohistochemical detection of type blood group antigens does not appear to be of prognostic value. The majority of cases bear a close relationship to human ulcerative colitis clinically, endoscopically and in response to treatment. One hundred postmortem livers were studied from cotton-top tamarins with total severe ulcerative colitis which were pathogen free and had a histological picture resembling human ulcerative colitis. Only livers were normal; in there was a mild periportal chronic inflammation, had extensive steatosis, had an appearance resembling chronic active hepatitis and in the histology resembled that of sclerosing cholangitis. Other hepatic pathologies were seen in smaller numbers. These changes parallel the liver disease seen in association with ulcerative colitis in man. We believe the cotton-top tamarin provides the first model of liver disease in ulcerative colitis to allow study of the pathogenesis of the extraintestinal manifestations of ulcerative colitis. Reservoir and ileo-anal anastomosis.
Pre- and post-surgical specimens were studied and compared using routine histological and histochemical techniques. The majority of cases exhibited an increase in chronic inflammation, with minimal or no acute inflammation. The levels of chronic inflammation were found to be most severe in those cases which suffered from pouchitis. Morphometrical analysis revealed an increase in crypt depth to villous height ratio (CDVH) in % of cases. The CDVH in the pouchitis cases was greater than in the other specimens studied. The results of mucin histochemical analysis did not show any characteristic changes occurring in the cases studied. Lectin histochemistry demonstrated an increase in supranuclear staining of the pouches with DBA, SBA, WFA and VVA, which is similar to that found in proximal colon. Staining with PSA, LCA, UEA and LTA revealed characteristic changes in binding pattern; these may indicate changes in fucosylation of certain cellular components. Certain changes in lectin binding were found to be more significant in the pouchitis group of cases. The changes in the reservoir mucosa are most likely to be an adaptive response to the new intra-luminal environment, with the acquisition of certain colonic characteristics. There are also specific changes which occur to a greater degree in the cases with pouchitis; these may occur as a result of, or lead to, the occurrence of pouchitis. Cryostat sections from ten patients with ileal pouches for extensive colitis, ten distal ulcerative colitics and five normal small intestines were assessed immunohistochemically for macrophage (Leu M antibody), RFD1 (interdigitating antigen presenting cells) and RFD7 (resting macrophages) antigen-containing cells in lamina propria. Results are expressed as a percentage of the total. Normal small intestine contained a mean of %, % and % (ranges - , - and - ) Leu M, RFD1 and RFD7 positive cells respectively, which contrasts with the higher values in pouch biopsies: %, % and % (ranges - , - and - ) respectively. Distal ulcerative colitics contained increased macrophage numbers ( ,
and % for the three antibodies) but had a predominance of the RFD1-positive population, in contrast to pouchitis patients in whom the RFD7-positive cells predominate. These findings demonstrate remarkably high macrophage numbers in pouchitis; however, the RFD7-positive sub-population's predominance suggests an aetiology other than ulcerative colitis for pouchitis. Margaret Balsitis, Y. Mahida. The wide family of cell adhesion molecules (including the subgroups of the immunoglobulin supergene family, the integrin receptors and the selectins) is involved in controlling interactions between endothelial cells and leucocytes in inflammatory states. This is partly by their influence on adhesion and migration of leucocytes to and through endothelium. Using monoclonal antibodies, we have compared expression of the three cell adhesion molecules ICAM-1 (intercellular adhesion molecule-1), ELAM-1 (endothelial leucocyte adhesion molecule-1) and VCAM-1 (vascular cell adhesion molecule-1) in normal colonic mucosa with mucosa from cases of inflammatory bowel disease. We have found expression of these three molecules to be increased in inflammatory bowel disease, and this change involves endothelial cells as well as leucocytes. We will discuss the relationship between adhesion molecule expression and disease activity and possible therapeutic implications. Granulomatous enterocolitis associated with therapeutic irradiation. Mangham, K. M. Newbold, S. Dover, The University of Birmingham Department of Pathology, School of Medical Science, The Medical School, Edgbaston. Chronic radiation-induced enterocolitis is a well recognised complication of radiotherapy for intra-abdominal or pelvic malignancy. The pathological features include perforation, fistula formation, segmental necrosis and stricture formation. The histological features are essentially ischaemic in nature, consequent upon the characteristic vascular damage.
Two cases of irradiation-induced granulomatous enterocolitis are described in which non-caseating epithelioid granulomas were present in the bowel wall. To our knowledge, granulomas have not been previously described in radiation colitis or any other form of ischaemic bowel disease. The granulomas were largely confined to the mucosa-associated lymphoid tissue and the draining lymph nodes. Naked submucosal granulomas were, however, also present. This distribution is similar to that seen in Crohn's disease and tuberculosis. No evidence of these, or any other systemic granulomatous conditions, was present. These cases support the view that granulomas in bowel disease are secondary to an impaired mucosal barrier to antigens and highlight the non-specificity of granulomas in the diagnosis of inflammatory bowel diseases. Between and % of Crohn's disease patients have involvement of the upper gastrointestinal tract. The identification of granulomata is normally required to make a specific diagnosis, but this may be difficult in small endoscopic biopsies. We present two cases where there was a strong suggestion of upper gastrointestinal Crohn's disease, with a definitive diagnosis established on more distal involvement. Four of the biopsies contained granulomata; the remainder showed either patchy inflammation, local ulceration or villous distortion only. In an effort to establish whether Crohn's disease biopsies contained specific macrophage or lymphoid populations, we applied the monoclonal antibodies MT (CD ), UCHL1 (CD45RO), muramidase, MAC, alpha-1-antitrypsin and KP1 (CD68) using immunohistochemistry. The results were compared with cases of non-specific duodenitis or jejunitis. Crypt-restricted loss of G6PD activity has been used to quantify carcinogen-induced somatic mutation of the X-linked gene. Samples of peripheral nerve (sciatic nerve) of mice from varying age groups were examined over the past years. Development of age-related peripheral nerve fibre degeneration was observed among mice over months of age.
The spontaneous peripheral neuropathy was characterized by Wallerian type degeneration. Teased nerve fibres from mice of years old showed evidence of swelling of the myelin sheath and fragmentation of myelin into myelin balls or ovoids, with areas of segmental demyelination. Light microscopic examination of H&E stained sections revealed axonal degeneration and myelin fragmentation, with vacuolation of nerve fibres. On ultrastructural studies there was evidence of axo-myelin degeneration, axoplasm showing dark stained bodies suggestive of degenerate mitochondria. The myelin sheath also showed disorganization and fragmentation of myelin, sometimes with whorl formation within the cytoplasm of Schwann cells, with evidence of autophagocytosis by Schwann cells. In some instances there were signs of proliferation of Schwann cells. Nerve fibre degeneration affected both non-myelinated and myelinated fibres. Generally, there were apparently greater numbers of degenerate non-myelinated nerve fibres than myelinated nerve fibres. Furthermore, the numbers of degenerate myelinated nerve fibres were less than in spontaneous peripheral neuropathy in ageing rats (personal observation). Spontaneous peripheral neuropathy of sciatic nerves mainly appeared in mice aged months or more, and was seen only very rarely in mice of months. Younger mice did not show any evidence of this change. Age-related peripheral neuropathy is not usually accompanied by any clinical signs. Neuropathology, University of Iowa, Iowa, USA. Vascular endothelial cells (EN) constitute the interface between the bloodstream and the tissues and perform several key roles in the development of immune and inflammatory responses. Endothelial cells from the brain display significantly different morphology from other EN, having tight junctions between them and a paucity of micropinocytotic vesicles. During inflammatory conditions of the CNS, interaction between inflammatory cells and blood-brain barrier EN is an initial important event.
it is known that expression of several adhesion molecules on brain en is upregulated in inflammatory conditions, and cytokine induction of these molecules has been investigated. however, an important mechanism of adhesion molecule induction, and hence modulation of inflammatory cell/en binding, might be viral infection of brain en. we have examined the ability of measles virus (human ) and herpes simplex virus to infect cerebral endothelial cells. adhesion of syngeneic splenocytes to both virally-infected and mock-infected cells was determined using a chromium release assay. by h of infection with measles virus, cytopathic effect was evident, and splenocyte adherence was increased to a mean of % of the mock-infected control. after h, expression of the adhesion molecules meca- and mala- on virally-infected cells was determined. possible mechanisms for enhancement of adhesion will be discussed, and the subsequent implications of virus infection of cerebral en in relation to homing of inflammatory cells into the brain. ultrastructural observations were made on the mechanism and route of inflammatory cell diapedesis through cerebral vessel walls in an experimental model of canine distemper virus encephalomyelitis in the hamster. migrating monocytes and lymphocytes extended pseudopodia which contacted, indented and adhered to endothelium. they then invaded the endothelial cells, becoming enveloped by endothelial cytoplasmic processes which re-established the continuity of the vascular lining as the migrating cell passed through. although migrating cells were frequently seen close to interendothelial junctions, they were never seen within junctions, or between adjacent endothelial cells, and there was no evidence of opening of interendothelial tight junctions. after passing through the endothelial layer, cells squeezed through small pores (migration pores) in the sub-endothelial basal lamina. the present study confirms and extends previous observations on the operation within the cns of a trans-endothelial,
parajunctional route for diapedesis of inflammatory cells. pathology, royal victoria hospital, belfast, northern ireland. a service for the biochemical diagnosis of lysosomal storage diseases has now been in operation in belfast for six years. it currently has assays available for the measurement of lysosomal acid hydrolases in plasma, serum, leucocytes, cultured skin fibroblasts, amniotic fluid and cultured amniotic cells for diagnosis of lysosomal storage disorders. so far patients have been referred from throughout ireland, with having a positive diagnosis. of these, were diagnosed as being homozygous for a specific lysosomal enzyme deficiency, were identified as having multiple enzyme deficiencies (mucolipidosis type , i-cell disease) and had heterozygote (carrier) enzyme levels. the latter were either parents (obligate heterozygotes) or siblings of homozygotes, and one was a heterozygote for x-linked recessively inherited fabry's disease. in addition, prenatal diagnosis has been performed on mothers with a family history of i-cell disease and/or hurler's syndrome; one of these proved to be positive for hurler. the results of the biochemical investigations of these cases are presented. high levels of circulating immunoglobulins are common in liver disease. in alcoholic liver disease deposits of immunoglobulin a (iga) have been found in the liver and the renal mesangium, and were part. a case of a -year-old female with a -year history of cardiomyopathy and progressive muscle weakness after the birth of her baby. she was subsequently fitted with a pacemaker. clinically, there was severe weakness of neck flexors with proximodistal weakness in both arms and mild weakness of hip flexors. the most striking weakness was in her breathing muscles. there was no ptosis or facial weakness. reflexes were symmetrical and her plantars were flexor. her ck was normal. a quadriceps muscle biopsy revealed abnormal variation in fibre diameters affecting both fibre types.
occasional pink hyaline inclusions which stained for acid phosphatase and with pas were seen in both fibre types. electron microscopy showed these inclusions to consist of aggregates of nm-diameter filaments enmeshed within a central core of dense amorphous material. in other areas the amorphous material lay as irregular patches within the sarcoplasm, mainly at the level of the "z" line, causing disintegration of the sarcomere. immunoelectron microscopy using colloidal gold showed that the dense amorphous material reacted strongly with. days in ovo and from day hatchlings readily form both fibroblastic and cartilage colonies in vitro. cells within the cartilage colonies are polygonal in morphology, are separated by a refractile extracellular matrix and synthesise cartilage-specific proteoglycans and collagens, as shown histochemically, biochemically and immunocytochemically. in contrast, adult chicken bone marrow does not form cartilage; instead, the cells appear osteoblast-like and synthesise collagens typical of bone. a. m. flanagan, t. j. chambers, department of histopathology, st george's hospital medical school, london, sw . osteoclasts have been successfully generated in cultures of murine haemopoietic cells. it would be useful if a similar model were available to analyse the mechanisms of regulation of human osteoclast formation in normal and pathological states. although the osteoclast is present in several neoplastic conditions, it is not known whether it forms part of other haemopoietic malignancies, for example polycythaemia rubra vera or chronic lymphatic leukaemia. we used strategies based on our experience with murine osteoclastogenesis. however, we have been unable to generate functional human osteoclasts capable of resorbing bone in vitro. large numbers of multinucleate cells developed in these cultures. these cells did not show a typical pattern of reactivity with osteoclast-specific monoclonal antibodies, nor did they bind sct, but rather possessed an antigenic profile characteristic of macrophage polykaryons.
it is peculiar that human tissue fails to support osteoclast generation since cells of the other haemopoietic lineages were consistently generated in our cultures. in murine cultures it is known that a particular stromal cell type is required for osteoclastogenesis; it is possible therefore that this cell population is sparse in adult human tissue. this would reflect the low number of osteoclasts present in human adults compared to mice. elucidation of conditions suitable for the generation of the human osteoclast in vitro will help us understand the mechanisms by which it is regulated in health and disease. space of one year. the other case is a middle-aged woman who, having been affected by the disease for several years, developed separate sarcomas in each lower limb during a nine month period. each tumour was treated by partial amputation but she died from widespread metastases within one year of the first amputation. the distribution of proline- -hydroxylase in a range of human tissues shown by a monoclonal antibody, fib . s. smith, p. revell, department of morbid anatomy and bone and joint research unit, the london hospital medical college, london, e . fib , a commercially available monoclonal antibody to the beta subunit of the enzyme proline- -hydroxylase involved in the production of collagen, was applied to acetone-fixed cryostat sections of a wide range of tissues. in liver, hepatocytes showed varying degrees of labelling with the antibody, along with spindle-shaped cells (sscs) in associated connective tissue. in both tonsil and lymph node, sscs in the connective tissue were labelled with the antibody along with a number of other cells (lymphocytes) within the germinal centres. in addition, perivascular cells and cells in the connective tissue layer immediately below the epithelium also labelled in sections of tonsil. in skin, very few cells in the dermis were positive for proline- -hydroxylase, as were also only a small number of epidermal cells.
a few chondrocytes were marked by this antibody in articular cartilage and intervertebral disc, along with some osteocytes in bone. many cells in the pleura and the epimysium of normal skeletal muscle were positive. there was no staining at all in specimens of kidney. in addition to these 'normal' tissues, examples of a seminoma, a breast carcinoma and a thymoma were examined. each of the tumours showed fib labelling of sscs within the stromal tissue. these results show that fib selectively labels sscs, which are presumed to be fibroblasts, in a wide range of both normal and pathological tissues, which produce and maintain the collagen matrix. proline- -hydroxylase is an enzyme involved in the production of collagen and demonstrable in the rer of fibroblasts. it may therefore be considered as a possible marker for fibroblasts in tissue sections. we have applied the mouse monoclonal antibody fib , raised against the beta subunit of proline- -hydroxylase, to alcohol-fixed paraffin-embedded sections and acetone-fixed cryostat sections of normal (n = ) and rheumatoid (n = ) synovia. synoviocytes labelled very strongly, but the distribution varied between samples. in some, most synoviocytes were marked, while in others only the deep layer of cells was positive. it is not known whether this difference in proline- -hydroxylase expression is due to the severity and duration of disease, drug treatment or other factors. it is of interest, however, that fib labelling was not confined to the type b (fibroblast-like) synoviocytes and that the more superficial type a synoviocytes also contained proline- -hydroxylase. spindle-shaped cells in the subintimal connective tissue were labelled in a variable manner. in addition to these findings for fibroblastic cells, a small cap of fib -positive cells was seen around lymphoid follicles, sometimes polarised towards the synovial surface. this enzyme expression by lymphocytes is under further investigation. vitronectin is an adhesive glycoprotein which shares several functional similarities with fibronectin.
it is a major component of extracellular matrix and plays a role in cell-matrix interactions, monocyte function, and the coagulation and complement systems. we have employed a monoclonal antibody to assess the distribution of vitronectin in frozen synovial biopsies from patients with rheumatoid arthritis, osteoarthritis, ankylosing spondylitis and traumatic non-inflammatory controls ( cases). immunoreactive vitronectin was identified in the synovium. similar to the pattern of immunoreactive fibronectin, it was located in a dense fibrillar pattern surrounding cells of the synovial lining layer. vitronectin was also associated with fibres throughout the sub-intimal connective tissues, being most prolific in fibrotic areas. vitronectin was identified in the sub-endothelial layers of blood vessels, but unlike fibronectin was not associated with basement membranes. similar distribution patterns were observed in all biopsies studied. the localisation of vitronectin in synovial tissues suggests a possible role in attachment of cells to the extracellular matrix and that it may be important in the pathophysiology of inflammation and repair. monolayer-cultured articular chondrocytes are known to rapidly lose their expression of collagen type ii. the purpose of this study was to compare the expression of various collagen types in two- and three-dimensional cultures of articular chondrocytes. in addition, the expression of s- protein and its alpha and beta subunits was studied in monolayer cultures. bovine articular chondrocytes were isolated by collagenase digestion from ankle joints and cultured in monolayer and spheroid cultures in dmem with % fcs. immunocytochemical studies were performed using the indirect peroxidase technique on monolayers after methanol/ethanol ( v/v) fixation and on frozen sections of spheroids.
the reaction was scored semi-quantitatively as negative (-), weakly (+) or moderately (++). on the basis of the numbers of cases referred to us for the identification of joint crystals in tissue sections, we believe that calcium pyrophosphate dihydrate (cppd) deposition disease is being underdiagnosed by histopathologists. a critical light microscopic review of cases, all with the diagnosis confirmed by microanalysis (energy-dispersive x-ray spectroscopy in the scanning electron microscope, infrared spectroscopy, or both), revealed a distinctive feathery or brush-like appearance in all such deposits. this feature was apparent at low power, while convincing visualisation of crystals within the deposits was difficult even with an oil immersion objective. the sign of birefringence of cppd crystals is more difficult to demonstrate in tissue sections than synovial fluids, due to factors such as stain, fragmentation and heaping up of crystals, but it can be more readily assessed in unstained sections or following microincineration of the section. in six of our cases the deposits were exclusively within bone: demonstration depended on relative underdecalcification; deposits in this position have not previously been recognised. this study thus provides new information on the histological identification of cppd deposits in tissues and implies heterogeneity in the pathogenesis of such deposits. in human knee osteoarthrosis (oa), overt unicompartmental disease is frequently accompanied by a macroscopically normal second compartment. however, there are a number of light microscopical changes within the latter which might amount to an "early osteoarthrosis". these are chondrocyte proliferation and chondrone formation, decreased proteoglycan staining and disruption of the chondro-osseous interface with duplication of the tidemark and resorption of the cartilage by chondroclasts.
tidemark duplication is also a feature of established oa, representing a disturbance of calcification in zone of articular cartilage. since s- protein is a known calcium ion transport molecule, alteration in its staining pattern and intensity might be an index of early oa. s- protein staining was studied in the tibial plateau cartilage of cases: with unicompartmental oa, with bicompartmental oa and controls. in the latter, all chondrocytes gave positive membrane and cytoplasmic staining (especially in zones and ), with diffuse weak matrical staining and strong pericellular and inter-territorial staining in zones and . in overt and early oa the orderly matrical pattern was disrupted, with prominent patchy staining around chondrocyte clusters. the significance of the change in the matrical distribution in oa is unknown and unlikely to be related to a disturbance of calcification in zone . road, london, wc . the histogenesis of alveolar soft part sarcoma (asps) is still unsettled: non-chromaffin paraganglioma, malignant granular cell myoblastoma, neural tumour, myogenic tumour, rhabdomyosarcoma and renin-producing tumour theories have been proposed. immunohistochemical studies have yielded discrepant results. we recently observed that the diastase-resistant crystalline material in asps was strikingly reactive for vimentin by the immunoperoxidase method on paraffin sections in one case. we then studied paraffin sections of seven other cases from our files between and by routine light microscopy, immunohistochemistry and electron microscopy. all eight cases were reactive for vimentin, which highlighted the crystalline material as in the first case. variable numbers of cells immunoreactive for desmin were found in three, and for smooth muscle-specific actin in four. all cases showed some reactivity for neuron-specific enolase, and seven of the eight cases reacted for s- protein. the clinical and pathological features of cases of synovial sarcoma have been reviewed. these tumours are aggressive with a poor prognosis,
typically affecting young adults with a long history before presentation and definitive diagnosis. a high index of suspicion is required if the diagnosis is not to be missed. several features, including age, size, site, histological type (monophasic/biphasic), mitotic activity and necrosis, have been investigated with a view to establishing possible prognostic indicators. the best guide to prognosis is assessment of mitotic activity; size and histological type did not affect the outcome. in this study the effect of subcapsular orchidectomy on skeletal metastases from prostatic carcinoma was studied using bone histomorphometric parameters. twenty-eight patients with bone metastases were studied immediately before and for seven months after orchidectomy. tetracycline-labelled bone biopsies were taken from metastases and tumour-free areas at the beginning and end of the study. sixteen of the patients also underwent osteoclast inhibition for six months using disodium pamidronate ( mg i.v. weekly for weeks, then alternate weeks for five months) to enable differentiation of the skeletal response to castration. histomorphometric analysis of tumour-free bone revealed a drop in overall bone volume. histological analysis of metastases showed a decrease in osteoblastic activity but widespread osteoclast-mediated osteolysis. tumour regression and marrow recolonisation were present in most cases, but malignant foci remained in % of repeat metastatic biopsies, inducing a typical, localised disruption of bone metabolism. it is concluded that orchidectomy causes osteoblastic regression but induces increased osteoclast-mediated bone destruction, which is most pronounced within metastases. although tumour regression and marrow recolonisation usually occur within metastases, active tumour foci often persist. ( ) chairman s. g.
silverberg, washington. the effects of oncogenes on the developing nervous system have been studied in a neural transplant system in rats, taking advantage of the extraordinary capacity of fetal cns to differentiate in, and fully integrate with, the adult host brain. gene transfer into fetal brain cells was mediated by in vitro infection with replication-defective retroviral vectors. fetal rat brain suspensions were then stereotaxically injected into the caudoputamen of adult f rats. animals carrying transplants exposed in vitro to the polyoma middle t antigen developed endothelial hemangiomas in the graft which often led to fatal cerebral hemorrhage within - days after transplantation. introduction of the viral src gene caused astrocytic and mesenchymal tumors after latency periods of months. infection of fetal donor cells with a vector encoding the v-myc oncogene led to the development of only a single embryonal cns tumor, whereas exposure to v-ha-ras and v-myc resulted in the rapid induction of multiple malignant neoplasms. when injected intracerebrally into newborn rats in vivo, complementation of these oncogenes led to the development of malignant hemangioendotheliomas, undifferentiated neural tumors and/or leukemia. oncogene transfer thus constitutes a challenging new model to assess the effects of transforming genes on the developing nervous system. since the presentation of molecular genetic evidence supporting the knudson hypothesis in the early s, there has been increasing interest in the idea that a similar two-hit mechanism may operate in sporadic cancers. there is now evidence that loss of one allele and retention of the other in mutated/wild form occurs in sporadic colonic, lung, renal and breast cancers. in breast cancer, loss of heterozygosity (loh) has been demonstrated for several alleles on chromosomes p and q and on the short arm of chromosome . loh has also been demonstrated on other chromosomes. the general view is that loh indicates that a suppressor gene may be present at these loci.
however, in interpreting loh there is difficulty in determining when loh becomes significant, because its incidence varies from % depending on the locus examined. in this presentation, knowledge about loh on chromosome p (and q, where a new deletion has been recently discovered) and p will be presented and the pathogenetic significance and clinical relevance examined. in the latter context, a new rapid method for determining allelic dosage will also be described which does not require rflp analysis. expert systems, i.e., computer software that can function as consultant, decision support system, or process controller, have undergone a rapid development during the past decade. applications to histopathology have grown. some diffuse malignant mesotheliomas (dmm) occur annually in n. america at the present time. cumulatively, substantial numbers of lesser-known forms of mesothelial tumour are also seen, including serous papillary tumours of the peritoneum, well-differentiated papillary mesotheliomas and benign cystic mesotheliomas, but incidence data are not available. the relatively frequent atypical reactive hyperplasias of serosal membranes add further variety to the range of mesothelial lesions which may present diagnostic problems. the experience of a u.s.-canadian mesothelioma panel suggests that dmm vs. metastatic carcinoma continues to be the commonest mesothelioma-related diagnostic challenge and that the difficulty is not always solved by the application of special stains. almost as common are lesions in which the differential diagnosis centers on dmm vs. a reactive process (atypical mesothelial hyperplasia, fibrous pleurisy); within the latter group it is particularly difficult to obtain a strong consensus opinion, and follow-up has confirmed the majority opinion in only % of cases. sarcomatous dmm vs. sarcomatosis is also a notable area of difficulty. most mesothelial lesions in north america, including dmm,
are reviewed within a short time of biopsy by a pathologist in a major teaching centre, only a small number ( - %) being subject to further scrutiny by a panel or in the course of epidemiological or clinical studies. however, many cases resurface later in a legal setting, when the material and clinical information is often available. the original diagnosis of the hospital pathologist is usually endorsed by the courts. sharon w. weiss. liposarcoma is one of the most common forms of soft tissue sarcoma and may be divided into several subtypes: well-differentiated, myxoid, round cell, pleomorphic, and dedifferentiated. this presentation will discuss ( ) the diagnostic criteria of lipoblasts, ( ) the diagnosis and behavior of myxoid/round cell liposarcoma, and ( ) the behavior and incidence of "dedifferentiation" in well-differentiated liposarcoma. the diagnosis of liposarcoma depends in part on the identification of lipoblasts or primitive fat cells. since lipoblast-like cells may be seen in a variety of conditions apart from liposarcoma, strict diagnostic criteria must be applied in identifying these cells. these cells contain a hyperchromatic, indented or scalloped, eccentric nucleus set in a cytoplasm containing one or more lipid-rich vacuoles. these cells must also occur in an appropriate histologic background. various lesions which may contain lipoblast-like cells, and which may therefore mimic liposarcoma, include fat necrosis, fat atrophy, silicone granuloma, signet ring carcinomas or melanomas, and a variety of malignant tumors with fixation artifact. myxoid and round cell liposarcoma represent the most common form of sarcoma occurring in early to mid adult life and are commonly located in the region of the thigh and popliteal fossa. although the designations "myxoid" and "round cell" suggest two separate tumor types, they represent ends of a common histologic spectrum. myxoid liposarcoma represents the well-differentiated end of the spectrum,
whereas round cell represents the poorly differentiated end. however, transitional or mixed forms exist, accounting for confusion as to how such tumors should be classified. myxoid liposarcomas are characterized by nodules of bland stellate or rounded cells set in hyaluronic acid-rich stroma. an intricate plexiform vasculature and numerous lipoblasts are easily identified. with progression to round cell liposarcoma the cells become larger and more atypical and the stroma less myxoid; lipoblasts are more difficult to find and a plexiform vasculature is less apparent. recent work by evans suggests that evaluation of myxoid/round cell liposarcoma should include the percentage of round cell component within a tumor, since this directly influences prognosis. our policy is to carefully sample such tumors, submitting one section for each centimeter in greatest diameter of the tumor. a rough estimate of the percentage of a round cell component is made. we regard tumors having less than % round cell component as grade i; those having between and % round cell component are considered grade , whereas those with more than % are considered grade . well-differentiated liposarcoma is one of the most common sarcomas occurring in late adult life. characteristically affecting the deep muscles of the extremities, the retroperitoneal space, and the groin, this tumor is characterized by varying amounts of mature fat interspersed with fibrous bands, atypical hyperchromatic spindled cells, and lipoblasts. these lesions are considered low-grade sarcomas, having a high rate of local recurrence but no ability to metastasize. recently evans and colleagues suggested that well-differentiated liposarcomas occurring in the subcutaneous tissue and muscles of the extremity cause so little morbidity that the term "atypical lipoma" should be used rather than liposarcoma. in contrast, well-differentiated liposarcomas in the retroperitoneum cause significant morbidity and pose a significant risk of death from local disease. thus,
nearly all pathologists agree that the term "well-differentiated liposarcoma" should be retained for these tumors in the retroperitoneum. however, none of these recent studies have followed a large number of these tumors for a prolonged period of time in order to assess the long-term behavior and, specifically, the risk that such lesions may progress with time to a higher-grade lesion (i.e., dedifferentiate). we have recently completed a follow-up study of cases of well-differentiated liposarcoma occurring in the muscles of the extremity, retroperitoneum, and groin. in all locations the risk of local recurrence was high. however, dedifferentiation occurred in % of cases and was not restricted to tumors in any particular location, but could be best correlated with the duration of the tumor, with most cases occurring after or more years. thus, dedifferentiation is not a site-dependent phenomenon, as has previously been suggested, but rather a time-dependent phenomenon. although these data do not necessarily indicate that we should abandon the term "atypical lipoma", they do indicate the need for prolonged follow-up of patients with well-differentiated liposarcoma and the small but definite risk of dedifferentiation as a long-term complication of the disease. key: cord- -tbuijeje authors: villalobos, carlos title: sars-cov- infections in the world: an estimation of the infected population and a measure of how higher detection rates save lives date: - - journal: front public health doi: . /fpubh. . sha: doc_id: cord_uid: tbuijeje this paper provides an estimation of the accumulated detection rates and the accumulated number of individuals infected by the novel severe acute respiratory syndrome coronavirus (sars-cov- ). worldwide, on july , it has been estimated that more than million individuals have been infected by sars-cov- . moreover, it is found that only about out of infected individuals are detected.
in an information context in which population-based seroepidemiological studies are not frequently available, this study shows a parsimonious alternative for providing estimates of the number of sars-cov- infected individuals. by comparing our estimates with those provided by the population-based seroepidemiological ene-covid study in spain, we confirm the utility of our approach. then, using a cross-country regression, we investigated whether differences in detection rates are associated with differences in the cumulative number of deaths. the hypothesis investigated in this study is that higher levels of detection of sars-cov- infections can reduce the risk exposure of the susceptible population with a relatively higher risk of death. our results show that, on average, detecting instead of percent of the infections is associated with multiplying the number of deaths by a factor of about . using this result, we estimated that days after the pandemic outbreak, if the us had tested with the same intensity as south korea, about , out of their , reported deaths could have been avoided. this paper provides an estimation of the accumulated detection rates and the accumulated number of individuals infected by the novel severe acute respiratory syndrome coronavirus (sars-cov- ). worldwide, on july , it has been estimated that more than million individuals have been infected by sars-cov- . moreover, it is found that only about out of infected individuals are detected. in an information context in which population-based seroepidemiological studies are not frequently available, this study shows a parsimonious alternative for providing estimates of the number of sars-cov- infected individuals. by comparing our estimates with those provided by the population-based seroepidemiological ene-covid study in spain, we confirm the utility of our approach.
then, using a cross-country regression, we investigated whether differences in detection rates are associated with differences in the cumulative number of deaths. the hypothesis investigated in this study is that higher levels of detection of sars-cov- infections can reduce the risk exposure of the susceptible population with a relatively higher risk of death. our results show that, on average, detecting instead of percent of the infections is associated with multiplying the number of deaths by a factor of about . using this result, we estimated that days after the pandemic outbreak, if the us had tested with the same intensity as south korea, about , out of their , reported deaths could have been avoided. governments and policymakers dealing with the covid- pandemic will fail in their objectives if their actions are guided by misleading data or subsequent misinformation. the authorities should have reliable estimations of the number of sars-cov- infected individuals. however, there are few attempts to estimate the total number of infections ( ) ( ) ( ) ( ) ( ) . consequently, health systems face enormous challenges, since an unknown and probably high proportion of all sars-cov- infections remains undetected. moreover, data suggest that infected individuals can be highly contagious before the onset of symptoms and that sars-cov- can also be highly contagious in individuals who will never develop any symptoms ( ) ( ) ( ) ( ) ( ) . undetected infections are dangerous because infectious individuals spread the coronavirus in unpredictable ways. undetected infections consist of non-pcr-tested individuals with symptoms and asymptomatic individuals (non-covid- patients) who are likely to remain undetected over all phases of the infection. however, non-pcr-tested individuals with symptoms would tend to auto-select themselves, depending on the severity of their symptoms (from mild to severe), toward treatment and late detection.
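the cross-country idea above, that countries detecting a larger share of their infections accumulate fewer deaths, can be sketched as a simple log-linear regression. the data and the `ols_slope_intercept` helper below are purely illustrative (the paper's actual sample, estimates, and control variables are not reproduced here):

```python
import math

def ols_slope_intercept(x, y):
    """Closed-form simple OLS: slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic "countries": higher detection rate paired with fewer deaths.
detection = [0.05, 0.10, 0.20, 0.40, 0.60]   # share of infections detected
deaths    = [20000, 9000, 2500, 600, 150]    # cumulative deaths (made up)
log_deaths = [math.log(d) for d in deaths]

slope, intercept = ols_slope_intercept(detection, log_deaths)
# a negative slope is consistent with the hypothesis that detecting a
# larger fraction of infections is associated with fewer deaths
```

because the outcome is in logs, the slope has the multiplicative interpretation used in the text: moving the detection rate changes deaths by a factor, not by an additive amount.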
for this reason, it is important to know the proportion of the infected population which is asymptomatic or has such mild symptoms that it self-selects into the group of non-pcr-tested individuals ( ) ( ) ( ) ( ) ( ) . here, regarding the estimation of the number of infections, and for purposes of public health, i advocate the view of amartya sen and martha nussbaum that it is preferable to be vaguely right than precisely wrong. the public health problem is that undetected asymptomatic individuals, as well as late-detected sars-cov- infected individuals, increase the risk for vulnerable groups. since there is a transmission channel between the level of detection and the number of deaths, the early detection of asymptomatic infections, pre-symptomatic infections, and mild covid- cases is a public health concern. moreover, undetected cases are also responsible for the collapse of the health system by numerous aggravated and sometimes unexpected covid- patients requiring treatment in a short period. overwhelmed health care systems reduce the recovery prospects of patients through lack of treatment, undertreatment, and increased risk of mistreatment of all patients, including those with covid- , and also put the health workforce at unnecessary risk ( , ) . the problem is that many governments formulate their strategies and responses to the pandemic based on figures that they can control. this problem of reverse causality produces counterproductive incentives for governments, since public opinion tends to react negatively to the report of the cumulative and marginal numbers of detected (reported) cases. the contradiction is that something good, such as an increase in the testing efforts by governments, can be perceived by public opinion as something bad (due to the increase in detections). worldwide, the media communicates confirmed cases and deaths as the relevant parameters to take into consideration when assessing the evolution of the pandemic.
This is a mistake, since this emphasis discourages governments from decidedly pushing for mass testing, with the obvious consequence of an increased number of detected cases (although, as shown in this paper, there is a theoretical mechanism relating more testing with saving lives). More sophisticated observers would use crude and adjusted case fatality ratios to assess the pandemic's evolution. However, international comparisons show that crude and adjusted case fatality ratios are highly heterogeneous, and their use can be misleading ( , ). For instance, the simple division of the cumulative number of deaths by the cumulative number of confirmed cases underestimated the true case fatality ratio in past epidemics ( , ). Although many case fatality ratios estimated during this pandemic correct for many of the observed past biases ( ) ( ) ( ), they still depend on the testing efforts made by countries. The problem with heterogeneous case fatality ratios (different proportions of all cases that will end in death, due to methodological differences in the denominator) is that they are not anchored to any exogenous information that would allow researchers to perform international or territorial comparisons based on credible and transparent assumptions. Consequently, relying on the number of confirmed cases makes international comparisons impossible, since governments have been shown to implement highly heterogeneous SARS-CoV-2 testing strategies, ending up with different levels of location-based under-ascertainment. In an attempt to solve this problem, we anchor our analysis in the cumulative number of deaths, a statistic that is, in free societies, much more difficult to alter than the number of SARS-CoV-2 tests. We use this information together with the newest sound estimates of the age-stratified infection fatality ratios (IFRs) provided in the recent SARS-CoV-2 literature. In particular, we base our analysis on the IFR of . % reported in Verity et al. ( ). This IFR is very close to the . % reported in a meta-analysis of IFR estimates from a wide range of countries published between February and April of ( ). We also assume orthogonal attack rates of the infection, which is also supported by recent literature ( ). By weighting the age-stratified IFRs by the population age-group shares in each country, it is possible to obtain country-specific IFRs. The relevance of this study is threefold. Firstly, the estimation of the true number of infections includes not only confirmed cases but also undetected COVID-19 cases, as well as SARS-CoV-2-infected individuals without the disease or in a pre-symptomatic stage. Therefore, providing an estimate of the true number of SARS-CoV-2 infections is of more utility than being informed only about the number of confirmed infections. This is because confirmed cases depend on testing efforts, which can be altered or even manipulated by governments. Moreover, one can compare the true estimate of infections with the number of COVID-19 patients requiring hospitalization. Such ratios can contribute to predicting, with exogenous-to-government information, shortages in the health systems. Secondly, the estimation of the true number of SARS-CoV-2 infections allows us to estimate the detection rate of the infection, which is a measure of the performance of health systems and governments in facing the pandemic. One can expect that higher levels of detection of SARS-CoV-2 infections, which include the asymptomatic population and those in the early stages of the infection (who are more infectious), can reduce the risk exposure of the susceptible population with a relatively high risk of death, that is, the elderly and individuals with preexisting conditions ( ). Accordingly, a highly neglected statistic such as the detection rate should be considered highly relevant from the public health point of view.
Thirdly, in this paper we test the hypothesis that higher detection rates can save lives, while providing a measure of this impact (keeping in mind that it is preferable to be vaguely right than precisely wrong). Thus, this study aims to quantify the importance of testing while providing empirical support for the utility of implementing massive SARS-CoV-2 tests. Overall, this study argues that it is crucial to compute the evolution of the cumulative number of estimated SARS-CoV-2-infected individuals and, subsequently, the cumulative detection rates. This information would give public health managers and governments incentives to improve detection rates, rather than the opposite. Moreover, the identification strategy can be used at lower levels of aggregation, such as regions, provinces, and municipalities, to improve responses to the pandemic, including the planning of selective lockdowns or spatially selective enhancements of installed critical care units. In summary, this study proposes a baseline estimation of the number of SARS-CoV-2 infections and detection rates based on current information and transparent assumptions. However, the assumptions discussed later in this paper can be modified to match the currently available scientific evidence and country-specific developments and contexts. For this research, we use the cumulative number of deaths and confirmed cases in the world and by country, published by ourworldindata.org, a project of the Global Change Data Lab with the collaboration of the Oxford Martin Programme on Global Development at the University of Oxford. Age-stratified demographic proportions of the population were obtained from the UN population data ( ). The estimated IFRs correct for many types of bias: the infection fatality ratios were obtained by combining adjusted case fatality ratios with data on infection prevalence amongst individuals returning home from Wuhan on repatriation flights. The relevant lag for our calculations is the number of days between infection and death. Since this number is unknown, we approximate it using the sum of the median incubation period, as reported in Lauer et al. ( ), and the mean number of days between the onset of symptoms and death, as reported in Verity et al. ( ). For our empirical exercise, we rely on world development data from the World Bank (GDP per capita and health expenditure as a share of GDP) and on World Health Organization data for BCG vaccination. In this study, our regression analysis relies on data for countries covering above % of the world population. The remaining countries were excluded because they either did not have significant mortality figures (for instance Uruguay, Monaco, Bermuda, etc.) or lacked full data. In this study, we rely on a very simple rationale. At a given point in time, the cumulative number of deaths should be a proportion of the cumulative number of infections at some point in the past. But how many days in the past? The answer lies in the sum of the number of days of incubation and the number of days between the onset of symptoms and death. This rationale follows a report focusing on the countries most affected by the pandemic in the world ( ). However, in this paper we deviated from the mentioned report by using the key parameters in a different way, which translated into a different estimation of the number of infected individuals. On average, deaths occur ∼ days ( . days with % credible interval [CrI] . – . ) after the onset of COVID-19 symptoms ( ), while the incubation period of COVID-19 has been estimated at about days ( . days with % CI . – . ), as reported in Lauer et al. ( ). Thus, by comparing the cumulative number of deaths at time t in country i (CDeaths(i,t)) with the country-specific infection fatality ratio (IFR_i), which is assumed constant over time, it is possible to obtain a rough approximation of the cumulative number of SARS-CoV-2 infections days ( days + days) in the past (CInfected(i,t− )).
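The back-calculation described above can be sketched in a few lines. This is a minimal sketch, not the paper's exact computation: the lag components and the IFR below are illustrative placeholder values (the passage's exact figures are not recoverable from the text), and the function names are my own.

```python
# Sketch of the back-calculation described above: the cumulative number of
# SARS-CoV-2 infections `LAG` days in the past is approximated by dividing
# today's cumulative deaths by the country-specific IFR. The lag components
# and IFR below are illustrative placeholders, not the paper's figures.

INCUBATION_DAYS = 5        # assumed median incubation period (Lauer et al.-style)
ONSET_TO_DEATH_DAYS = 18   # assumed mean days from symptom onset to death
LAG = INCUBATION_DAYS + ONSET_TO_DEATH_DAYS  # total infection-to-death lag

def estimate_cumulative_infections(cum_deaths_t: float, ifr: float) -> float:
    """CInfected(t - LAG) ≈ CDeaths(t) / IFR, with IFR as a fraction."""
    return cum_deaths_t / ifr

def detection_rate(cum_confirmed_lagged: float, cum_infections_lagged: float) -> float:
    """Cumulative detection rate at t - LAG: confirmed / estimated infections."""
    return cum_confirmed_lagged / cum_infections_lagged

# Hypothetical example: 10,000 cumulative deaths and an IFR of 0.7%
infections = estimate_cumulative_infections(10_000, 0.007)  # ≈ 1,428,571 infections
rate = detection_rate(200_000, infections)                  # = 0.14 (14% detected)
```

The point of the sketch is the division by the IFR and the backward dating by the infection-to-death lag; swapping in other lag or IFR values changes only the constants.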
https://data.worldbank.org/indicator/ (accessed April , ). https://apps.who.int/gho/data/node.main.a ?lang=en (accessed April , ). Differently from Bommer and Vollmer ( ), we include the incubation period while avoiding subtracting the number of days between the onset of symptoms and detection from the relevant lag period. These differences explain the discrepancies between both sets of estimates. Moreover, by combining the cumulative distribution function of the SARS-CoV-2 incubation period, as reported in Lauer et al. ( ), with an approximation of the gamma distribution (with correction for epidemic growth) of the days between the onset of symptoms and death, as reported in Verity et al. ( ), one can calculate a vector of probabilities to weight the cumulative number of deaths required in equation . The weighting vector goes from t − (representing the proportion of deaths of those who experienced day between infection and the onset of symptoms, plus one day from the onset of symptoms to death) to t − (representing the proportion of deaths of those who experienced days between infection and the onset of symptoms and days between the onset of symptoms and death). The smoothed approach produces an almost identical estimation of the cumulative number of infected individuals. Given that, and for the sake of simplicity, we prefer to use the non-smoothed approach. Additionally, we use the ratio between the cumulative number of confirmed (detected) cases at time t − in country i (CConfirmed(i,t− )) and the cumulative number of infected individuals (CInfected(i,t− )) at time t − in country i as a rough measure of the cumulative rate of detection of SARS-CoV-2 infections at time t − . In order to estimate the country-specific infection fatality ratio for country i used in equation , we weight the age-stratified infection fatality ratios reported in Verity et al. ( ) by the age-group population shares of country i.
The calculation of the age-stratified infection fatality ratios relies on two assumptions that can be modified when producing point estimates of the number of individuals affected by a SARS-CoV-2 infection. Firstly, it assumes that there are no cross-country differences in the average overall health status of the population, in comorbidity, or in the soundness of the different health systems. In the absence of standardized country-specific information on these variables, this assumption is convenient, although at first sight it may appear restrictive. However, it is quite the opposite, since in richer countries with higher proportions of elderly populations the estimated infection fatality ratios are likely to be overestimated. If so, our estimates of the infected population represent a lower limit of the true number of infections. The second assumption is that the attack rate of the coronavirus is unrelated to the age and sex of susceptible individuals. This is in concordance with the evidence on respiratory infections in previous pandemic processes ( , ). Then, the distribution of IFRs across countries reflects the "fixed" lethality of the virus combined with the varying demographic structure of populations across the world. Figure presents the calculated infection fatality ratios for the world and for the countries in which the lethality of the pandemic has been most significant. Recently, a cross-sectional epidemiological study of a super-spreading event in the county of Heinsberg in Germany offered the opportunity to estimate the infection fatality ratio in the community ( ). The estimated infection fatality ratio was . %. Although this number is surprisingly low when compared with other estimates, for instance the one used in this study for Germany ( . %), it is not evident that the true infection fatality ratio is closer to . % than to . %.
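The population-weighted IFR construction described above amounts to a dot product of age-band IFRs with a country's age-group population shares. The sketch below uses illustrative age bands and IFR values (Verity et al.-style orders of magnitude, not the published figures), and the "young"/"old" population structures are hypothetical.

```python
# Sketch of the country-specific IFR construction: a population-weighted
# average of age-stratified IFRs. Age bands and IFR values are illustrative
# placeholders, not the published estimates.

AGE_STRATIFIED_IFR = {   # assumed IFR per age band, as fractions
    "0-39": 0.0002,
    "40-59": 0.002,
    "60-79": 0.02,
    "80+": 0.08,
}

def country_ifr(pop_shares: dict) -> float:
    """Weight age-band IFRs by the country's age-group population shares.

    `pop_shares` must sum to 1. Attack rates are assumed identical across
    age and sex (as stated in the text), so shares need no adjustment.
    """
    assert abs(sum(pop_shares.values()) - 1.0) < 1e-9
    return sum(AGE_STRATIFIED_IFR[band] * share
               for band, share in pop_shares.items())

# Hypothetical "young" vs. "old" population structures
young = {"0-39": 0.65, "40-59": 0.25, "60-79": 0.08, "80+": 0.02}
old = {"0-39": 0.40, "40-59": 0.30, "60-79": 0.24, "80+": 0.06}
# An older population mechanically yields a higher country-specific IFR,
# which is the demographic effect the text describes.
```

This makes the first assumption concrete: the per-band IFRs are held fixed across countries, and only the demographic weights vary.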
This is because there can be local factors that explain the discrepancy, as pointed out in the Heinsberg study. Amongst these factors, one might mention comorbidity gaps, ethnic differences, the quality and coverage of the health systems, climatic differences, immunization levels, etc. Consequently, it might be necessary to assess the consequences of using an overestimated infection fatality ratio (that is, an IFR closer to the one reported in the Heinsberg study, or to others inferred from seroprevalence data) ( ). The answer is that the number of infections would be underestimated and that detection rates would be overestimated (since the infection fatality ratio is in the denominator). An overestimation of the detection rates reduces the validity of international rankings based on this figure. However, from the public health point of view, this would be irrelevant since, as discussed later, all countries should increase their detection rates of SARS-CoV-2 infections as much as possible. To investigate whether improving the detection rates of SARS-CoV-2 infections is potentially associated with saving lives, we use a parsimonious synchronic cross-country multiple linear regression. That is, we use the information reported , , and days after the confirmation of the first SARS-CoV-2 infections, which corresponds to the pandemic outbreak (PO). At a given pandemic phase, we regress the natural logarithm of the cumulative number of deaths in country i, ln(deaths_i), on the estimated detection rate (DR_i) and its square, to assess whether there is a non-linear relationship in this conditional correlation. The four parsimonious regressions include a demographic control corresponding to the estimated country-specific infection fatality ratio (IFR_i). This is a non-endogenous control since it only captures the impact of demography (population shares by age group) on the number of deaths, and not the reverse.
The regressions control for the population size of country i in natural logarithmic form, ln(pop_i). This control is necessary because the share of the susceptible population remains persistently at relatively higher levels in more populated countries compared with less populated ones. We also include the natural logarithm of the number of confirmed SARS-CoV-2 infections in each country, ln(confirmed_i). This is a measure of the persistence of the mortality process while controlling for cross-country differences in absolute testing performance. (The synchronic estimation takes as its reference period the number of days since the pandemic outbreak; on the contrary, a non-synchronic estimation neglects the pandemic phases and takes the calendar day as the reference period. Output tables without the square of the detection rates are available in the Supplementary Material.) The regressions also control for the economic performance of a country by means of the natural logarithm of per capita gross domestic product, ln(gdppc_i), measured in constant international dollars with the same purchasing power. We also include current health expenditure as a share of GDP (healthshare_i). This control is needed to account for relative resource-dependent differences in the coverage and quality of health systems around the globe. Finally, we use available data (https://apps.who.int/gho/data/node.main.a ?lang=en, accessed April , ) to explore a possible association between BCG vaccination and aggravated cases of COVID-19 and deaths [a relationship which is being investigated in some clinical trials ( )]. The evidence is still inconclusive because of the argued existence of uncontrolled confounders ( ) ( ) ( ) ( ) ( ). However, if these confounders exist, they can bias the relationship between SARS-CoV-2 detection rates and the cumulative number of deaths. Based on this argument, we include a set of dummies capturing the degree of BCG vaccination coverage as follows: BCG group : no mandatory vaccination (up to . % coverage), BCG group : to . % coverage, BCG group : to . %, BCG group : to . %, and BCG group : to %. The reference category is BCG group . An alternative approach is used to indirectly investigate the conditional association between detection rates and SARS-CoV-2-related deaths. Instead of using the detection rate and its square, we use the natural logarithm of the estimated number of infections, ln(infections_i), while dropping from the equation the natural logarithm of the number of confirmed (detected) SARS-CoV-2 infections. Regarding statistical inference, significance tests rely on a heteroscedasticity-consistent covariance matrix (HCCM) of type HC , which is suitable when the number of observations is small ( ). Although, in the presence of heteroscedasticity of unknown form, ordinary least squares estimates are unbiased, the inference can be misleading because the usual tests of significance are generally inappropriate ( ). Additionally, we estimate the same set of equations (the main specification and the robustness specification , , and days after the pandemic outbreak) using robust regressions. We do this out of concern that parameter estimates may be biased if, in some countries (outliers), the reported cumulative number of deaths has been involuntarily altered or even manipulated. Robust regression resists the effect of such outliers, providing better efficiency than OLS when heavy-tailed error distributions exist, as is likely the case here ( ). On July , the estimated infected population reached about million individuals (Figure a). This number is about times larger than the reported number of confirmed cases (about . million, represented by the dashed line). Note that the number of infections is estimated based on detection rates calculated days in the past.
Thus, for the period t − to t, the number of SARS-CoV-2-infected individuals is estimated using the detection rate as of t − . Therefore, the estimation of SARS-CoV-2-infected individuals can be biased if detection rates deteriorate or improve considerably within this time span. The accuracy of our estimates can be assessed by contrasting them against those provided by population-based seroepidemiological studies. There are some studies of this type focusing on restricted geographical areas, for instance in Germany and Switzerland ( , ). However, to the best of our knowledge, there is only one country-level, large-scale population-based seroepidemiological study, performed in Spain ( ). The ENE-COVID study in Spain finds that, in May, % of the population would test IgG positive against SARS-CoV-2. This implies that about . million individuals had been infected by SARS-CoV-2. Similarly, in our study we estimated for May an infected population of about . million individuals. This evidence suggests that our method can be a suitable alternative when population-based seroepidemiological studies are not available, which is frequently the case. Here, it is important to recognize that, from the public health point of view, it is preferable to be vaguely right than precisely wrong. In May, Spain had confirmed only , cases (about % of all estimated infections). At that time, it would have been convenient for public health authorities and the public to have known that, for each confirmed case, there were many more individuals spreading the infection in unpredictable ways. Back to the global estimates: by comparing the cumulative number of estimated infections with the cumulative number of confirmed (detected) cases, we obtain, at the end of June, a global detection rate of about % (Figure b). The global detection rate curve shows a U-shape, with a minimum at the beginning of the third week of March reaching only . %.
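The seroprevalence cross-check described above is a simple benchmark comparison: a survey's IgG prevalence times the population gives an implied number of infections, against which the model estimate can be gauged. All numbers below are illustrative placeholders (a Spain-like 5% prevalence and 47 million inhabitants), not the study's exact figures.

```python
# Sketch of the seroprevalence cross-check: a population-based survey
# implies a number of infections that the model estimate is compared
# against. All values are illustrative placeholders.

def implied_infections(seroprevalence: float, population: int) -> float:
    """Infections implied by an IgG seroprevalence survey."""
    return seroprevalence * population

def relative_gap(model_estimate: float, survey_implied: float) -> float:
    """Relative difference between the model and the survey benchmark."""
    return abs(model_estimate - survey_implied) / survey_implied

# Hypothetical Spain-like example: 5% seroprevalence, 47 million inhabitants
benchmark = implied_infections(0.05, 47_000_000)  # 2,350,000 implied infections
gap = relative_gap(2_400_000, benchmark)          # model vs. survey, ~2% apart
```

A small relative gap is what the text reads as supporting the method's suitability when no survey is available.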
The latest data suggest that detection rates are steadily increasing. Moreover, the semi-logarithmic plot in Figure a suggests that the infection stopped spreading at its maximum pace approximately during the third week of March but, unfortunately, increased its speed again around the last week of June. The world distributions of the number of deaths, the estimated number of SARS-CoV-2 infections, and the detection rates of SARS-CoV-2 infections are displayed in Figures – , respectively. Since the global estimates are no more than an aggregation of the trajectories of the different countries in the world, we investigate how heterogeneous detection rates are across countries. Table presents this information in a synchronic way. The rankings compare countries in the same phase of their respective pandemic processes, that is, , , , , , and days after the confirmation of the first SARS-CoV-2 infections (the pandemic outbreak). This approach allows us to perform such an international comparison. At first sight, it is noteworthy that none of the first countries ranked at the top by the initial detection rate ( days after the beginning of the pandemic outbreak) accumulated more than deaths days after initiating their pandemic processes. Thus, there seems to be a strong correlation between detection rates and the cumulative number of deaths for a given stage of the pandemic process. Countries with high death counts ranked very badly in their initial detection rates. For example, the US, Spain, Italy, the UK, France, and Belgium ranked in places , , , , , and out of the countries listed in the ranking. A second conclusion is that the relative improvement of detection rates over time, that is, , , , , and days after the beginning of the pandemic processes, does not alter the fact that those countries are still ranked the worst in terms of deaths.
That is, improving detection over time has declining returns to scale when it comes to saving lives. The depicted relationship between detection rates and the cumulative number of deaths remains almost unchanged when using non-synchronic data as of May, in Table . This table mixes information from countries at different stages of their pandemic processes, so it must be interpreted with caution. Although efforts to increase detection have been significant, in Table we present the non-synchronic ranking as of June. The US is in place , Spain , Italy , Belgium , the UK , and France . It is noteworthy that, except for Russia, none of the first countries in this ranking had accumulated more than , fatalities by June. More importantly, and despite the incredible efforts to increase testing amongst the more developed countries, none of them was able to detect more than % of the estimated infections (the US detected . % in June). This implies that testing efforts need to be deployed at the first stages of the pandemic process, given its cumulative nature. [Table and figure captions (Frontiers in Public Health | www.frontiersin.org): countries are ranked by the detection rates of SARS-CoV-2 infections as of May and as of June; panel (b) contains all countries (in Table a) whose pandemic processes have lasted more than days since the PO; the dashed fitted line excludes South Korea (KR). Source: own elaboration.] The figures show that moving over time from relatively low to relatively high cumulative detection rates is unlikely and probably very expensive. This is due to the over-proportional effort needed to expand testing relative to the exponentially growing infections at the early stages of the pandemic.
Consequently, from the public health point of view, it is much more advantageous, and technically and economically feasible, to implement mass testing from the very beginning of the pandemic process. To achieve this goal, health authorities and governments need to understand the linkages between cumulative detection rates and the minimization of pandemic-related fatalities and economic damage. In this analysis, we show the unconditional relationship between detection rates and deaths. The fitted lines in Figure are obtained by regressing the natural logarithm of the cumulative number of deaths in country i on the estimated cumulative detection rates (DR_i). The results strongly suggest a negative relationship between detection rates and the cumulative number of deaths. This strong negative slope is in concordance with the hypothesis that, by detecting a higher proportion of the SARS-CoV-2-infected population, many lives can be saved, in particular the lives of the elderly and of individuals with preexisting conditions. The strong association between the number of deaths and the estimated cumulative detection rates remains significant and days after the PO. These associations are shown in Figures a,b, respectively. Figure shows the relationship between detection rates ( and days after the PO) and deaths days after the PO. This descriptive result is of interest since it suggests that, unconditionally, early detection is associated with death outcomes days after the PO to a greater extent than the contemporary detection rates, that is, days after the PO. Although this information suggests the existence of a strong relationship between detection rates and the cumulative number of deaths, this slope may be confounded by the variables mentioned before. Thus, in the next section, we show the results of our conditional analysis, as described earlier.
Our results in Table show that higher detection rates are associated with a reduction in the number of deaths after controlling for demography (age structure of the population and population size), economic performance (GDP per capita), and the relative resources that economies devote to their health systems. Over time, the cross-sectional regressions increase in explanatory power, from an R-squared of . in model to . in model . Based on these results, Figure shows a strong conditional gradient between detection rates and the cumulative number of deaths. For instance, for a hypothetical country with average and constant endowments, the cost in terms of deaths of detecting % vs. % is about . natural-logarithm points, which corresponds to exp( . ) = . . That is, the average country detecting % is associated with a number of deaths about . times higher when compared with the same country detecting % of all SARS-CoV-2 infections. To put this result in perspective, let us simulate what the number of deaths in the US would be if, instead of detecting . % days after the pandemic outbreak, the country had detected with the same intensity as South Korea ( . %). Evaluating the number of deaths at the endowments of the US, the country would have fewer deaths by . natural-logarithm points. This means that current US deaths are . times higher than they would be if the country had tested with a similar intensity to South Korea. Since the number of deaths days after the pandemic outbreak reached , , detecting at the rate of South Korea would have saved about , lives in the US at that time. Finally, looking at the regression coefficients in Table , it is noteworthy that, during the pandemic outbreak, a % higher detection rate is associated with more lives saved than a % increase in health expenditure over GDP.
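The arithmetic in the paragraph above converts gaps measured in natural-log points into multiplicative factors via the exponential. The sketch below makes that conversion explicit; the 1.3- and 1.1-log-point gaps and the 120,000-death figure are hypothetical values for illustration, not the paper's estimates.

```python
import math

# The text converts differences in ln(deaths) into multiplicative factors:
# a gap of d natural-log points corresponds to a factor exp(d).
# The numeric inputs below are hypothetical, not the paper's estimates.

def log_points_to_factor(delta_ln: float) -> float:
    """Multiplicative change in deaths implied by a gap in ln(deaths)."""
    return math.exp(delta_ln)

factor = log_points_to_factor(1.3)   # ≈ 3.67x more deaths

def avoidable_deaths(observed_deaths: float, delta_ln: float) -> float:
    """Deaths avoided had the counterfactual (lower) level applied."""
    return observed_deaths - observed_deaths / math.exp(delta_ln)

# Hypothetical: 120,000 observed deaths and a 1.1-log-point counterfactual gap
saved = avoidable_deaths(120_000, 1.1)   # roughly two-thirds of the deaths
```

The same two functions reproduce both computations in the text: the "times higher" comparison between detection levels and the counterfactual count of avoidable deaths.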
Our results also suggest that the number of deaths, rather than depending on the relative solvency of the health system, could depend to a greater extent on the size and timing of testing efforts. The conclusion is: the more tests, the better. Although in this study we employed an economics-inspired approach to assess the importance of testing, our findings are also endorsed by recent medical literature on the coronavirus, as well as by other economics-inspired models providing support for a causal relationship between detection and saving lives ( ) ( ) ( ) ( ). Robust regressions provide estimates that are close to the ones reported in Table ; consequently, it is unlikely that the results reported in this study are outlier-driven. Additionally, the results are robust to heteroscedasticity of unknown form in small samples. Nevertheless, the results should be interpreted with caution. The few observations available for the regressions and the lack of data do not allow us to rule out the possibility of omitted variables that have the potential to bias the results. It is important to keep in mind that results can be biased if an omitted-variable problem exists, that is, if there are variables that are correlated with the explained outcome but, at the same time, also correlated with the explanatory variables of interest. For instance, one can think of countries implementing lockdowns because of lower detection rates (Argentina), or relaxing social distancing rules because of higher detection rates (Australia). Nevertheless, these non-observed variables lead to an underestimation of the true association between detection rates and the cumulative number of deaths. Thus, detection matters. [Table notes: standard errors in parentheses; significance levels: ***p < . , **p < . , *p < . ; source: own elaboration.]
In this study, we have proposed a method to estimate the number of SARS-CoV-2 infections for the globe and for all major countries, covering more than % of the world population. As of June, we find that, worldwide, about million individuals have been infected by SARS-CoV-2. Moreover, only about out of these infections have been detected. We find that detection rates are very unequally distributed across the globe and that they have also increased over time, from about % during the second and third weeks of March to about % in June. In an information context in which population-based seroepidemiological studies are not available, this study offers a parsimonious alternative for estimating the number of SARS-CoV-2-infected individuals. By comparing our estimates with those provided by the ENE-COVID study in Spain, we confirm the utility of our approach, keeping in mind that, from the public health point of view, it is preferable to be vaguely right than precisely wrong. In order to provide reliable estimates of the number of SARS-CoV-2 infections and of the cumulative detection rates, it is necessary that governments provide real-time information about the number of COVID-19 deaths. This study supports the view that accurate communication of the fatality cases can have consequences for the development of the pandemic itself. Thus, it is also a call to allow international comparison following WHO international norms and standards for medical certificates of COVID-19 cause of death and International Classification of Diseases (ICD) mortality coding. Additionally, in our empirical analysis, we have presented parsimonious evidence that higher detection rates are associated with saving lives. Our conditional analysis shows, for example, that if the US had had the same detection rate trajectory as South Korea, about two-thirds of its reported deaths could have been avoided (about , lives).
We find that detection rates at the very early stages of the pandemic seem to explain the great divergence in deaths between countries. Moreover, we have shown evidence that moving from relatively low to high cumulative detection rates (and thus saving lives) is unlikely and difficult. This is probably due to the high level of effort needed to expand testing relative to the exponentially growing infections at the early and middle stages of the pandemic. Thus, from the public health point of view, it is better to deploy testing efforts at the first stages of the pandemic process. Doing so would be much more advantageous in terms of saved lives, but it would also be technically and economically feasible. Already, many developed countries with well-developed health sectors were unable to avoid unnecessary deaths through their inaction in promoting mass testing to counter the pandemic outbreak at early stages. To achieve the goal of implementing mass testing from the very beginning of the pandemic outbreak, governments need to understand the consequences of not doing so. Thus, the evidence presented in this paper offers a rigorous macro-level linkage between detection rates and the cumulative number of deaths, which may be useful in future pandemics. This evidence also supports the implementation of mass testing in the likely coming secondary pandemic outbreaks (so-called second waves). Further research should be devoted to understanding why the detection capacity in many advanced countries was so weak, so late, and also so weakly correlated (if correlated at all) with income levels. In this paper, we claim that governments have incentives against testing because public opinion tends to react primarily to reports of the cumulative and marginal numbers of detected (reported) cases.
the contradiction is that something good, such as an increase in government testing efforts, can be perceived by the general public as something negative (because of the resulting increase in detections). in consequence, are low detection rates in developed countries simply a management failure, or are there long-run incentives that promoted this behavior among many rich countries? it is clear that during the ongoing pandemic improving detection rates is a race against time, but are there institutional and/or technological constraints that hamper detection improvements that could save lives? all these questions are relevant for this and future pandemics. this study claims that all countries in the world should be able to respond to a pandemic outbreak with massive testing in the very short run. this would be an efficient approach, since it is also likely that higher detection rates are associated with a lesser impact of the pandemic on the economy.

data availability statement: the raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

author contributions: cv conceived this research, performed the background work, collected the data, performed all statistical analyses, and wrote the paper.

references:
- substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov )
- inferring the number of covid- cases from recently reported deaths
- fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the sars-cov- epidemic
- estimating the number of sars-cov- infections in the united states
- estimating the effects of non-pharmaceutical interventions on covid- in europe
- salivary glands: potential reservoirs for covid- asymptomatic infection
- sars-cov- viral load in upper respiratory specimens of infected patients
- investigating the impact of asymptomatic carriers on covid- transmission. medrxiv
- covid- : identifying and isolating asymptomatic people helped eliminate virus in italian village
- presumed asymptomatic carrier transmission of covid-
- clinical characteristics of asymptomatic infections with covid- screened among close contacts in nanjing
- characteristics of covid- infection in beijing
- asymptomatic novel coronavirus pneumonia patient outside wuhan: the value of ct images in the course of the disease
- asymptomatic carrier state, acute respiratory disease, and pneumonia due to severe acute respiratory syndrome coronavirus (sars-cov- ): facts and myths
- gender differences in patients with covid- : focus on severity and mortality
- sars-cov- : virus dynamics and host response
- the origin, transmission and clinical therapies on coronavirus disease (covid- ) outbreak: an update on the status
- clinical characteristics of coronavirus disease in china
- review of the clinical characteristics of coronavirus disease (covid- )
- practical recommendations for critical care and anesthesiology teams caring for novel coronavirus ( -ncov) patients
- covid- : protecting health-care workers
- potential biases in estimating absolute and relative case-fatality risks during outbreaks
- assessing the severity of the novel influenza a/h n pandemic
- epidemiological determinants of spread of causal agent of severe acute respiratory syndrome in hong kong
- estimates of the severity of coronavirus disease : a model-based analysis
- real-time estimation of the risk of death from novel coronavirus (covid- ) infection: inference using exported cases
- real-time tentative assessment of the epidemiological characteristics of novel coronavirus infections in wuhan, china
- medical certification, icd mortality coding, and reporting mortality associated with covid-
- a systematic review and meta-analysis of published research data on covid- infection-fatality rates
- the incubation period of coronavirus disease (covid- ) from publicly reported confirmed cases: estimation and application
- average detection rate of sars-cov- infections is estimated around nine percent
- household secondary attack rate of covid- and associated determinants
- infection fatality rate of sars-cov- infection in a german community with a super-spreading event
- estimation of sars-cov- infection fatality rate by real-time antibody screening of blood donors. medrxiv
- bacille calmette-guérin (bcg) vaccination and covid-
- bcg vaccines may not reduce covid- mortality rates
- time-adjusted analysis shows weak associations between bcg vaccination policy and covid- disease progression
- significantly improved covid- outcomes in countries with higher bcg vaccination coverage: a multivariable analysis
- covid- related mortality: is the bcg vaccine truly effective?
- using heteroscedasticity consistent standard errors in the linear regression model
- how robust is robust regression
- seroprevalence of anti-sars-cov- igg antibodies in
- prevalence of sars-cov- in spain (ene-covid): a nationwide, population-based seroepidemiological study
- asymptomatic transmission during the covid- pandemic and implications for public health strategies
- prevalence of sars-cov- infection in residents of a large homeless shelter in boston
- covid- epidemic in switzerland: on the importance of testing, contact tracing and isolation
- an economic model of the covid- epidemic: the importance of testing and age-specific policies. crc tr discussion paper series crctr _ _

acknowledgments: the author would like to thank and acknowledge dr. carlos chavez for comments on a very early version of this paper. i would also like to thank m.sc. andrea torres for their comments on the implications of this research. the author would also like to recognize the suggestions and comments provided by the participants at the doctoral seminar at facultad de economía y negocios, universidad de talca.
the author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. copyright © villalobos. this is an open-access article distributed under the terms of the creative commons attribution license (cc by). the use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. no use, distribution or reproduction is permitted which does not comply with these terms. key: cord- -kwytbhe authors: djurović, igor title: epidemiological control measures and predicted number of infections for sars-cov- pandemic: case study serbia march-april date: - - journal: heliyon doi: . /j.heliyon. .e sha: doc_id: cord_uid: kwytbhe background: in this paper, we study the response of the serbian government and health authorities to the sars-cov- pandemic in the early stage of the local outbreak, between mar. (th) and apr. (th), using predictive numerical models. such a study should be helpful for assessing the effectiveness of the measures conducted to suppress the pandemic at a local scale. methods: we extrapolated the number of sars-cov- infections from the first stable set of data by exploiting exponential growth (linear in logarithmic scale). based on the obtained coefficients, a prediction of the number of cases until the end of march was performed. after the initial exponential growth, we changed the predictive model to the generalized gamma function. the obtained results are compared with the actual number of infections, and a prediction for the remainder of the outbreak is given. findings: we found that the daily growth rate was above . % at the beginning of the period and increased slightly after the introduction of the state of emergency and the first set of strict epidemic control measures.
it took about days after the first set of strict measures to smooth the daily growth. it seems that the early government measures had only a moderate impact on reducing growth, due to the social behavior of citizens and the influx of diaspora returning to serbia from highly affected areas; i.e., the exponential growth of infected persons continued but with a reduced slope of about - %. in any case, it is demonstrated that the period required for any measure to take effect is up to days after its introduction, first as exponential growth with a smaller rate and then as a smoother function keeping the number of infected persons below the exponential growth rate. conclusions: the obtained results are consistent with findings from other countries, i.e., initial exponential growth slows down within the presumed incubation period of weeks after adopting lockdown and other non-pharmaceutical epidemiological measures. however, it is also shown that exponential growth can continue after this period with a smaller slope. therefore, quarantine and other social distancing measures should be adopted as soon as possible in the case of any similar outbreak, since the alternative means a prolonged epidemic and growing costs in human life, pressure on the health system, the economy, etc. for modeling the remainder of the outbreak, the generalized gamma function is used, showing accurate results but requiring more samples and pre-processing (data filtering) compared with the exponential part of the outbreak. we have estimated the number of infected persons for the remaining part of the outbreak, until the end of june. the second decade of the xxi century repeats the second decade of the xx century, with a global pandemic causing significant loss of human life all over the world, consequences for patients, a forced change of lifestyle for a large portion of the human population, economic losses, and many other side effects.
the current sars-cov- pandemic had its outbreak at the end of in the chinese city of wuhan (hubei province), with million citizens [ ] . from jan. rd, the chinese government established strict quarantine rules over the city of wuhan and some neighboring areas. in total, more than million citizens in hubei and neighboring regions were under strict measures [ ] . in addition, the entire country was affected by different epidemic and social distancing measures. the major part of the measures was applied on jan. rd, resulting in a peak number of novel cases on feb. th . statistics are announced daily by the national health commission of the people's republic of china [ ] . note that on feb. th china revised its previous estimates of the number of cases, giving a significant increase for that date. from our point of view, this is a statistical adjustment only, and we can still assume that the maximal number of novel cases occurred on feb. th , i.e., days after adopting strict quarantine and epidemiological measures. within this interval of days (a little shorter than the presumed incubation period of days), the average increase in novel cases was . %, while the average period required for doubling the number of patients was less than . days. since the chinese measures were strict but effective, curbing one of the largest infectious disease outbreaks in history within approximately two months, they can be considered state of the art for the sars-cov- pandemic. therefore, all other epidemic control policies should be compared with the chinese efforts, to gain insight into whether it is possible to achieve similar or even better results with measures less disruptive to everyday life. in this paper, we consider predicting the number of patients in serbia based on exponential and generalized gamma models, and estimating the time at which local epidemic control measures begin to smooth the growth curve and suppress the exponential outbreak.
as can be expected, predicting the second part of the outbreak by the generalized gamma function is less accurate compared with the first part, modeled by exponential growth. nevertheless, the generalized gamma function model is also useful, giving a good way to estimate the effect of epidemic control measures on the process dynamics. in this section, we give basic background information on serbia relevant to the study, i.e., information on demographics (including the diaspora and the aging population) and travel connections. section ii.b gives basic information on the sars-cov- spread in serbia, with the timeline of the key epidemic control measures conducted to contain the outbreak. detailed information on the epidemic control policies performed is given in the appendix. serbia is a landlocked country on the balkan peninsula in southeast europe with an estimated . million citizens, excluding kosovo* (un resolution ). almost ¼ of the population inhabits the greater belgrade area, while about % is urban population. serbia has borders with four members of the european union (hungary, croatia, bulgaria, and romania) and four non-members (west balkan countries): north macedonia, montenegro, bosnia and herzegovina, and albania. the main international roads are a and a, connecting west and central europe with southeast europe and further toward asia minor. there are three international airports, the main one being nikola tesla airport, belgrade, serving as a hub of air serbia, a member of the etihad group. it is also a hub for the aviolet company, with seasonal charters, and wizzair. this airport served about . million passengers in and more than aircraft. there were direct connections from belgrade to countries on four continents as of jan. th , ( figure ). konstantin the great airport in niš is significantly smaller, serving flights to only a dozen destinations with low flight frequency. serbia has a large diaspora of several million persons.
a census of world countries counted more than million serbs in the diaspora, but some estimates are between and million persons of serbian origin worldwide. also, more than million serbs live in the neighboring countries and slovenia. many international companies operate in serbia, the most notable being chinese, but direct spread of sars-cov- from china to serbia was not detected. the serbian population is among the oldest in the world, with an average age of . years and a life expectancy of . years. according to the serbian statistical office, more than % of citizens are older than , a category vulnerable to sars-cov- . here, only the main events and the timeline of the main epidemic control measures are listed, while a more detailed description is given in the appendix. on jan. th, the serbian authorities established measures preventing the spread of sars-cov- , following the same strategy previously used in the outbreaks of sars, ebola, mers, avian flu, swine flu, etc. this was mainly related to the control of passengers coming from china and the installation of thermal cameras at some gates of the belgrade airport (for details see the appendix). the number of infections is announced by the ministry of health and the institute for public health "dr. milan jovanović batut" (web page covid- .rs). it was announced twice a day, at am and pm, until mar. th, and after that only once per day, at pm. we have used these data as official (ground truth), reflecting the situation with infections in the country. the number of tests conducted daily increased gradually. before mar. rd, testing was sporadic, based on symptoms, arrivals from certain geographical areas, tracking contacts of existing cases, etc. in total, there were only tests until mar. rd . however, the number increased as more cases emerged, and within the next seven days the total number of tests increased to on mar. th . in the second half of april, the number of tests increased heavily, so that in may there were more than tests per day.
the first person infected by sars-cov- in serbia was announced on mar. th, when the majority of neighboring countries already had infected persons. three days later the virus was confirmed in another person, and after that the number of cases started to increase gradually ( figure ). the first significant growth in the number of infections was on mar. th, with an increase from to cases within hours. on mar. , a serious event aggravating the epidemiological situation was the return of persons from the diaspora, mainly from west european countries, some of which had an extremely difficult outbreak at that time. according to the authorities, within the hours before the state of emergency alone more than persons returned to serbia, and in march more than citizens came back, ranging from students at foreign universities to persons with temporary status in foreign countries. such social dynamics significantly worsened the epidemiological state in serbia, especially before the establishment of mandatory self-isolation or quarantine for days for all arrivals. the semi-logarithmic plot of discovered infections in serbia is shown in fig. (b) . this is the usual behavior of many models of infectious disease outbreaks [ ] . after a small number in the first several days, the number of cases grew from to within days, from mar. th to mar. th . to compare the local data with other countries, consider figure from [ ] (figure ), where the number of infections is compared in a semilogarithmic scale against the daily growth in the number of infections, from detected cases up. it can be seen that the daily percentage increase goes even above % for some countries. to explain the importance of the daily infection growth, recall a study published early, even before the world health organization declared the sars-cov- pandemic, showing that the expected period for doubling the number of cases was between . - . (average . days) [ ] . it means that the expected daily increase in the number of cases is between .
- . % (consensus . %). first comments on these findings were not favorable, since the majority of readers assumed that the doubling period would be significantly longer. however, for a daily growth of %, the doubling period is only . days; for example, spain experienced a daily increase of infections at an early stage of the epidemic (the early stage, in this case, being only several days before growing to thousands of cases and hundreds of deaths) of more than . %, meaning a doubling period of only . days! this fact, together with the severity of the health conditions of many patients requiring icu treatment and assisted ventilation, is the reason why this infection was the most serious health challenge within a lifespan. note that at some stage in the usa the doubling period was only . days. prediction is performed only for the number of infections. the number of fatalities is significantly smaller (about . % of the number of infections). also, the response of the number of infections to control measures is faster than in the case of other potential indicators. the first several samples (between the first infected person found on mar. th and the beginning of clearly visible exponential growth on mar. th ) are avoided. the term sample is used for the number of infections at a given instant (date). the initial prediction is performed using only five samples: two on mar. th , two on mar. th , and one on mar. th at am. within this period the number of detected infections grew from to , with an average daily increase of . %, which can be considered a moderate rate for the early stage of this infection compared with other countries. we decided to update the prediction whenever three samples in a row were above or below the obtained number of infected persons (ground truth according to the serbian authorities) in march. in total, we revised the exponential model four times according to the above-described rule in march. table i and figure give the true samples and the obtained predictions.
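the exponential-phase estimation described above amounts to an ordinary least-squares fit of log10(cases) against the day index; the daily growth rate then follows from the slope, and the doubling period from the growth rate via log(2)/log(1 + growth). this is a minimal sketch of that procedure with hypothetical early-outbreak counts, not the official serbian data.

```python
import math

# Hypothetical early-outbreak counts: roughly exponential growth.
days = [0, 1, 2, 3, 4]
cases = [10, 14, 19, 27, 38]

# Least-squares fit of log10(cases) vs. day index (linear in log scale).
logs = [math.log10(c) for c in cases]
n = len(days)
mx, my = sum(days) / n, sum(logs) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(days, logs))
         / sum((x - mx) ** 2 for x in days))

daily_growth = 10 ** slope - 1                      # fractional daily increase
doubling_time = math.log(2) / math.log(1 + daily_growth)

print(f"daily growth: {daily_growth:.1%}, doubling time: {doubling_time:.1f} days")
```

with only a handful of samples this fit is already stable when growth is genuinely exponential, which is consistent with the paper's observation that five samples suffice in the first phase.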
it can be seen that all predictions are in a rather narrow range, between . % and . %, for the first days. however, as can be seen from figure , on mar. th or th the tide slowed down and the initial exponential growth was replaced by a more favorable function. then, on mar. st at am, instead of the predicted between and infected persons, there were fortunately only . this is similar to other countries [ ]: lockdown and other quarantine measures have shown the ability to smooth exponential growth within the presumed incubation period of days. the average growth in the exponential phase mar. th -mar. th was . %. it means that all estimates of average daily growth performed within this period are consistent (see table i ). the parameter log( +grow%) in table i represents the slope of the interpolated linear function in logarithmic scale, accompanied by the standard deviation of the estimate. it can be seen that there is no statistical difference between the obtained results, i.e., that even a small number of available samples in the exponential growth phase can be used for reliable prediction. therefore, it is important to stress that prediction of the exponential growth phase without effective epidemic measures is simple, and estimation of the daily growth rate can be reliably performed using just several samples. of course, this holds for the actual pandemic, where it can be assumed that there are no persons with immunity against this agent. when some immunity exists in the population, or when the number of infected persons approaches the number of susceptible persons, more advanced sis or sir models should be adopted [ ] , [ ] - [ ] . besides, we have checked the prediction accuracy of the exponential daily growth. in table i , the rows denoted mean abs, mean rel, max abs, max rel, log mean, and log max are statistics related to the accuracy of the predictions, where p k is the predicted number of infected persons and o k is the true number of infected persons as announced by the authorities.
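the exact formulas for these statistics were lost in extraction, but the row names suggest standard mean/max absolute and relative errors, plus the same errors computed on log10 counts. the following is a sketch under that assumption, with illustrative predicted and observed counts.

```python
import math

# Assumed definitions of the table-i accuracy statistics: mean/max absolute
# and relative errors between predicted (p_k) and observed (o_k) counts,
# and the same errors on log10 counts. Data below are illustrative.

def prediction_errors(p, o):
    abs_err = [abs(pk - ok) for pk, ok in zip(p, o)]
    rel_err = [e / ok for e, ok in zip(abs_err, o)]
    log_err = [abs(math.log10(pk) - math.log10(ok)) for pk, ok in zip(p, o)]
    n = len(p)
    return {
        "mean abs": sum(abs_err) / n, "max abs": max(abs_err),
        "mean rel": sum(rel_err) / n, "max rel": max(rel_err),
        "log mean": sum(log_err) / n, "log max": max(log_err),
    }

errs = prediction_errors([105, 150, 210], [100, 140, 220])
```

the log-scale errors are the natural companions of a fit performed in logarithmic scale, since they weight early (small) and late (large) counts comparably.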
hereafter, k= corresponds to mar. st . it can be seen that the accuracy gradually improves for the first three intervals (seven days in total), while after that it deteriorates a bit, since it is close to the end of the initial exponential daily growth. also, it can be seen that all estimates of the number of infected persons on mar. st are above the actual number, since the exponential growth was smoothed before the end of march, as already explained. the table gives the date with the largest number of infections, followed by the length of the interval between date and date . it can be seen that for the majority of countries the response to epidemic measures took between - days. it took longer in only three countries. however, these three countries had good control of their outbreaks, with a relatively small number of infections, so these results can be treated as outliers. the final column gives notes on the adopted measures or related results for each of the considered countries. it remains an issue how to predict outcomes after the exponential phase. the prediction model can be changed by employing a piece-wise exponential law or some combination of linear and exponential growth models. as will be seen from the results, both of these models could be used reliably, but an issue is the number of parameters, which can make prediction challenging. namely, for each piece, we need to estimate two or three parameters and the limits of the intervals. dealing with a large number of parameters could lead to significant inaccuracy, so we decided to keep the model as simple as possible. namely, after the period of exponential growth, we used the generalized gamma function model given in [ , ], where Γ(·) is the gamma function, x is the date, and g(x; c, k, γ) is the predicted number of infected persons announced for a given date. the parameters k and γ determine the function shape.
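the exact parameterization of the generalized gamma function used in the paper did not survive extraction, so the sketch below assumes the standard (Stacy) generalized gamma shape, scaled by an amplitude c, in which k and γ jointly control the shape, as stated in the text. this is an illustrative assumption, not the paper's confirmed formula.

```python
import math

# Assumed generalized gamma model (Stacy form scaled by amplitude c):
#   g(x; c, k, g) = c * g * x**(k*g - 1) * exp(-x**g) / Gamma(k)
# With k = g = 1 this reduces to a plain exponential decay c * exp(-x).

def gen_gamma(x, c, k, g):
    """Predicted count for day x under the assumed parameterization."""
    return c * g * x ** (k * g - 1) * math.exp(-(x ** g)) / math.gamma(k)

# Sanity check of the k = g = 1 special case at x = 1: 2 * exp(-1).
value = gen_gamma(1.0, 2.0, 1.0, 1.0)
```

in practice the three parameters would be estimated by maximum likelihood on the smoothed daily counts, as the paper describes, and re-estimated as each new sample arrives.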
in general, larger values of these parameters correspond to a larger mode (maximal number of daily infections), mean value (total number of cases), and spread (duration of the outbreak). the parameter k also describes higher-order function (distribution) moments, i.e., skewness and kurtosis. for a given x, we estimate the parameters c, k, and γ on the interval between mar. th and x, and based on that we perform the prediction, i.e., the number of infected persons in the forthcoming days and the cumulative number of infected persons. the estimates of c, k, and γ are updated after each sample becomes available. a maximum likelihood approach is applied for parameter estimation. note that the obtained estimates are not as stable as in the case of exponential growth, so estimation is performed on smoothed (filtered) data, i.e., the actual number of infections is replaced with the average of the last five days. in this manner, faster stabilization of the prediction is ensured, but as a drawback, smoothing can delay recognition of the reaction to epidemiological measures. firstly, we are interested in checking the behavior of the function after the exponential growth period concluded on mar. th . figure gives the average daily growth for the days after mar. th , where o is the number of cases on mar. th , when the initial exponential growth with an estimated daily increase of about . % was completed. it can be seen from figure that exponential growth continued with an attenuated growth rate of about % (doubling period . days). the source of such an unfortunate response to the measures adopted in the emergency decree of mar. th is unknown. it could be that some establishments continued to work (for example, restaurants until mar. nd ), and the large influx of returnees from the diaspora should also be mentioned.
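the two preprocessing steps just described can be sketched directly: a trailing five-day moving average replaces each day's count before fitting, and the average daily growth over k days relative to a baseline count o is the geometric mean rate. the counts below are illustrative, not the official serbian data.

```python
# (1) Trailing five-day average, used to smooth daily counts before fitting.
def five_day_average(counts):
    return [sum(counts[i - 4:i + 1]) / 5 for i in range(4, len(counts))]

# (2) Geometric-mean daily growth over k days from baseline o0 to o_k.
def average_daily_growth(o_k, o0, k):
    return (o_k / o0) ** (1.0 / k) - 1.0

smoothed = five_day_average([10, 12, 11, 15, 17, 20, 19])   # illustrative counts
growth = average_daily_growth(200, 100, 7)                  # doubling in 7 days
```

note the trade-off mentioned in the text: the moving average stabilizes the parameter estimates but delays the point at which a change in trend becomes visible in the fit.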
in any case, such behavior has been noted in other countries, for example germany [ ]: in the short term, partial lockdown and social distancing measures can only decrease the slope of the exponential growth, not stop it. it is interesting that the maximum of the average daily growth is achieved on day k= (apr. nd ), i.e., days after the stricter measures adopted on mar. nd . again, it demonstrates that about days were required before the adopted measures gave a visible effect. therefore, it is simple to conclude that in the case of any similar outbreak, the abrupt introduction of strong measures can significantly shorten the duration of an epidemic and also alleviate other effects (loss of life, number of infected persons, burden on the health system and intensive care units, economic losses, etc.). for example, if this growth had not been alleviated by other measures, by the end of april the number of infected persons could have reached more than . taking into account the number of citizens and the overall economic situation of the country, such pressure on the health system could be critical. figures and represent results from the two stages, before mar. th with exponential growth and after with the number of cases modeled by the generalized gamma model, in linear and logarithmic scales. each dotted line after mar. th represents the prediction obtained based on the data from mar. th to that date. the thick solid line is the prediction performed at the end of the considered interval (the predicted number of infections for the remaining days of april), obtained with (c ,γ ,k )=( . , . , . ). figure gives estimates of the two parameters describing the generalized gamma function shape, γ and k, and the predicted number of infections for various dates within the interval. the parameter k changes more slowly than γ, meaning that the epidemic control measures influenced γ (duration and severity of the outbreak) more than k (skewness). it can be concluded that the first two sets of epidemiological control measures (mar. th and mar.
nd ) caused a reduction of γ from almost to less than . it can be seen that from date k= (apr. th ) the obtained results become stable, meaning that at least samples were required to get a reliable prediction from the generalized gamma function modeling used to estimate the second part of the outbreak. recall that for the first interval, modeled by the exponential function, only five samples were required to accurately predict the path to catastrophe if proper measures were not adopted. in addition, we have considered how accurate the estimates are within the interval mar. th -apr. th , using the estimation/prediction for the considered date. figure gives average errors (mean absolute, median absolute, max absolute, and mean) for the considered estimates. it can be seen that from date k= (apr. th ), the obtained errors are stable, with a median absolute error of less than . note that the mean value of the error at the end of the interval is - . , i.e., our prediction slightly underestimates the number of cases (the bias is . ), which influences the accuracy of the predicted number of infections. therefore, the generalized gamma function can be used to model epidemiological outbreaks under similar conditions with acceptable accuracy, for the interval in which epidemiological measures begin to give results in curbing the outbreak. we completed the assessment on apr. th , and based on the obtained results, an estimation for the remainder of the outbreak was performed. it should be noted that we do not know anything in advance about any political decision, medical progress, or change in virus behavior. therefore, such predictions should be considered highly speculative. also, as already explained, our final prediction on apr. th is performed on pre-filtered data, with the actual number of infections replaced by the last five-sample averages. firstly, the remainder of the outbreak is predicted based on the parameters c , k , and γ estimated on apr. th .
it is obvious from figure that the estimates of the generalized gamma function parameters change over a relatively wide range before stabilizing on apr. th . three lines are given in figure . the thick line represents the prediction. the two dashed lines present the cases in which the epidemiological situation gradually improves, with the parameters (γ, k) going linearly to ( . γ , . k ), and in which the outbreak continues more dramatically than predicted on apr. th , with the parameters linearly increasing toward ( . γ , . k ). figure gives estimates of the daily number of infections in the interval apr. th to june th . the results are summarized in table iii . an additional column in table iii gives the potential influence of bias in the prediction, as previously explained. an interesting point is that for the rest of april all three predictions give similar results. our model from apr. th predicted new infections, while the other two predictions give results in the range [ , ] . however, there is a drastic difference in may. our prediction gives infections, with range [ , ] . strict measures and their enforcement are necessary, since the difference between the most favorable situation and the worst-case scenario could be substantial. finally, the difference for june is even more pronounced (partially influenced by the expected inaccuracy of the extrapolation process): in the most favorable case the entire outbreak could be expected to be completed, without novel cases at the end of june, while in the worst-case scenario about (or, with bias, ) novel infections could be expected at the end of the month (more than in total in june). fortunately, the paper revision was performed after the state of emergency was abolished on may th due to the favorable epidemiological situation in the country. a large portion of the social distancing measures remain, but the lockdown policy has been abolished, with partial re-establishment of local transportation, the reopening of many businesses, etc.
large gatherings are still prohibited, with state borders and educational institutions closed. the partial opening of kindergartens is in progress. thus, we are in a position to assess the accuracy of the simple predictive model based on data up to apr. th . we have updated figure with the novel data, represented with stars in figure . as can be seen, the daily number of infections oscillates significantly around the prediction, demonstrating that modeling the second part of the outbreak requires data preprocessing (filtering), as done in section iv.b. it can be seen that the daily number of infections is significantly below the prediction for the first time on apr. rd (k= , with cases), exactly days after the second long-weekend curfew, while the second drop below the prediction lines is on may th (k= , with infections), exactly days after the long curfew during orthodox easter. at first glance, it seems that these wuhan-style curfews (widely criticized by the public) had an excellent effect on containing the sars-cov- outbreak. the accuracy of the prediction can be assessed better from figure , where the cumulative number of infections for the period between apr. th and may th is given together with the predictions. the developed simple predictive model gives relatively good accuracy. the obtained results are within the predicted range on may th (and within these limits for may nd - th ). however, the prediction is within wider limits when bias (dotted line) is taken into account, from apr. rd toward the end of the state of emergency. note that the relative error in the prediction of the cumulative number of infections for the interval apr. th -may th is only %, where p k is the predicted and o k is the obtained number of cases for the kth day (apr. th is k= , while may th is k= ).
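the formula for this relative error did not survive extraction; a natural reading, consistent with comparing cumulative curves, is the relative difference between the cumulative predicted and obtained totals over the assessment window. this sketch assumes that definition, with illustrative daily counts.

```python
# Assumed definition: relative error (%) between the cumulative predicted
# and cumulative observed counts over the assessment interval.
# Daily counts below are illustrative, not the official data.

def cumulative_relative_error(predicted, observed):
    p_total, o_total = sum(predicted), sum(observed)
    return 100.0 * abs(p_total - o_total) / o_total

err = cumulative_relative_error([300, 280, 260], [310, 300, 250])
```

summing before taking the error makes the metric tolerant of the day-to-day oscillations noted above, which is why the cumulative comparison looks much smoother than the daily one.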
taking into account the abolishment of the state of emergency, the estimated parameters of the generalized gamma function cannot be used to predict future trends in curbing the spread of the infection, since the modeling was performed on data obtained in a period with strict epidemiological measures. we have modeled the sars-cov- outbreak in serbia with an exponential growth model in the initial stage and a generalized gamma model in the second stage. it has been shown that exponential growth can be estimated easily with just a few samples, and it suggests a path to catastrophe if serious epidemiological measures are not enforced in time. it is more difficult to model the part of the outbreak after the initial exponential growth slows. for the first stage we needed only five samples, while the second stage required data pre-processing (simple filtering) and more available samples (in our case at least ). it is demonstrated that - days are required before epidemiological measures give results. in the case of serbia, the initial exponential growth can be followed for days after the introduction of the state of emergency, and the maximal average daily growth after this interval is reached about days after the second set of restrictions from mar. nd. finally, we have predicted the rest of the outbreak. in the process of revising the paper, we have compared the prediction with actual data up to the end of the state of emergency on may th. on jan. th (after only days of strict measures adopted in wuhan and neighboring cities in china), control of passengers arriving from china was enforced. on feb. th the government issued a decree establishing stricter control of passengers coming from italy and prohibiting organized school trips to that country (italy is a popular destination for annual trips of high-school pupils). however, this measure was light, since direct air connections to milano and rome were kept almost intact until all air connections were cut on mar. th.
controls at the airports were light, without check-ups or testing of incoming passengers. this was probably a significant mistake since, according to the authorities, arrivals from west european countries were the main vectors in the early days of the outbreak in serbia; stricter measures for passengers from these destinations could probably have delayed the local outbreak by a week and significantly reduced its further impact. two days later, on feb. th, additional controls at road and railway border-crossings were established. additional sanitary controls were established at the nikola tesla (belgrade) and konstantin the great (niš) airports, checking more passengers coming from china and italy. on mar. th, after a significant increase in the number of cases in serbia, stricter measures were adopted. they were similar to those in many other countries, with slight variations in dynamics. on mar. th, meetings of more than persons in closed spaces were forbidden (this number was gradually decreased), strict control of passengers from china, korea, iran, and some regions of italy and switzerland was introduced, smaller border crossings were closed, all persons employed in services on main transit routes were obliged to wear facemasks and gloves, … an appeal to the public to postpone and avoid family ceremonies and gatherings was largely ignored, causing numerous infections, especially in the Čačak and valjevo municipalities. two days later, after dramatic news from other countries and the emergence of more cases in serbia, entrance to the country was forbidden for foreign citizens from china, korea, switzerland, italy, iran, romania, spain, germany, france, austria, slovenia, and greece. serbian citizens returning from these countries were obliged to stay in self-isolation for days. on mar. th the state of emergency was declared.
some of the adopted measures were: closure of all educational institutions indefinitely; gradually, through several decisions in the following days, a curfew for citizens in vulnerable categories over years of age, with an exception on sundays between am and am (later this window was changed several times) for buying necessities in selected shops; a curfew for other persons for non-essential activities, gradually extended from pm - am to pm - am and further to pm - am on weekends; a complete lockdown on weekends lasting up to hours; closed state borders; termination of all non-cargo transport; etc. however, under this set of measures restaurants and similar establishments remained open (with hours adjusted to the curfew) until mar. nd. on mar. th the curfew was extended to pm - am over weekends, and dog walks of minutes between pm and pm were prohibited (re-introduced later). on mar. th the minister of health found that there were still too many people on the streets, and officials announced plans for a stricter lockdown in the cities with the most severe outbreaks (at that time belgrade, niš, and valjevo were the most critical). however, all measures were applied to the entire country, without specific measures for the most affected cities. local outbreaks appeared city by city, region by region, from pančevo in the north toward novi pazar in the south-west of the country. in addition to the restrictive measures, it seems that the general public gradually became more aware of the consequences of private gatherings, since a significant number of infections, some with terminal outcomes, were traced back to such gatherings. after the positive effects of these measures and the reduction in the number of cases, the state of emergency was abolished on may th.
figure . connections from belgrade airport, according to flightconnections. key: cord- - a pu authors: beigel, r.; kasif, s. title: rate estimation and identification of covid- infections: towards rational policy making during early and late stages of epidemics date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: a pu pandemics have a profound impact on our world, causing loss of life, affecting our culture and historically shaping our genetics. the response to a pandemic requires both resilience and imagination. it has been clearly documented that obtaining accurate estimates and trends of the actual infection rate and mortality risk is very important for policy makers and medical professionals. one cannot estimate mortality rates without an accurate assessment of the number of infected individuals in the population. this need is also aligned with identifying the infected individuals so that they can be properly treated, monitored and tracked.
however, accurate estimation of the infection rate, locally, geographically and nationally, is important in its own right. these infection-rate estimates can guide policy makers at the state, national or world level toward better management of risk to society. the decisions facing policy makers are very different during the early stages of an emerging epidemic, where the infection rate is low; the middle stages, where the rate is rapidly climbing; and the later stages, where the epidemic curve has flattened to a low and relatively sustainable rate. in this paper we provide relatively efficient pooling methods to both estimate infection rates and identify infected individuals in populations with low infection rates. these estimates may provide significant cost reductions for testing in rural communities, third-world countries and other situations where testing is expensive or not widely available. as we prepare for the second wave of the pandemic, this line of work may provide new solutions for both the biomedical community and policy makers at all levels. covid- is a deadly disease caused by the sars-cov- rna virus. this novel coronavirus created an epidemic of global proportions, killing over , people worldwide (as of april th , ) and infecting millions. it also caused a medical crisis and unprecedented disruptions that, in the long term, are likely to increase the risk of multiple socio-economic downturns associated with both mortality and chronic disease. the pandemic created many urgent problems, such as the development of antiviral medications, vaccines, and inexpensive and widely available testing capability, the expansion of er and icu capabilities, and much more. however, a proper response to the pandemic requires estimates of the rate of infection via testing. the challenge of rapid testing has created an outstanding community response from industry and academic centers producing tests to diagnose and identify infected patients.
the majority of these tests rely on very well-established pcr-based techniques. newer tests are based on isothermal amplification and, most recently, crispr-based methods using cas- . these tests can deliver a result in minutes to hours. high-throughput sequencing is also an option for large-scale testing. in addition to the biomedical crisis, the virus has also created major challenges for policy makers, who need to make life-and-death decisions based on ethical, clinical and economic factors of unprecedented importance. a full shutdown of the economy and forced social distancing create colossal stress on the economy, unemployment, and the collapse of many business sectors. opening the economy would increase risk for the aging population and people with pre-existing conditions such as diabetes, cancer, cardiovascular disease, respiratory disease, and immunodeficiency. the decisions facing policy makers are very different at the beginning of an emerging epidemic, where the infection rate is low, vs. the middle stages (when the rates are rapidly climbing) and the later stages, where the epidemic curve has flattened to a low and relatively sustainable rate. during the early stages it is relatively inefficient to test millions of patients who might be suffering from symptoms caused by rsv or influenza. at the same time, it is important to monitor any unexpected turns in the progression. in many rural communities the rates usually remain low, except for local bursts that need to be contained. similarly, in third-world countries it is economically and practically impossible to perform many tests early on. mathematically, the problems of identifying infected individuals (identification) and estimating the total number of infected individuals in a given population (infection rate) are related, but can in fact be addressed by subtly different algorithms to reduce the number of tests needed and thereby the total cost of testing.
these methods generally rely on a well-studied area called combinatorial group testing. however, as we will demonstrate in this brief communication, estimating the number of infected individuals can be solved by a novel adaptation of methods developed in theoretical computer science for approximate counting. here we refer to these methods as aca (approximate counting algorithms). more specifically, we describe comparatively efficient methods enabling estimation of the total number of infected individuals. intuitively, we pool samples from multiple individuals and repeatedly test these pools for the virus with a single test (or perhaps a small constant number of tests, to achieve better sensitivity). we provide a detailed analysis of the accuracy of the approximate counting procedure, with both theoretical analysis and simulation. as a simple example, if the infection rate in a population of , , is around %, we can produce an unbiased estimate of the infection rate with variance approximately · − by making group tests. in contrast, testing a sample of individuals to estimate the infection rate would produce an estimate whose variance is . · − . our variance is approximately times smaller. in addition to rate estimation, we provide a review and analysis of several identification algorithms that can be deployed in communities with low infection rates, achieving a reasonable improvement over the standard group testing algorithms previously explored. we first abstract the key computational problems as follows:
• estimate the rate of infection in the population, or approximately count how many people test positive in a population of a given size, with as few partially pooled tests as possible.
• identify the people who are positive with as few partially pooled tests as possible.
we focus on these problems in the context of the population actively suffering from the disease, rather than post-infection and recovery.
we assume the testing is done using established genomic testing procedures. while similar methods are feasible for detecting individuals that have or have already had the disease (e.g., via antibody-presence testing), we expect this number to be significantly higher, and our methods are comparatively more effective for lower fractions of individuals testing positive. batch testing assumption: given a set of samples from a set s of people (infected or not), it is technically feasible to form a single batch consisting of all the samples and test that batch with a single test. the batch will test positive if and only if at least one person in s is positive. this assumption is reasonable for small batch sizes (e.g., ≤ |s| ≤ ). we will provide methods to alleviate this technical restriction on batch size in the discussion. henceforth, n will refer to the (known) number of people (the total size of the population) and k will stand for the (possibly unknown) number of people who will test positive. thus k/n is the infection rate. we study two problems: identification of infected individuals, and estimating k by approximate counting. typically, one estimates the infection rate by sampling individuals. this estimate of the rate, using the sample mean, is known to be highly inaccurate when the probability of infection p is small, because the variance of the estimator is much larger than p . using group testing, we will produce estimates whose variance is asymptotically proportional to p . therefore our proposed methods are superior to sampling individuals when p is small.
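the batch testing assumption can be stated in a few lines of code: a pooled test comes back positive if and only if at least one sample in the batch is positive. the sketch below is our own illustration (the function name and data layout are assumptions, not part of the paper).

```python
def batch_test(status, pool):
    """Simulate one pooled test: positive iff at least one sample in
    the batch is positive (the batch testing assumption)."""
    return any(status[i] for i in pool)

# a population of 8 people in which only person 5 is positive
status = [False] * 8
status[5] = True

assert batch_test(status, [0, 1, 2, 3]) is False  # batch with no positive sample
assert batch_test(status, [4, 5, 6, 7]) is True   # person 5 makes the batch positive
```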
we will define a random variable y such that:
• y can be calculated by making ⌈log n⌉ batch tests
• e(y) = Θ(k), i.e., e(y) is provably asymptotically proportional to k
• e(y) can be computed exactly given n and k
• using linear regression we can find constants a and b such that p ≈ e((ay + b)/n)
• the variance of the estimator (ay + b)/n is o(p ).
in contrast, the variance of the sample mean estimator is p( − p)/⌈log n⌉. when p is small, our estimate is much more accurate than sampling individuals. we will also define a random variable w such that:
• w can be calculated by making m⌈log n⌉ batch tests, where m is a parameter
• w is the arithmetic mean of m independent copies of y
• we will use w similarly to obtain a nearly unbiased estimator for p whose variance is o(p /m)
• in practice this is better than using the arithmetic mean of m independent copies of y .
some identification algorithms based on batch testing are already known. we design two new highly parallel algorithms that are efficient for small p. the batch size for these algorithms is not fixed, but instead can be chosen optimally by a calculation based on the estimated infection rate. we will describe how to choose an algorithm and its batch size given n and an estimate for k. one particularly favorable aspect of these algorithms is that they use very few rounds of group testing, which makes them easier to implement in practice than competing methods. to provide an intuitive example illustrating the principle of group testing methodologies, we begin with a magic trick that is folklore in popular mathematics (figure ). we ask the reader to think of a number x between and , e.g., x = . we now perform five binary tests on groups we carefully design. each test returns if the number x is in the group and otherwise. in this case the result of the tests would be = , thereby identifying the hidden number. the tests and their results are listed in figure for completeness.
_________________________________________________________________________
is your number in this set: { , , , , , , , , , , , , , , , }? _ _
is your number in this set: { , , , , , , , , , , , , , , , }? _ _
is your number in this set: { , , , , , , , , , , , , , , , }? _ _
is your number in this set: { , , , , , , , , , , , , , , , }? _ _
is your number in this set: { , , , , , , , , , , , , , , , }? _ _
_________________________________________________________________________
figure : the magic trick: the answer is = . the copyright holder for this preprint is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license. this version posted may , . https://doi.org/ . the magic trick algorithm is described in full below. number the people through n − . write those numbers in binary. number the bits right to left starting from . (the i-th bit is in position i of the binary representation; alternatively, bit_i(x) = (x div 2^i) % 2.)
• let m = ⌈log n⌉. all logarithms are base 2 unless otherwise specified.
• let x be the number whose binary representation is b_{m−1} b_{m−2} ··· b_0. person number x is the one who is testing positive, as revealed by the magic trick above.
_________________________________________________________________________
complexity analysis: each person's sample is divided into ⌈log n⌉ samples. exactly ⌈log n⌉ tests are performed, and they can all be performed in parallel. the logistics of producing the pools (batches) is not considered in this paper and can be handled by robotics or microfluidic platforms. the size of the batches is n/2, which may potentially pose sensitivity and engineering challenges. however, when k = n/ , information theory tells us that at least n − . log n − tests are required, so we cannot do significantly better than just testing every individual.
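the binary-encoding identification scheme above (the magic trick) can be sketched in python: pool i contains every person whose i-th bit is set, and with a single positive individual the pooled results spell out that person's index in binary. the function names below are ours, not the paper's.

```python
import math

def bit_pools(n):
    """Pool i contains every person x (0 <= x < n) whose i-th bit is 1."""
    m = math.ceil(math.log2(n))
    return [[x for x in range(n) if (x >> i) & 1] for i in range(m)]

def identify_single_positive(status, n):
    """Recover the unique positive person from ceil(log2 n) pooled tests:
    the pooled results are exactly the bits of that person's index."""
    x = 0
    for i, pool in enumerate(bit_pools(n)):
        if any(status[j] for j in pool):  # pooled test on pool i
            x |= 1 << i                   # set bit i of the answer
    return x

# for n = 32 each pool holds 16 numbers, matching the card trick above
assert all(len(pool) == 16 for pool in bit_pools(32))

status = [False] * 32
status[19] = True
assert identify_single_positive(status, 32) == 19
```

all tests can be run in parallel, since the pools are fixed in advance.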
we now describe approximate counting algorithms that use pools of samples to estimate infection rates accurately. we also provide sketches of the complexity analysis, giving bounds on the number of tests needed to estimate these rates. as alluded to before, we focus on testing populations with a relatively low disease rate. approximate counting algorithm (aca ): number the samples randomly through n. choose independently subsets of size ⌈n/2⌉, ⌈n/4⌉, ⌈n/8⌉, ⌈n/16⌉, …, 1. let y = the number of subsets that test positive. then e(y) ≈ log(k), where k is the number of infected individuals. complexity analysis: each person provides a single sample. ⌈log n⌉ tests are performed, and they can all be performed in parallel. the largest batch size is ⌈n/2⌉, which may pose some challenges, as mentioned above. if k is not very small, it is still possible to deal with the large-batch-size issue. for example, if the maximum allowed batch size is b, then we could assume that all batches larger than b individuals would give the same test result as the largest batch. if all batches test negative, then we would estimate that the infection rate is less than 1/b. we can run aca several times to produce a more accurate estimate w, which is discussed in the results section. in this section we will follow up on the rate estimation algorithms from the previous section and develop exact identification algorithms (both probabilistic and deterministic) for infected patients in a population with a low infection rate. we will present three algorithms, analyze their comparative performance, and use each judiciously for the appropriate infection rate estimated in the previous section.
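the nested-subset scheme of aca can be sketched as follows; this is our own simulation, with `random.sample` standing in for physical pooling, and we assume subset sizes halving from about n/2 down to 1, so that y counts how many of the ⌈log n⌉ pooled tests come back positive.

```python
import random

def aca(status):
    """One run of the approximate counting algorithm: test independent
    random subsets of sizes ~n/2, n/4, ..., 1 and return Y, the number
    of subsets that come back positive.  E(Y) grows roughly like log2(k)."""
    n = len(status)
    people = list(range(n))
    y = 0
    size = (n + 1) // 2
    while size >= 1:
        pool = random.sample(people, size)   # one batch
        if any(status[i] for i in pool):     # one pooled test
            y += 1
        size //= 2
    return y

random.seed(0)
n, k = 1_000_000, 10_000                     # a 1% infection rate
status = [i < k for i in range(n)]
y = aca(status)                              # 19 pooled tests in total
assert 10 <= y <= 19                         # log2(10_000) is about 13.3
```

large subsets are almost surely positive and tiny ones almost surely negative, so y pins down the order of magnitude of k with very few tests.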
we note that algorithm is the natural approach recently used in nebraska with pool size s = , without optimizing the choice of pool size to reduce the number of tests. to be fair to this algorithm, the sensitivity of testing in very large pools may be reduced without proper optimization, and we therefore present algorithm (small pool size) below for completeness (fig ). our first new identification algorithm (pia ) is given in fig . analysis: the analysis of algorithm pia is given in the short summary below.
• the probability that a pool contains positives is .
as an example, consider this analysis with n = , k = , and s = . this is roughly the population of a small town with infection rate . . we test pools to start. on average, . pools will contain positives, . will contain exactly positive, and . will contain more than positive. we apply the magic trick algorithm on . pools of size , which takes . log( ) = . tests. we check the results of the magic trick algorithm with . · = . tests. then we test the remaining . people. on average, the number of tests performed is . . on average, . * = . people are classified in the st round, . * = people are classified in the rd round, and the remaining . people are classified in the th round. on average, we obtain an individual's test result in . rounds. we might test the remaining people recursively, but that is harder to analyze because the expected number of remaining tests is not linear in the number of remaining people. recursion also increases the average and worst-case time to obtain an individual's test result. we would therefore use smaller pools than . for example, if we use pool size , we make only . tests on average. in fact, pool size is optimal.
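the effect of pool size can be illustrated with the classical two-stage (dorfman) scheme that the natural algorithm resembles: test pools of size s, then retest each member of every positive pool individually. the expected number of tests per person is 1/s + 1 − (1 − p)^s, and minimizing it shows that the optimum depends on the rate p rather than on n. the code is our own illustration, not taken from the paper.

```python
def dorfman_tests_per_person(p, s):
    """Expected tests per person for two-stage pooling with pool size s:
    one pooled test shared by s people, plus s retests whenever the pool
    is positive (which happens with probability 1 - (1-p)**s)."""
    return 1.0 / s + 1.0 - (1.0 - p) ** s

def optimal_pool_size(p, max_s=100):
    """Pool size minimizing the expected number of tests per person."""
    return min(range(2, max_s + 1), key=lambda s: dorfman_tests_per_person(p, s))

# the optimum shifts with the infection rate, not with the population size
assert optimal_pool_size(0.01) == 11   # about 0.196 tests per person
assert optimal_pool_size(0.05) == 5
```

at p = 0.01 the scheme needs roughly a fifth of the tests that individual testing would, consistent with the savings discussed above.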
we calculated the optimal pool size for several (n, k) pairs. empirically, the optimal pool size seems to depend primarily on k/n. when k/n = . , the optimal pool size is or in the examples we tried. we now improve on the previous algorithms in this section and present an improved probabilistic identification algorithm (pia ) for a specific range of infection rates. the algorithm and its analysis are given below in figure . run algorithm on s to identify two people y and z; test y, z, and … the analysis of the pia algorithm is provided in the list below.
• the probability that a pool contains positives is .
to summarize, on average the number of tests performed by algorithm is:
example : consider n = , and k ≤ . pia with s = is the best choice. the expected number of tests is . . for comparison, algorithm performs . tests on average with its optimal pool size.
example : n = , and k ≤ . pia with s = is the best choice. the expected number of tests is . . for comparison, pia performs . tests on average.
we now provide a deterministic identification algorithm applicable to populations with a small infection rate (fig ). a brief outline of the efficiency of the methods is provided in the list below.
• identification given k = : ⌈log n⌉ tests in parallel (magic trick algorithm)
• identification given k ≤ : ⌈log(n + )⌉ tests in parallel (left to the reader)
• identification given k ≤ : . log n − tests in log log n − rounds (described below)
analysis: let t(n) denote the number of tests made by da . assume log n is a power of 2.
• t(n) ≤ t(√n) + t(√n) +
• t( ) =
solving the recurrence, we find t(n) ≤ . log n − when log n is a power of 2. example: what is t( )?
• t( ) = t( ) + t( ) + =
• t( ) = t( ) + t( ) + =
• t( ) = t( ) + t( ) + = t( ) + t( ) + =
note that t( ) < t( ) + . additional examples:
• t( ) = t( ) + t( ) + =
• t( ) = t( ) + t( ) + = t( ) + t( ) + = t( ) + t( ) + =
we first present a few empirical findings on the approximate counting algorithms for infection-rate estimation, which generally support our ability to produce relatively accurate estimates using the methods provided in the previous section. figure : this graph displays the nearly linear relationship between k and e(y). in order to exploit this relationship in practice, we simply run a program that calculates e(y) for a known value of n and various values of k, and then perform a linear regression. figures - display the comparative accuracy of our pooling algorithms for estimating infection rates. in particular, figure demonstrates the reduction in variance obtained by running aca multiple times. in order to estimate the infection rate we run aca m times. let w be the average of those results. the number of infected individuals k is linear in w. we compute a linear regression to determine constants such that k ≈ a w + b. then the infection rate, p, is approximately (a w + b)/n. we refer to this algorithm as maca . (in practice it is better to perform linear interpolation using two adjacent data points; the linear formula, however, is needed in order to estimate variance.)
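the variance reduction behind maca (averaging m independent copies of y) can be checked empirically. the simulation below is our own, with assumed parameter values; it compares the spread of single-run counts against their m-run averages.

```python
import random
import statistics

def aca_y(status):
    """One run of the nested-subset counting scheme (see the methods)."""
    n = len(status)
    people = list(range(n))
    y, size = 0, n // 2
    while size >= 1:
        if any(status[i] for i in random.sample(people, size)):
            y += 1
        size //= 2
    return y

def maca_w(status, m):
    """W = arithmetic mean of m independent copies of Y, so Var(W) = Var(Y)/m."""
    return statistics.mean(aca_y(status) for _ in range(m))

random.seed(1)
n, k = 1024, 10                        # roughly a 1% infection rate
status = [i < k for i in range(n)]
ys = [aca_y(status) for _ in range(100)]
ws = [maca_w(status, 8) for _ in range(100)]
assert statistics.variance(ws) < statistics.variance(ys)  # averaging shrinks the spread
```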
we now summarize the results of our probabilistic identification algorithms, which we refer to as pia#. we start by providing a sample of our empirical findings using a few selected examples in table . the full graph is provided in figure for a relatively large population size.
p        population size    best algorithm
p = .      ≤ n ≤            pia
p = .      ≤ n ≤            pia
p = .      ≤ n ≤            pia
p = .      ≤ n ≤            pia
p = .      ≤ n ≤            pia
p = .      ≤ n ≤            individual testing (n tests)
in figure , we graph the behavior (expected number of tests) of pia , pia , and pia when n = and < p < . . we also observe that the optimal pool size seems to depend mostly on p. an information-theoretic lower bound on the number of tests required is log2(n choose k). when p ≥ . , we found that the number of tests performed is usually less than . · log2(n choose k). figure : the expected number of tests performed by different algorithms for different infection rates p. pia is graphed in yellow, pia in red, and pia in blue. pia generally outperforms the other algorithms when p ≤ . . in this graph n = , , . the covid- pandemic produced a very high toll worldwide and created technological, biomedical and clinical challenges. in this paper we described a cost-effective application of approximate counting methods to estimate infection rates.
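the information-theoretic lower bound referred to above is the counting bound log2(n choose k): with fewer yes/no pooled tests there are not enough distinct outcome patterns to separate all possible sets of k positives among n people. it can be computed directly (our own helper, for illustration):

```python
import math

def lower_bound_tests(n, k):
    """log2(n choose k): the minimum number of yes/no tests needed to
    distinguish all possible positive sets, by a counting argument."""
    return math.log2(math.comb(n, k))

# 10 positives among 1,000 people: about 78 tests are unavoidable,
# far fewer than the 1,000 individual tests
assert 77 < lower_bound_tests(1000, 10) < 79
```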
these methods can help in making crucial and rapid policy decisions with reduced investment, especially in the early or late stages of an epidemic or in underserved communities, e.g., rural counties or third-world countries. we also reviewed several identification algorithms and introduced new ones that provide good to modest improvements in the number of tests for different infection rates. in some cases our methods reduce the number of tests by a factor of two, which is significant. estimating the infection rate without identification has many applications. these estimates can help policy makers determine appropriate guidelines for opening or closing the economy or implementing different types of social distancing procedures. the new method lends itself naturally to performing a very small number of tests on autopsy samples and is potentially useful in counting the number of individuals who were infected prior to death. we also note that our proposed counting method can, speculatively, be extended to large-scale vaccine trials. a large population can be vaccinated and another population used for placebo; we can then use our cost-efficient estimates to compare the rates in the two cohorts. in fact, we expect our variance estimates to be even better than reported above. our specific identification procedures for infected individuals improve on the number of tests that must be conducted, saving costs, especially when infection rates are low and most tests are negative. notably, we use a very small number of rounds compared to the best competing algorithms, enabling quicker turnaround. in this paper we primarily focused on established methods such as pcr or isothermal amplification, which have been shown to produce the most reliable diagnostics in the past. we have not considered multiplex-pcr as an alternative. there are newer diagnostic procedures based on crispr (cas- ) under development. these methods vary in sensitivity, availability and cost.
in theory, it is also possible to barcode every sample with a genomic tag, pool the tagged sequences into a very large multiplexed dna sample, and submit the sample for high-throughput sequencing. standard procedures would then allow us to read the sequences and allocate any detected viral dna to the appropriate individual using the tags. error-correcting procedures can be deployed to produce an efficient library of tags. however, high-throughput sequencing presents its own challenges, and we therefore focused on pooling samples and testing using more traditional approaches. these established methods might also be easier to deploy in third-world countries or rural communities that do not have access to high-throughput solutions. this work can be extended in multiple useful directions, both mathematically and technologically. it is highly timely for careful consideration and possible deployment given the expected flattening of the infection curve as we approach the summer of and the potential need to detect the disease in the fall of . the formulas in this section were used in:
1. proving our theoretical results about the expected value and variance of y and w
2. creating tables of e[ y ] for a given n (and all k) and of e[ w ] for given n and m (and all k)
3. creating tables of e[ y ] and e[ w ]
the tables of e[ y ] and e[ w ] make it possible to estimate k via linear interpolation. the tables are also used with linear regression to obtain our empirical findings about expectation and variance. formulas used with aca :
formulas used with maca :
• for any c > , [c ] ( c )p )
one test. this is how you get there
evaluation of group testing for sars-cov-2 rna
survey, foundations and trends in communications and information theory
an optimal procedure for gap closing in whole genome shotgun sequences
multi-node graphs: a framework for multiplexed biological assays
muplex: multi-objective multiplex pcr assay design
rapid molecular detection of sars-cov-2 (covid-19) virus rna using colorimetric lamp (yinhua zhang)
massively multiplexed nucleic acid detection using cas
the complexity of approximate counting
counting large numbers of events in small registers

key: cord- -o libq d authors: grinfeld, m.; mulheran, p. a. title: on linear growth in covid-19 cases date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: o libq d

we present an elementary model of covid-19 propagation that makes explicit the connection between testing strategies, rates of transmission, and the linear growth in new cases observed in many parts of the world. an essential feature of the model is that it captures the population-level response to the infection statistics provided by governments and other organisations. the conclusions from this model have important implications regarding the benefits of wide-spread testing for the presence of the virus, something that deserves greater attention.

apart from being a world-changing calamity, the present novel coronavirus pandemic is an intellectual challenge for biologists, statisticians and applied mathematicians. modelling efforts that purport to predict the course of the pandemic and the effect of public health policies usually take the form of substantial individual-based models and are implemented in code running to thousands of lines.
their predictive ability is disputed, but it is doubtless that they do not help us to understand the pandemic. we suggest exactly the opposite: we formulate an essentially two-equation model of one aspect of the pandemic, and claim that it can very simply explain the following puzzling phenomenon: in many countries the rate of appearance of new cases is linear. as an example, we present the data for sweden in figure [ ]. in fact, sweden is a good case to work with as there are no complications to do with lockdown; similar graphs can be created, for example, from the data for the state of georgia [ ], among many others.

the modelling of an epidemic on the population level usually divides it into cohorts such as susceptible, infective and recovered (a so-called sir model), plus possibly some further sub-populations (e.g. asymptomatic, or exposed in seir models) [ ]. the evolution of these cohorts is then modelled using rate equations that include the probability that the disease is transmitted through random contacts between them, amongst other events. in doing this, time is considered as a continuous variable. however, in order to understand the linear growth phenomenon mentioned above, we believe that it is essential to include the public response to the data that are usually made available on a daily basis. indeed, we would argue that capturing the response of the population to the information stream is essential if the model is to be of use in truly understanding the pandemic. therefore, in our approach we consider time as a discrete variable measured in days, and develop a model for the day-to-day evolution of the number of infectives. although unusual, this approach has been successfully used elsewhere in epidemiology; for a recent example, please see [ ].
we derive, in its simplest and most illuminating form, a system of two difference equations for the rate of growth of new positive test results and the number of people that have been exposed to the virus; that is, we neglect the asymptomatics. the time variable n that we use is measured in days. we denote the average latent period (here and below we use epidemiological data from [ , ]) by l; it is about a week. it is known (again, see [ , ]) that individuals start shedding virus, and so are infective, very soon ( - days) after exposure and about - days before the appearance of symptoms. once the simplest model is derived, we consider the case with asymptomatics, which does not offer any substantial new illumination but is more realistic.

let us call the number of positive tests on day n, t(n). then, not taking into account false positives and negatives, but not assuming that every person showing active symptoms is tested,

t(n) = q j(n) + d_s(n).

here j(n) are the people who have shown covid-19 symptoms on day n, qj(n) is the fraction of those who have been tested on that day, and d_s(n) are the positively testing members of the public who show no symptoms (perhaps yet). the subscript s is to indicate that this is only from a sample. now, if the rate of testing is p, d_s(n) = p d(n), where d(n) is the population number of people who carry virus detectable by pcr but do not yet show symptoms. let us denote by e(n) the people who got exposed on day n. in a model without asymptomatics,

j(n) = e(n − l),

where l is the latent period and for simplicity we have assumed that people become infectious immediately after exposure. now we need to model the dynamics of e(n). (a ) since the numbers of infectives are very small compared to the number of susceptibles s(n), we assume that the number of susceptibles is roughly constant and that s(n)/n ≈ 1, n being the total population size.
(a ) we assume that the number of infectives available for infecting the rest of the population on day n is approximately d(n), as p is small (of the order of · − in the uk). thus, we assume that the moment a person shows symptoms of the disease, she is removed from circulation by hospitalisation or quarantine. that is, we are making an assumption of perfect isolation. it follows from the reasoning above that the people who are exposed at time n, e(n), are determined from the following simple equation:

e(n + 1) = f(·, d(n)),

where the other arguments of f will be discussed later. (a ) we assume that f(·, d(n)) can be written as f(·, d(n)) = c g(·) h(d(n)). the constant c, which hethcote and van den driessche [ ] call the "constant contact rate", in our view should only incorporate the probability that an encounter between a susceptible and an infective leads to disease. so presumably wearing face masks or other personal protection measures would be expected to reduce c. we take h(d(n)) to be a monotone increasing bounded function with the properties that h(0) = 0 and h is bounded, e.g. a function of michaelis-menten type; see [ ] for other examples. (a ) we assume that the function g expresses the information stream of the population, that is, it is the translation of the information that people have into behavioural strategies governing the contact rates of the population (the same function also governs the contact rates between two susceptibles, as there is no sure-fire way to determine, in a contact between people not showing symptoms, who is infected and who is not). (a ) we assume that the information stream is dominated by the rate of increase of the numbers of new positive tests.
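the accounting behind the model can be sketched in a few lines; the relations t(n) = q j(n) + p d(n) and j(n) = e(n − l) are reconstructed from the surrounding text, and the parameter values and the window used for d(n) are illustrative assumptions of mine.

```python
# sketch of the reconstructed observation model; q, p and the seed
# exposure history are assumptions, not the authors' calibration
L_LATENT = 7   # latent period l, "about a week"
q = 0.8        # fraction of symptomatic cases tested (assumed)
p = 0.002      # random-testing rate of the asymptomatic public (assumed)

def daily_positives(E, n, l=L_LATENT):
    """Positive tests on day n given a daily exposure history E."""
    J = E[n - l] if n >= l else 0            # newly symptomatic: exposed l days ago
    D = sum(E[max(0, n - l + 1):n + 1])      # exposed within the last l days, presymptomatic
    return q * J + p * D

E = [100] * 30                               # constant daily exposures, for illustration
print(daily_positives(E, 29))                # q*100 + p*(l*100)
```

with constant exposures the daily count of positives is constant, which is exactly the steady state the paper analyses below.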
it would be interesting to investigate models in which g is a function of more than the last day's data, or of undominated maxima in the number of new cases, but we assume here for simplicity that r(n) is a reasonable proxy for the information stream. in other words, we assume that g(·) is a function of r(n). common sense suggests that it is a monotone decreasing function defined on r+. a possibility is

g(z) = a/(1 + bz^r), r ≥ 1.

then g(0) = a can be interpreted in terms of the norms of sociability in a population. the logic is that if the public is aware of a high rate of increase in new cases, it becomes more risk-averse. note that the information stream is in terms of what is publicly known. thus the dynamics of the exposed cohort is governed by

e(n + 1) = c g(r(n)) h(d(n)).

what we have to say about covid-19 is then summarised in one sentence: if the information stream is based on the number of new cases (for which r(n) is a proxy) and quarantining/hospitalisation of symptomatic cases is perfect, a linear increase in the number of positive tests is to be expected. this is obvious: clearly from ( ), at a fixed point (r*, e*) we have c g(r*) h(l e*) = e*, which, apart from the fixed point (r*, e*) = (0, 0) in which the disease is stopped, may admit a unique non-trivial fixed point whose e-component solves c g(r*) h(l e*) = e* with r* = (q + pl) e*. (if such a fixed point does not exist, the epidemic disappears.) if this fixed point is stable, the number of positive tests necessarily grows linearly, and from that rate of growth the number of newly exposed people can be estimated. note that the value of this equilibrium rate is an increasing function of c and also of l, since the function g is monotone decreasing.
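iterating the reconstructed system makes the claim concrete: daily positives settle to a constant, so cumulative positives grow linearly. the functional forms of g and h and every parameter value below are illustrative assumptions, not the authors' calibration.

```python
# iterate e(n+1) = c*g(R(n))*h(D(n)) with illustrative choices of g and h
l, c, q, p = 7, 10.0, 0.8, 0.002
g = lambda z: 20.0 / (1.0 + 0.01 * z)      # decreasing information-stream response
h = lambda d: d / (100.0 + d)              # michaelis-menten-type infection pressure

E = [10.0] * l                             # seed exposures
pos = [0.0] * l                            # daily positive tests
for n in range(l, 2000):
    D = sum(E[-l:])                        # carriers exposed within the last l days
    E.append(c * g(pos[-1]) * h(D))        # yesterday's positives drive behaviour
    pos.append(q * E[n - l] + p * D)

# at a stable fixed point, daily positives are constant,
# so cumulative positives grow linearly in time
print(round(pos[-1], 3), round(pos[-1] - pos[-2], 8))
```

the delayed negative feedback through g produces damped oscillations before the daily count locks onto its equilibrium rate.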
under reasonable assumptions on g and h (for example, h being of michaelis-menten type and g as in ( )) it is easily seen from ( ) that e* is a decreasing function both of p and of q, since the right-hand side of ( ) is monotone increasing and g is monotone decreasing. under these assumptions on h and g, the equilibrium number of positive tests will grow (sublinearly) with p and q, as is to be expected. a similar analysis, with the same conclusions, can be performed in the case when the information stream is a weighted average r̄(n, m) of the rates from a number m of days, i.e. if r̄(n, m) = Σ_k w_k r(n − k) with the weights w_k summing to 1; the value of the steady rate is independent of m and of the weights w_k, but these of course influence the stability of the fixed point.

this model does not add much to our understanding from the previous model, and the purpose of this subsection is simply to show that it subtracts nothing either. we need to introduce three new parameters: α, β, and k. α and β are both in (0, 1): α measures the proportion of exposures leading to an asymptomatic state (realistically this seems to be about . − . ) and β measures how infectious an asymptomatic individual is relative to an infected one. k > l is the average duration of an asymptomatic disease. we need only to modify both ( ) and ( ), as now the testing also finds some of the asymptomatics who are beyond the latent period. now j(n) = (1 − α) e(n − l) and

e(n + 1) = c g(r(n)) h(i_eff(n)),

where i_eff(n), the number of people available for effective infection on day n, now also counts the asymptomatics beyond the latent period, weighted by β. the argument from now on is as before. for example, if indeed α ≈ . and, as in the uk, q ≈ . , with r* in the uk being approximately , we have that e* ≈ , i.e. the number of people exposed to the virus each day is about times the number of new cases.

we found the linear growth rate of the number of covid-19 positive tests puzzling, and have provided a simple framework in which such a dynamic can be expected.
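the comparative statics claimed here can be checked numerically by solving the reconstructed fixed-point condition e* = c g((q + pl)e*) h(l e*) for different testing rates; the functional forms and all constants below are illustrative assumptions of mine.

```python
# bisection on the reconstructed fixed-point condition; for these
# parameters the right-hand side crosses E exactly once
def fixed_point(p, q, l=7, c=10.0, a=20.0, b=0.01, K=100.0):
    g = lambda z: a / (1.0 + b * z)            # information-stream response
    h = lambda d: d / (K + d)                  # michaelis-menten infection pressure
    f = lambda E: c * g((q + p * l) * E) * h(l * E) - E
    lo, hi = 1e-6, 1e6                         # f(lo) > 0 and f(hi) < 0 here
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# E* falls as either testing rate rises, as argued in the text
print(round(fixed_point(0.002, 0.4), 2),
      round(fixed_point(0.002, 0.8), 2),
      round(fixed_point(0.02, 0.4), 2))
```

raising q (symptomatic testing) or p (population testing) shifts the argument of the decreasing function g upwards and so pulls the equilibrium number of daily exposures down.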
in other words, our elementary analysis is in the framework of peircean abduction, for a good review of which see psillos [ ]. the linear rate is "democratically" determined by the behaviour of the individuals as well as by the rate of testing. we also presented reasons why our assumptions are sensible. we hope the present work is a contribution to the effort to "come to grips" with the pandemic, albeit in a very rough-and-ready and partial fashion; this rough-and-ready way still allows us, if we have access to additional information, such as that available for the uk from the kcl and zoe site [ ], to estimate the number of asymptomatics, exposed, and effective spreaders. as a last remark, note that in the proposed model, government strategy, expressed in the parameters p and q that determine the published numbers of new cases, directly influences individual behaviour, a feedback loop that does not seem to have been sufficiently discussed. this feedback has to be understood thoroughly in order to craft more effective public health policy.

an exit strategy from the covid-19 lockdown based on risk-sensitive resource allocation
the incubation period of coronavirus disease (covid-19) from publicly reported confirmed cases: estimation and application
clinical presentation and virological assessment of hospitalized cases of coronavirus disease in a travel-associated transmission cluster
some epidemiological models with linear incidence
an explorer upon untrodden ground: peirce on abduction

key: cord- -x sv ieu authors: gollier, christian title: pandemic economics: optimal dynamic confinement under uncertainty and learning date: - - journal: geneva risk insur rev doi: .
/s - - - sha: doc_id: cord_uid: x sv ieu

most integrated models of the covid-19 pandemic have been developed under the assumption that the policy-sensitive reproduction number is certain. the decision to exit from the lockdown has been made in most countries without knowing the reproduction number that would prevail after the deconfinement. in this paper, i explore the role of uncertainty and learning on the optimal dynamic lockdown policy. i limit the analysis to suppression strategies in which the sir dynamics can be approximated by an exponential infection decay. in the absence of uncertainty, the optimal confinement policy is to impose a constant rate of lockdown until the suppression of the virus in the population. i show that introducing uncertainty about the reproduction number of deconfined people reduces the optimal initial rate of confinement.

academic economists have recently spent a huge amount of energy to better understand the science of pandemic dynamics in the face of the emergence of covid-19. economists are contributing to the analysis of the covid-19 crisis by integrating economic dimensions into the models, such as the economic cost of social distancing and the statistical value of lives lost. these are key elements necessary for public and private decision-makers interested in shaping strategies and policies that minimize the welfare cost of the crisis. my preferred reading list on this issue as i write this paper is composed of papers by acemoglu and chernozhukov ( ), alvarez et al. ( ), brotherhood et al. ( ), favero et al. ( ), fischer ( ), greenstone and nigam (may ), miclo et al. ( ), and pindyck ( ). this investment by the profession is impressive and highly policy-relevant. it raised critical debates about, for example, when and how much to deconfine people, who should remain confined longer, the value of testing and tracing, and whether the individual freedom of movement should be limited.
one of the most striking features of the crisis is the deep uncertainty that surrounded most parameters of the model at the initial stage of the pandemic. to illustrate, here is a short list of the sources of covid-19 uncertainties: the mortality rate, the rate of asymptomatic sick people, the rate of prevalence, the duration of immunity, the impact of various policies (lockdown, social distancing, compulsory masks, …) on the reproduction numbers, the proportion of people who could telework efficiently, and the possibility of cross-immunization from similar viruses. still, all models that have been built over such a short period of time by economists assumed no parameter uncertainty, and i am not an exception (gollier ). this is amazing. large discrepancies between the predictions of these models and their associated "optimal" policies do not illustrate deep disagreements about the dynamics of the pandemic, but rather deep uncertainties about the true values of its parameters. this parameter uncertainty should be recognized and integrated in the modeling. economists are well aware that uncertainty is typically a key component in explaining observed behaviors and in shaping efficient policies. precautionary savings, the option value to wait before investing, risk premia on financial markets, insurance demand, risk-sharing and solidarity mechanisms, and preventive efforts are obvious examples demonstrating that risk and uncertainty are at the heart of the functioning of our society. but in the cases of climate change and covid-19, we most often assume no uncertainty when making policy recommendations, in spite of the fact that uncertainty is everywhere in these contexts. i see this fact as an impressive failure of our profession to be useful in making the world better. in this paper, i go one step towards including risk in the shaping of efficient pandemic policies.
suppose that a virus has contaminated a small fraction of the population, and that no treatment or vaccine is available. because of the high lethality of the virus, i suppose that the only feasible strategy is to "crush the (infection) curve" by imposing a partial lockdown. the intensity of the confinement can be adapted in continuous-time to the evolution of the pandemic to minimize the total cost of the confinement. following pollinger ( ) , i show that in the absence of uncertainty, the optimal intensity of the lockdown should be constant over time until the eradication of the virus in the population. the optimal confinement intensity is the best compromise between the short-term cost of increasing the confinement and the long-term benefit of reducing the duration of the confinement. confining people modifies the reproduction number. under the standard sir pandemic model (kermack and mckendrick ) , there is a quadratic relation between the instantaneous intensity of the confinement and the instantaneous reproduction number. consider the situation prevailing in the western world in april , after a partial lockdown was imposed. in this context, suppose that the reproduction number under full lockdown is known, but the reproduction number under full deconfinement is uncertain. this uncertainty will evaporate within a few weeks by observing the propagation of the virus under the partial lockdown. how should this uncertainty with learning affect the initial intensity of the lockdown? surprisingly, i show that it tends to reduce it. to obtain this result, i assume that the representative agent is risk-neutral. however, risk plays a role in this model because of two non-linear interactions: the quadratic relation between the cost of confinement and the instantaneous reproduction number, and the hyperbolic relation between the reproduction number and the duration of the pandemic. 
this double non-linearity makes the analysis quite complex, and i have been able to prove the main result only in the case of small risk. the calibration exercise suggests that my result holds for large risks too. i use a simplified version of the sir model that has been introduced by pollinger ( ). it is assumed in the sir model that the rate of change in new infections is equal to the sum of the rates of change in the numbers of infected and susceptible people in the population. when crushing the curve, the rate of change in the number of susceptible people remains almost constant at zero. for example, between early april and mid-july , the number of susceptible people in france was officially estimated to have been reduced by , persons, for a population of million people. taking unaccounted infections into account, the number of susceptible persons was reduced by just a few percent. during the same period, the number of infectious people in france went down by a factor larger than . thus, when crushing the curve, the dynamics of the pandemic is almost entirely driven by the rate of change in the number of infectious people. in this paper, i assume that it is entirely driven by changes in the prevalence rate. this approximation is exact when the initial prevalence rate tends to zero, assuming a reproduction number r less than unity. my results hold only under this approximation. an important unsolved question is related to the impact of uncertainty on the initial prevention effort when the initial rate of prevalence is large, or when the objective is herd immunity (r > 1). there is a long tradition in decision theory and finance on optimal choice under uncertainty and learning to which this paper is related. it is closest to the literature on the real option value to wait introduced by mcdonald and siegel ( ) and popularized by dixit and pindyck ( ).
an important message from this literature is that risk-neutral agents could optimally reduce their initial effort towards a long-term goal in order to obtain additional information about the welfare impact of this effort. i obtain a similar result in this pandemic model. as in all real option value models, there is a cost and a benefit to reducing the initial lockdown. the benefit is the reduced immediate economic, social and psychological costs associated with confining people. the cost of reducing the initial lockdown is that it will increase the uncertain duration of the lockdown necessary to eradicate the virus, or increase the intensity of the lockdown in the future. the uncertainty surrounding the reproduction number affects this expected cost because of the intricate non-linearities in the duration of the pandemic and in the sensitivity of the optimal future lockdown to new information. it happens that the uncertainty reduces the expected cost of reducing the initial intensity of the lockdown, so that it is optimal to initially confine people less intensively.

my model is based on the classical sir model developed by kermack and mckendrick ( ) to describe the dynamics of a pandemic. each person is either susceptible, infected or recovered, i.e., the health status of a person belongs to {s, i, r}. this implies that s_t + i_t + r_t = n at all dates t ≥ 0. a susceptible person can be infected by meeting an infected person. following the key assumption of all sir models, this number of new infections is assumed to be proportional to the product of the numbers of infected and susceptible persons in the population, weighted by the intensity of their social interaction. with no further justification, this is quantified as follows:

ds_t/dt = −β_t s_t i_t.

i will soon describe how β_t, which measures the intensity of the risk of contagion of a susceptible person by an infected person at date t, is related to the social interactions between these two groups and to the confinement policy.
once infected, a person quits this health state at rate ν, so that the dynamics of the infection satisfies the following equations:

di_t/dt = β_t s_t i_t − ν i_t, dr_t/dt = ν i_t.

the pandemic starts at date t = 0 with i_0 infected persons and n − i_0 susceptible persons. i assume that the virus is eradicated when the number i_t of infected persons goes below i_min, in which case an aggressive tracing-and-testing strategy is implemented to eliminate the last clusters of the epidemic. because on average an infected person remains contagious for a duration 1/ν, and because the instantaneous number of susceptible persons infected by a sick person is β_t s_t, the reproduction number at date t equals r_t = β_t s_t / ν. herd immunity is obtained when the number of infected persons starts to decrease over time. from eq. ( ), this is obtained when the number of susceptible persons goes below the herd immunity threshold s* = ν/β_t, i.e., when the reproduction number goes below 1. in this paper, i focus on policies aimed at "crushing the curve", where r_t remains permanently below unity. other policies, such as the laissez-faire policy or policies aimed at "flattening the curve", consist in building herd immunity through a rapid or gradual infection of a large fraction of the population, implying a large health cost but a limited economic cost. when crushing the curve, a sufficiently strong confinement is imposed on the population to maintain the reproduction number permanently below 1, so that the virus is eradicated relatively quickly. under this family of scenarios, the number s_t of susceptible persons remains close to unity, very far from the herd immunity reached under the laissez-faire policy. this implies that the changes in i_t s_t in eq. ( ) mostly come from changes in i_t. following pollinger ( ), i therefore simplify the sir dynamics described above into a single differential equation:

di_t/dt = (β_t s̄ − ν) i_t,

where s̄ is the average number of susceptible persons during the pandemic.
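the quality of this approximation can be checked numerically; the euler step below compares the full sir system with the exponential decay di/dt = (β s̄ − ν) i under a suppression policy. all numbers are illustrative assumptions, not the paper's calibration.

```python
import math

# full SIR vs the single-equation approximation when R_t < 1
def sir_suppression(beta, nu, S0, I0, dt=0.01, T=100.0):
    S, I = S0, I0
    for _ in range(int(T / dt)):
        new_inf = beta * S * I * dt      # new infections in one Euler step
        S, I = S - new_inf, I + new_inf - nu * I * dt
    return S, I

nu, S0, I0 = 0.1, 6.7e7, 1e5
beta = 0.8 * nu / S0                     # reproduction number 0.8 < 1
S_end, I_end = sir_suppression(beta, nu, S0, I0)
I_approx = I0 * math.exp((beta * S0 - nu) * 100.0)
print(round(I_end), round(I_approx))     # close, and S barely moves
```

with the reproduction number held below one, the susceptible pool shrinks by well under a percent over the whole episode, which is exactly why the single-equation approximation is accurate.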
this approximation of the sir model is exact when the ratio of infected to susceptible is close to zero. i examine policies of social distancing and lockdown. let x denote the intensity of this policy. one can interpret x as a measure of the fraction of people that are confined. for simplicity, i assume that infected people are asymptomatic and that there is no pcr test, so that one cannot discriminate the intensity of confinement on the basis of health status. this means that x is the fraction of people, both infected and susceptible, who are confined. a free infected person has a reproduction number r_f = β_f s̄/ν. i assume that there is no herd immunity at the start of the pandemic, i.e., that r_f is larger than unity, or β_f s̄ > ν. the confinement reduces this number to r_c = β_c s̄/ν, with β_c < β_f. i assume that the full confinement of the population crushes the curve in the sense that r_c < 1, or β_c s̄ < ν. as said earlier, a crucial element of the sir model is that the speed of infection is proportional to the product of the numbers of people infected and susceptible. confining people reduces both the number of infected people and the number of susceptible persons, implying a quadratic relation between the intensity x of the confinement and the propagation of the virus in the population (acemoglu and chernozhukov ). from this observation, the pandemic parameter β_t takes the following form:

β_t = (β_c x_t + β_f (1 − x_t))(1 − x_t).

the true contagion rate β_c x_t + β_f (1 − x_t) of infected people is a weighted average of the contagion rates β_c and β_f of infected people who are respectively confined and left free to live their lives. they meet a reduced fraction 1 − x_t of the susceptible people, because the remaining fraction x_t is locked down. the quadratic nature of this relation plays a crucial role in this paper. the lockdown also has an economic cost.
i assume that the instantaneous cost of confining a fraction x of the population at date t is equal to wx, where w > 0 can be interpreted as the sum of the wage and psychological costs of confinement. abstracting from discounting, given the short duration of the pandemic when crushing the curve, the objective of the policy is to minimize the total cost of the health crisis. this yields the following value function:

v(i) = min_x ∫ w x_t dt,

where i is the current rate of prevalence of the virus in the population and the integral runs until the termination date, which corresponds to the time when the rate of prevalence of the virus attains the eradication threshold i_min. observe that i assume an objective that ignores the potential lethality of the virus. but even when the virus is lethal, policies aimed at crushing the curve typically yield economic costs that are at least one order of magnitude larger than the value of lives lost (gollier ), thereby justifying this objective of minimizing costs. pollinger ( ) derives the solution of a more general version of this dynamic problem under certainty. using standard dynamic programming techniques, problem ( ) can be rewritten in bellman form, whose first-order condition balances the marginal cost w of confinement against the marginal value of reducing the prevalence; under this notation, β_x is the derivative of β with respect to x. equation ( ) expresses the optimal intensity x*(i) of confinement as a function of the rate of prevalence of the virus. however, let us guess a constant solution x* independent of i. from eq. ( ), this would be the case if iv′(i) is a constant. in that case, the duration t of the pandemic will be such that i_min = i exp((β(x*)s̄ − ν)t), i.e., t = ln(i/i_min)/(ν(1 − r(x*))). this equation tells us that there is a hyperbolic relation between the reproduction number and the duration of the pandemic. the total cost under such a constant strategy is v(i) = w x* t = w x* ln(i/i_min)/(ν(1 − r(x*))). this implies that iv′(i) is a constant, thereby confirming the guess that it is optimal to maintain a constant intensity of lockdown until the eradication of the virus. combining eqs.
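the hyperbolic relation between the reproduction number and the duration is easy to tabulate; the function below implements the reconstructed formula t = ln(i/i_min)/(ν(1 − r)), with illustrative values of my own.

```python
import math

def eradication_time(R_e, nu, I0, I_min):
    """Time for prevalence to fall from I0 to I_min when R_e < 1,
    under I(t) = I0 * exp(nu*(R_e - 1)*t)."""
    assert R_e < 1.0
    return math.log(I0 / I_min) / (nu * (1.0 - R_e))

nu, I0, I_min = 0.1, 1e5, 100.0
for R_e in (0.5, 0.8, 0.9):              # duration blows up as R_e -> 1
    print(R_e, round(eradication_time(R_e, nu, I0, I_min), 1))
```

halving the distance to one (from 1 − r = 0.1 to 0.05, say) doubles the duration, which is the hyperbolic sensitivity that drives the cost-of-lockdown trade-off.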
( ) and ( ) yields the following optimality condition for x*: 1 − r(x*) + x* r′(x*) = 0, where r(x) = β(x)s̄/ν. the optimal intensity of lockdown is the best compromise between the short-term benefit of easing the lockdown and the long-term cost of a longer duration of the pandemic. under the quadratic specification ( ) for β, eq. ( ) simplifies to x* = √((r_f − 1)/(r_f − r_c)). because r_c < 1 < r_f, the optimal intensity of confinement is between 0 and 1. for example, going from the laissez-faire to the % lockdown, the optimal intensity of confinement is √ ∕ = %. i summarize my results under certainty in the following proposition. its first part is a special case of pollinger ( ).

proposition. under certainty, the optimal suppression strategy is to impose a constant intensity of confinement until the virus is eradicated. in the quadratic case ( ), the optimal intensity of confinement is x* = √((r_f − 1)/(r_f − r_c)), where r_f and r_c are the reproduction numbers under respectively the laissez-faire and the full lockdown.

suppose that some parameters of the pandemic are unknown at date 0. suppose also that the only way to learn the true value of these parameters is to observe the dynamics over time. how should this parameter uncertainty affect the optimal effort to fight the virus in the population? i have not been able to solve the continuous-time version of this dynamic learning problem. i therefore simplified the problem as follows. i assume that parameter β_f is unknown. at date 0, a decision must be made for an intensity x of confinement under uncertainty about β_f. this intensity of confinement will be maintained until date 1. between dates 0 and 1, the observation of the propagation of the virus will inform us about β_f. therefore, at date 1, β_f is known and the intensity of confinement is adapted to the information. my objective is to compare the optimal x under uncertainty to the x that would be optimal when ignoring the fact that β_f is uncertain. this is thus a two-stage optimization problem that i solve by backward induction. from date 1 on, there is no more uncertainty.
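the closed-form expression for the quadratic case was lost in extraction; the reconstruction x* = √((r_f − 1)/(r_f − r_c)) can be checked against a direct grid minimisation of the total cost, which is proportional to x/(1 − r(x)). this is my own verification sketch, not the paper's code.

```python
import math

def R(x, Rc, Rf):
    """Reproduction number under confinement intensity x (quadratic spec)."""
    return (Rc * x + Rf * (1.0 - x)) * (1.0 - x)

def numeric_xstar(Rc, Rf, steps=200_000):
    """Grid-minimise x / (1 - R(x)), the total suppression cost up to a constant."""
    best_x, best_cost = None, float("inf")
    for i in range(1, steps):
        x = i / steps
        r = R(x, Rc, Rf)
        if r < 1.0:                      # suppression requires R(x) < 1
            cost = x / (1.0 - r)
            if cost < best_cost:
                best_x, best_cost = x, cost
    return best_x

for Rc, Rf in ((0.7, 2.5), (0.0, 2.0)):
    closed = math.sqrt((Rf - 1.0) / (Rf - Rc))
    print(round(closed, 4), round(numeric_xstar(Rc, Rf), 4))
```

the numeric minimiser lands on the closed-form value in both cases, so the reconstruction is at least internally consistent with the quadratic specification used above.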
as observed in the previous section, it is optimal to revise the confinement policy in line with the information about the true β_f. we know from the previous section that the optimal contingent policy x*(β_f) is constant until the eradication of the virus. the minimal total cost of this policy is denoted v(i_1, β_f). combining eqs. ( ) and ( ), it is a function of the rate of prevalence i_1 of the virus observed at date 1 and of the pandemic parameter β_f observed during the first stage of the pandemic. the first stage of the pandemic takes place under uncertainty about β_f. i assume risk neutrality, so that the objective is to minimize the expected total cost of the suppression strategy, where the prevalence inherited from the first stage, i_1 = i_0 exp(β(x_0, β_f)s̄ − ν), is also a function of the random variable β_f. the first-order condition of this first-stage problem can then be written down. in the absence of uncertainty, i.e., when β_f takes the value β̄_f with probability 1, the optimal solution is the solution of eq. ( ) in that particular case. how does the uncertainty and learning about β_f affect the optimal effort to mitigate the pandemic? because β is a convex function of the mitigation effort x, the function f is increasing in x. by jensen's inequality, eq. ( ) implies that the uncertainty affecting β_f reduces the optimal initial mitigation effort if and only if f is convex in its second argument. i have not been able to demonstrate, in general, that f is convex. i therefore limited my analysis to the case of a small risk surrounding β_f. more precisely, suppose that β_f is distributed as β̄_f + hε̃, where β̄_f is a known constant, ε̃ is a zero-mean random variable and h is an uncertainty-intensity parameter. i examine the sensitivity of the optimal confinement x* as a function of the intensity h in the neighborhood of h = 0. in the "appendix", i demonstrate that f is locally convex in its second argument, i.e., that x*(h) is decreasing in h in the neighborhood of h = 0.
more precisely, i show that x*′(0) = 0 and x*″(0) < 0. this yields the following main result of the paper. proposition: introducing a small risk about the transmission rate f reduces the optimal initial intensity of confinement. proof: see "appendix". i used a very specific strategy to explore this problem. ideally, one should start from an uncertain f to which some rothschild-stiglitz increase in risk is imposed; i limit the analysis to the special case in which the initial f is certain and only a small risk is added. i do not characterize the optimal reaction of a social planner faced with an increase in risk in the reproduction number of the virus. for example, i cannot tell whether the intensity of the confinement should be increased if we learn that the virus underwent some mutation that changed the reproduction number in an uncertain way. my result above only suggests that it could reduce the initial confinement effort, assuming that one pursues an eradication strategy. we should also address situations in which new social distancing measures (facial masks, ventilation of closed public spaces, …) have an uncertain impact on the reproduction number. in this generalized framework, the proposition only suggests that these sources of uncertainty could reduce the optimal instantaneous mitigation effort. in this section, i quantify the negative impact of uncertainty on the optimal confinement in the learning stage 1. i solve numerically the optimality condition ( ) in the quadratic context; in that case the equation takes the following form, and it yields the following solution, where r_f = β_f s∕γ and r_c = β_c s∕γ are the reproduction numbers under the laissez-faire and the total lockdown, respectively. i first describe a simulation in the spirit of the covid- crisis. there has been much debate about the reproduction number under the laissez-faire policy. ferguson et al. ( ) assumed that it was between and . at the beginning of the pandemic.
however, i focus in this paper on a post-lockdown situation in which people have learned the benefits of washing hands, wearing masks, and basic social distancing. therefore, the expected reproduction number under the laissez-faire in this new situation is probably smaller than . i assume an expected value of er_f = . . for france, santé publique france has estimated the reproduction number at different stages of the pandemic: it was estimated at . at the end of the strong confinement period in may. because the confinement was partial, this observation is compatible with an r_c equal to . . in fig. , i describe the optimal intensity x* in stage 1 as a function of the intensity h of the uncertainty surrounding r_f, with r_f = . + hε and eε = 0. more specifically, i consider a binary distribution ε ∼ (− , ; ∕( − ), − ). to keep r_f above 1 with probability 1, i consider risk intensities h between and . . under certainty (r_f = . with certainty, i.e., h = 0), the optimal intensity of confinement is a constant √ . = . %. suppose alternatively that r_f is either or with equal probabilities; in that case, the optimal confinement goes down to . %. if our beliefs about the reproduction number r_f assign probability . to one value and probability . to another, then the optimal initial confinement goes down to . %. in fig. , i describe the percentage reduction in the optimal initial confinement for different r_c and r_f ∼ ( , ∕ ; r_f − , ∕ ). the impact of uncertainty on the optimal confinement is largest when the reproduction numbers in the pre- and post-confinement regimes are close to unity. suppose for example that r_c = . and r_f = . . in this context of certainty, the optimal confinement is . %. if r_f is distributed as ( , / ; . , / ), the optimal initial confinement goes down to . %, a % reduction in the initial mitigation effort.
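the two-stage comparison described above can be sketched numerically: an intensity x0 is chosen before the transmission parameter is known, held during a learning stage, then revised optimally once the true reproduction number r_f is observed. the quadratic interpolation of the reproduction number, the linear flow cost, and every number below are illustrative assumptions; the paper's own calibration is elided in this copy.

```python
import numpy as np

# Two-stage sketch: choose x0 under uncertainty about R_f, learn R_f during a
# stage of length tau, then switch to the optimal constant confinement.
# All functional forms and values are illustrative assumptions.
gamma, R_c, tau = 1.0, 0.5, 1.0
log_i0_over_eps = np.log(1e3)   # log(initial prevalence / eradication threshold)

def R_e(x, R_f):
    # Assumed quadratic interpolation between R_f (x = 0) and R_c (x = 1).
    return R_f - (R_f - R_c) * (2 * x - x**2)

def unit_cost(R_f):
    # Stage-2 value: cost per unit of log-prevalence of the optimal
    # constant confinement once R_f is known.
    xs = np.linspace(0.0, 1.0, 100_001)
    denom = gamma * (1.0 - R_e(xs, R_f))
    cost = np.where(denom > 1e-9, xs / np.maximum(denom, 1e-9), 1e12)
    return cost.min()

def best_x0(draws, probs):
    # Stage-1 choice: minimize expected confinement cost + continuation cost.
    xs = np.linspace(0.0, 1.0, 200_001)
    total = xs * tau
    for R_f, p in zip(draws, probs):
        # Prevalence drifts at rate gamma * (R_e - 1) during the learning stage.
        log_i1 = log_i0_over_eps + gamma * (R_e(xs, R_f) - 1.0) * tau
        total = total + p * unit_cost(R_f) * log_i1
    return xs[np.argmin(total)]

x_certain = best_x0([2.0], [1.0])           # R_f = 2 known for sure
x_risky = best_x0([1.5, 2.5], [0.5, 0.5])   # zero-mean binary risk around 2
print(x_certain, x_risky)
```

with these assumed forms the risky stage-1 confinement comes out slightly below the certainty one (the gap only shows up in the fourth decimal here), in line with the sign of the proposition; the magnitude is entirely an artifact of this toy calibration, and the paper's own figures report larger reductions under its calibration.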
the uncertainty surrounding the reproduction number when reducing the strength of the lockdown is an argument in favor of lowering the intensity of this lockdown in the learning phase of the pandemic. this rather surprising result is the outcome of two non-linearities of the model. first, the duration of the pandemic is a hyperbolic function of the reproduction number. second, the reproduction number is a quadratic function of the cost of confinement. these two non-linearities explain why one should be sensitive to uncertainty when shaping the confinement policy, but i confess that these observations do not explain why this uncertainty should reduce the optimal confinement at the first stage of the pandemic. more work should be done to explain this result.

fig. : optimal confinement x* in stage 1 as a function of the intensity h of the uncertainty. i assume that r_c = . and r_f = . + hε, with ε ∼ (− , ; ∕( − ), − ).

fig. : percentage reduction in the optimal confinement x* in stage 1 due to uncertainty, for different values of (r_c, r_f). i assume that r_f is distributed as ( , ∕ ; r_f − , ∕ ).

this research opens a new agenda of research that i am glad to share with the readers of this paper. for example, shame on me, i assume here risk neutrality, in spite of the large size of the risk and its correlation with aggregate consumption. could there be a precautionary motive for a larger initial intensity of confinement? no doubt my result should be refined in that direction. also, i limited the analysis to suppression policies. this restriction was necessary to simplify the dynamic equations of the generic sir model, so that the assumption of an almost constant number of susceptible people in the population is a reasonable approximation. this excludes the possibility of comparing the optimal solution within this family of policies to other plausible policies, in particular policies aimed at attaining a high rate of herd immunity.
introducing uncertainty in the generic sir model and measuring its impact on the optimal policy is another promising and useful road for research. my to-do list also includes the exploration of other sources of uncertainty, such as not knowing the rate of prevalence, the fraction of the population already immunized, or the time of arrival of a vaccine. finally, because the value of lives lost associated with most suppression strategies is typically one or two orders of magnitude smaller than the direct economic cost of the lockdown, i assumed that the objective of the social planner is to minimize the economic cost incurred to eradicate the virus in the population. it would be useful, as in pollinger ( ), to incorporate the value of lives lost in the objective function.

a zero-mean risk surrounding f reduces the optimal confinement at stage 1. this concludes the proof of the proposition. ◻

references:
- a multi-risk sir model with optimally targeted lockdown
- a simple planning problem for covid- lockdown
- an economic model of covid- epidemic: the importance of testing and age-specific policies
- restarting the economy while saving lives under covid-
- impact of non-pharmaceutical interventions (npis) to reduce covid- mortality and healthcare demand
- external costs and benefits of policies to address covid- , mimeo
- gollier, c. cost-benefit analysis of age-specific deconfinement strategies
- does social distancing matter?
- a contribution to the mathematical theory of epidemics
- the value of waiting to invest
- optimal epidemic suppression under an icu constraint, mimeo, tse
- pindyck, r.s. covid- and the welfare effects of reducing contagion
- optimal tracing and social distancing policies to suppress covid- , mimeo
- estimating the burden of sars-cov- in france

acknowledgements: i thank stefan pollinger and daniel spiro for helpful comments.
the research leading to these results has received support from the anr grants covid-metrics and anr- -eure- (investissements d'avenir program).

in the quadratic case ( ), we have the coefficients (a, b, c), which are functions of f. remember that we assume β_f s > γ and β_c s < γ; this pins down the signs of the coefficients (a, b, c), and also implies that β(x)s − γ alternates in sign, implying b² − ac > 0. condition ( ) is now rewritten as follows. as stated in the main part of the paper, let me parametrize the uncertainty by assuming that f is distributed as f̄ + hε, where f̄ is a known constant, ε is a zero-mean random variable and h is a measure of the uncertainty. the optimal stage-1 confinement is a function of h, denoted x*(h). i examine the properties of this function in the neighborhood of h = 0. when h equals zero, the above equation is solved by the certainty solution. to study the comparative statics, i fully differentiate the optimality condition ( ) with respect to h, taking account of the fact that (a, b, c) are functions of h. when h equals zero, coefficients a, b, c and d are constant. because eε equals zero, the above equation has a single solution when evaluated at h = 0: at the margin, introducing a zero-mean risk for the reproduction number has no effect on the optimal mitigation effort in stage 1, i.e., x*′(0) = 0.

let me now turn to x*″ = ∂²x*∕∂h², evaluated at h = 0. fully differentiating eq. ( ) twice with respect to h, and using property ( ), yields an expression in σ² = eε², the variance of ε. we have that a′ = s , b′ = s , d′ = s(b − ) and d″ = s .

the geneva risk and insurance review

this allows me to rewrite the above equation as follows. because b² − ac is positive, we obtain that x*″(0) is negative if and only if the following inequality holds or, equivalently, using the notation below: after tedious manipulations, the above inequality is true if and only if h(v, z) is positive in the relevant domain of (v, z), i.e., v ∈ [− , ] and z ≥ .
notice that h is clearly non-negative at the boundaries of the relevant domain. this implies that x*″(0) is negative, or that x*(h) is smaller than x*(0) in the neighborhood of h = 0; in other words, any small zero-mean risk surrounding f reduces the optimal initial confinement.

key: cord- -o nscint authors: roy, sayak; khalse, maneesha title: epidemiological determinants of covid- -related patient outcomes in different countries and plan of action: a retrospective analysis date: - - journal: cureus doi: . /cureus. sha: doc_id: cord_uid: o nscint current developments around the pandemic of novel coronavirus disease (covid- ) present a significant healthcare resource burden threatening to overwhelm the available nationwide healthcare infrastructure. it is essential, especially for resource-limited nations, to strategize a coordinated response to handle this crisis effectively and to prepare for the upcoming emergence of the calamity caused by this yet-to-be-known disease entity. relevant epidemiological data were retrieved from currently available online reports related to covid- patients. the correlation coefficient was calculated by plotting the dependent variables - the number of covid- cases and the number of deaths due to covid- - on the y-axis and the independent variables - critical-care beds per capita, the median age of the population of the country, the number of covid- tests per million population, population density (persons per square km), urban population percentage, and gross domestic product (gdp) expense on health care - on the x-axis. after analyzing the data, both the fatality rate and the total number of covid- cases were found to have an inverse association with population density, with the variable - the number of cases of covid- - achieving statistical significance (p-value . ). the negative correlation between critical care beds and the fatality rate is well-justified, as intensive care unit (icu) beds and ventilators are the critical elements in the management of complicated cases.
there was also a significant positive correlation between a country's gdp expense on healthcare and the number of covid- cases registered (p-value . ), although that did not affect mortality (p-value . ). this analysis provides an overview of various epidemiological determinants possibly contributing to the variation in patient outcomes across regions and helps improve our understanding so as to develop a plan of action and effective control measures in the future. a significant epidemic focus of the new coronavirus disease was identified in december in wuhan, china, which then rapidly progressed across countries in europe, north america, asia, and the middle east, affecting more than half a million people [ ]. later, this disease outbreak, caused by a novel coronavirus, sars-cov- , of unknown origin, was declared a pandemic by the world health organization (who) on march , . europe has become the new major epicenter, with the total numbers of cases and deaths reported as , and , , respectively [ ]. this is the third coronavirus outbreak with a novel strain in the last two decades, and it presents an ensuing healthcare resource burden that threatens to overwhelm available healthcare resources [ ]. as a result, the challenges presented are unique, considering the disparate resource settings across countries, especially when applying strategies from high-technology intensive care settings to less developed areas. the global burden of covid- , an infectious agent with high transmissibility and a moderate fatality rate, is likely to fall hardest on the vulnerable groups in low- and middle-income countries (lmics). therefore, systematic strengthening of their capacity on the technical and financial fronts is warranted to respond to this challenging situation successfully.
public health measures, such as surveillance, exhaustive contact tracing, social distancing, travel restrictions, public education on hand hygiene, ensuring flu vaccinations for the frail and immunocompromised, and temporarily suspending non-essential surgical procedures and services, will play their part in delaying the spread of infection and dispersing pressure on hospitals [ ]. sars-cov- is likely to wreak havoc on the world economy, with an apprehended shrinkage of the global economy in by % [ ]. in this brief analysis, we tried to provide an overview of various socioeconomic determinants possibly contributing to the variation in covid- -related outcomes across regions and then to make a plan of action based on the evidence. given the unpredictable course of this global crisis, infection and mortality rates vary widely from one country to another. apart from the baseline demographic features of patients, socioeconomic factors, such as income group, population density, access to health care, and quality of health system resources, may account for the observed variations in mortality rates. different testing strategies, reporting systems, and data availability also play an essential part in these highly variable statistics, with the number of unreported cases believed to be quite considerable in some countries. the total number of cases worldwide is , , and the disease has caused , ( . %) deaths from december , , to march , [ ]. however, the highest mortality was found in italy, spain, the united kingdom, france, and iran. the overall case fatality rate in italy as of march ( . %) is substantially higher than in china ( . %). the fatality rate of italy has since increased to . % as of april ( , cases and , deaths), closely followed by the united kingdom with a fatality rate of . % ( , covid- cases and , deaths) [ ] [ ]. this illustrates how the fatality rate changes from day to day during a pandemic.
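the case fatality rates quoted above are simply deaths divided by confirmed cases, expressed as a percentage; because both counts grow at different speeds, the ratio drifts from day to day. the figures in the sketch below are made-up placeholders, since the actual counts are elided in this copy.

```python
# Case fatality rate = 100 * deaths / confirmed cases.
# The counts below are invented placeholders, not the paper's data.
def case_fatality_rate(deaths, confirmed_cases):
    return 100.0 * deaths / confirmed_cases

# The same country on two different days can report quite different CFRs:
print(case_fatality_rate(1_000, 50_000))   # 2.0 (%)
print(case_fatality_rate(2_500, 80_000))   # 3.125 (%)
```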
moreover, the infections occurred predominantly ( %) in people to years old. the populations most likely to require mechanical ventilation are the elderly and people with pre-existing comorbidities (in particular, cardiovascular disease and hypertension, followed by diabetes mellitus), with a predicted mortality of around % to % [ ]. despite the known association between per capita expenditure on health care and reduced mortality in previous influenza pandemics ( ), the findings in the recent outbreak reveal no significant correlation between health care spending and covid- -related mortality [ ]. in particular, the mortality rates are highly variable across high-income countries, as noticed in europe and the usa, despite their health spending. as per world bank statistics, the us spent . % of its gross domestic product (gdp) on health care, while other countries, including india, lag behind, primarily due to their income category [ ]. surprisingly, this did not affect the total number of covid- cases [ ]. a consistent observation was reported in a descriptive analysis study (in preprint): there was no significant correlation between the gdp growth of a country or the number of treating physicians per patient population and any covid- -related outcome, but there was a negative correlation between covid- -related deaths and the number of beds available per population. additionally, there was an inverse correlation between the number of tests conducted per million population and the rates of active infections, new cases, and new deaths due to covid- [ ]. herein, we analyze the impact of various socioeconomic and demographic features of a few selected countries, namely, the united states of america (usa), germany, italy, france, south korea, spain, japan, the united kingdom (uk), china, and india, on covid- -related cases and fatality rates.
we retrieved data between january and april , , related to population and population density, the median age of the population of each country, urban population, the number of covid- tests employed per million population, the gdp expense of each country on health, and critical care beds available per capita, from the various sources stated next to each of these variables in table , along with the total number of covid- cases and the case fatality rate (as per the who situation report [ ]). we then applied pearson's correlation coefficient to assess the correlation of these demographic features with covid- cases and deaths due to covid- , using the online calculator available at https://www.socscistatistics.com/tests/pearson/default .aspx [ ]. the correlations between population characteristics and socioeconomic variables in the various countries discussed earlier, with respect to outcomes in terms of total positive cases and fatality rate due to covid- , are summarized in table . there is a strong positive correlation between a country's gdp expense on health and the number of cases detected; the reason for this is the affordability of testing a higher number of patients in high-income countries. this expense, on the other hand, did not show any significant association with deaths due to covid- (p-value . ). the results showed that both the case fatality rate and the number of covid- cases are negatively correlated with population density, which seems quite strange. however, on further analysis, taking only the usa and the european countries on the x-axis and case fatality from these countries on the y-axis, the same pearson's correlation coefficient r-value becomes . (p-value . ), which is now positive and not statistically significant after adjustment. the same calculation using the population density of these countries on the x-axis and the number of cases on the y-axis gives us a pearson's r-value of .
(p-value . ), which again changes from a negative correlation to a positive correlation after adjustment. the negative correlation between critical care beds and the fatality rate is justified, as intensive care unit (icu) beds and ventilators are critical elements in the management of complicated cases. this importance of ventilators has previously been recognized in a study which states that the provision of mechanical ventilators to developing countries has the unique potential to help make a dramatic improvement in the care of the world's most vulnerable patients [ ]. mass testing of all suspected cases in germany and south korea, as laid down by who, could also be one of the reasons why these countries managed to reduce the number of new infections, since it allowed them to identify possible outbreaks as early as possible; we can also observe that their death-rate curves bent quite effectively from the early days of the pandemic [ ] [ ]. this approach proved to be a successful strategy for achieving a low fatality rate in both countries, as they used a test-admit/isolate-treat protocol. the national health service (nhs) of the uk used a contain-delay-mitigate-research strategy at the beginning, and that turned out to be futile, with a possible association with a high case fatality rate [ ]. the number of tests done per day differed significantly among countries from the beginning, as shown in figure and in the day-by-day counts of covid- tests per , people in table , which clearly show the aggressive testing done to pick up cases early in south korea, while italy, france, and the uk lagged far behind; now, these three countries have a huge case fatality burden. (adapted from: our world in data [ ].) the covid- epidemic has placed a significant burden on the health care system. this crisis has dramatically affected the delivery of critical care due to a lack of resources.
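the country-level correlation analysis described above can be sketched in a few lines with scipy, which returns the same pair of numbers (r-value and two-sided p-value) as the online calculator the authors used. the ten data points below are invented placeholders, not the paper's data, which is elided in this copy.

```python
from scipy.stats import pearsonr

# Hypothetical re-creation of the analysis: correlate a country-level
# predictor (tests per million) with an outcome (case fatality rate, %).
# All values are invented placeholders, not the paper's data.
tests_per_million = [6100, 15700, 10000, 5100, 11700, 19900, 860, 7100, 2200, 150]
fatality_pct = [2.9, 1.6, 12.7, 10.3, 2.2, 10.2, 3.1, 13.2, 4.0, 3.3]

r, p = pearsonr(tests_per_million, fatality_pct)
print(f"pearson r = {r:.3f}, two-sided p-value = {p:.3f}")
```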
this pandemic has exposed the skeleton of healthcare systems around the world, as well as the lack of preparedness of most countries to tackle a major crisis like this. the present search for the factors contributing to covid- -related outcomes is unlikely to have been exhaustive. however, these findings have important implications for public health actions, as much of the world will witness a massive community epidemic of covid- over the coming weeks and months. we try to set out a plan of action based on a few study reports that tried to address the gaps in the nhs and the healthcare system as such. in the post-peak period, the public will understandably wish to return to some semblance of normal life. deep economic damage will be a powerful motivation to lift restrictions on personal freedoms, but doing so too early will inevitably lead to a second peak; the government must make the public aware of this phase. to conclude, the present analysis is just the beginning of the development of a thorough understanding of the impact of various epidemiological factors on outcomes of patients infected with coronavirus disease. it will help resource-limited regions to strategize a coordinated response for effectively managing and preparing for the emergence of this yet-to-be-known disease entity. given that this covid- pandemic will, for now, have long-term implications for all members of society, a collaborative effort among society, government, public health experts, and healthcare professionals will be needed to ensure efficient recovery from this pandemic disaster as early as possible. human subjects: all authors have confirmed that this study did not involve human participants or tissue. animal subjects: all authors have confirmed that this study did not involve animal subjects or tissue.
conflicts of interest: in compliance with the icmje uniform disclosure form, all authors declare the following: payment/services info: all authors have declared that no financial support was received from any organization for the submitted work. financial relationships: all authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work. other relationships: all authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.

references:
- covid- : towards controlling of a pandemic
- who. coronavirus disease (covid- ): situation report
- a review of sars-cov- and the ongoing clinical trials
- global economy could shrink by almost % in due to covid- pandemic: united nations
- world health organization
- case-fatality rate and characteristics of patients dying in relation to covid- in italy
- characteristics of and important lessons from the coronavirus disease (covid- ) outbreak in china: summary of a report of cases from the chinese center for disease control and prevention
- an ecological study of the determinants of differences in pandemic influenza mortality rates between countries in europe
- health spending
- frequency of testing for covid infection and the presence of higher number of available beds per country predict outcomes with the infection
- countries in the world by population
- the countries with the most critical care beds per capita
- coronavirus update (live)
- coronavirus testing criteria and numbers by country
- pearson correlation coefficient calculator
- the need for ventilators in the developing world: an opportunity to improve care and save lives
- who director-general's opening remarks at the media briefing on covid
- daily confirmed covid- deaths: are we bending the curve
- our world in data. total covid- tests per , people
- offline: covid- and the nhs - "a national scandal"
- clinical ethics recommendations for the allocation of intensive care treatments, in exceptional, resource-limited circumstances: the italian perspective during the covid- epidemic
- the extraordinary decisions facing italian doctors
- how a south korean city is changing tactics to tamp down its covid- surge
- critical preparedness, readiness and response actions for covid-