key: cord-033412-acjskz00 authors: Ruan, Yongsen; Luo, Zhida; Tang, Xiaolu; Li, Guanghao; Wen, Haijun; He, Xionglei; Lu, Xuemei; Lu, Jian; Wu, Chung-I title: On the founder effect in COVID-19 outbreaks – How many infected travelers may have started them all? date: 2020-09-24 journal: Natl Sci Rev DOI: 10.1093/nsr/nwaa246 sha: doc_id: 33412 cord_uid: acjskz00 How many incoming travelers [I(0) at time 0, equivalent to the “founders” in evolutionary genetics] infected with SARS-CoV-2 who visit or return to a region could have started the epidemic of that region? I(0) would be informative about the initiation and progression of the epidemics. To obtain I(0), we analyze the genetic divergence among viral populations of different regions. By applying the “individual-output” model of genetic drift to the SARS-CoV-2 diversities, we obtain I(0) < 10, which could have been achieved by one infected traveler in a long-distance flight. The conclusion is robust regardless of the source population, the continuation of inputs (I(t) for t > 0) or the fitness of the variants. With such a tiny trickle of human movement igniting many outbreaks, the crucial stage of repressing an epidemic in any region should, therefore, be the very first sign of local contagion when positive cases first become identifiable. The implications of the highly “portable” epidemics, including their early evolution prior to any outbreak, are explored in the companion study (Ruan et al. 2020). region, within which the bulk of human interactions happen. Relative to the within-region movement, a bordered region is lightly connected to the rest of the world. Since the epidemic in any bordered region could have been started by one single infected traveler, or by 1,000 of them, we take the population genetic approach to analyzing the divergence among viral populations in relation to the "founder effect" [4] . We shall let I t be the amount of input at time t (i.e., the number of infected people coming in an uninfected region). The crucial number is I 0 , i.e., the first batch of input. The magnitude of I t is important in the public health practice. If I t has been large with continual input lasting for weeks, then a bordered region may be able to prevent the epidemic from being exported to (or being imported from) other regions, solely by restricting human movements out of (or into) its borders. On the other hand, if the epidemic in a region could be started with I 0 < 10 (and I t>0 ~ 0), then sealing off either emigration or immigration would not be effective in stopping a pandemic. Unless the bordered regions are maintaining zero infections, the danger would be coming mainly from within their borders. Here, we aim to infer I t , in particular, in the early period of an epidemic. In this study, we use a population genetic framework [5] . Because the focus is the stochastic differentiation among viral populations, epidemiological models generally do not cover such topics. The genetic drift formulation used here also permits the calculation of the extinction probability of the invasion of the virus ( [5] , Ruan et al. 2020). Epidemiological parameters, such as the number of uninfected individuals, the effects of quarantine and the development of immunity [6, 7] , are not considered here as population differentiation takes place in the earliest stage of the invasion. During this stage, neither quarantine nor herd immunity has yet become a major factor in the outbreak. Theory -To estimate I t , a conventional method is to inspect the changes in the population size of the viruses, N t [8, 9] . Viral population size corresponds to the number of infected individuals, assuming a viral clone in each person. Because N t is only weakly dependent on I t , the conventional approach does not offer the resolution we seek for. It would be more informative to examine multiple populations for their differences in genetic polymorphism. The differences would depend strongly on I t at the very beginning of the epidemic ( Fig. 1 ). For studying population differentiation, the source population infecting the travelers needs to harbor genetic variants in non-trivial frequencies to yield informative data [10] . For example, Tang et al. [11] reported the existence of two lineages that are distinguishable by two SNPs, one being a Serine/Lysine (S/L) polymorphism. According to ref. 11, the S lineage accounts for ~30% and the L lineage accounts for ~70% among the 103 viral genomes they examined. For the ease of estimating I t 's, the variants should ideally be neutral in fitness. Indeed, since variants under selection would have a short retention time in the population, SNPs are often neutral [12] [13] [14] . While the fitness differential between the L and S variants remains unclear, our simulations show that the estimation of I t is only weakly dependent on selection (see below). The estimation of I 0 as well as I t>0 are conducted for multiple viral populations, which should ideally originate from independent samples. Among these populations, we model their differentiation as a function of I t . I t is not likely to be very large between regions (including the source) reachable only by air flight, each of which may carry at most a small number of infected passengers. As long as all extant populations are derived from the same source population, the estimates of I t are only weakly dependent on the actual genetic make-up of the source. For that reason, the source population needs not be known. The hypothesis is that the viral populations seeded by the infected travelers have experienced strong fluctuation in gene frequency. This may happen at the beginning of the epidemic when N t is small. Soon afterwards, the fluctuation in gene frequencies would be quickly dampened as N t grows. The fluctuation in gene frequencies due to the random transmissions of genes is referred to as genetic drift [14, 15] or the founder effects [4] . The standard formulation of genetic drift by the Wright-Fisher model (or the alternative model of Moran [16] ) is not applicable for tracking the viral population. Instead, we use the "individual output" model we previously proposed [5] . All models assume discrete generations. Based on the infection dynamics estimated in a recent study [17] , we assume that each discrete generation is ~4 days. If we use a longer or shorter generation time, the outcome would be similar as long as the progeny production is calibrated with the generation time. From one generation to the next, each individual produces k "descendants" (or infects k others) with the mean of E(k) and the variance of V(k). In the Wright-Fisher model, k follows the Poisson distribution and V(k) is tied to E(k) [5] . In the "individual output" model, k may follow any distribution, which is often measurable but not in any common form. E(k) dictates the population growth, N t , while V(k) determines the fluctuation in N t and in gene frequency. We will attempt to obtain E(k) and V(k) from the empirical data and, for a comparison, will also allow V(k) = E(k) to approximate the Wright-Fisher model. N t is a function of E(k), V(k) and I t . Here, we assume I t to be a constant, hence, I t = I 0 for all t's. At time T, If E(k) is not too small, E(N T ) would depend mainly on I t of the first few generations. In fact, the results are often similar whether there is constant input or not (i.e., I t = I 0 , or I t = 0 when t ≥ 1). In other words, Eq. (1') below would yield similar results as Eq. (1) With reasonable accuracy, E(k) can be obtained from the growth trajectory of N t but I 0 has to be obtained by a different means. While many of the assumptions such as exponential growth and the constancy of I t may hold for only a few generations, most of our results depend primarily on the dynamics of the first few generations. The actual trajectory of each population would also depend on V(k). Using the simpler Eq. (1'), (2). To obtain I 0 for Eqs. (1) or (1'), we have to model gene frequencies. Using the example of the S/L polymorphism, we let X T be the frequency of the L lineage at time T. If the fitness of the S and L type is the same, then where X 0 is the frequency in the source population. Eqs. (3) and (4) will need some modifications if we use Eq. (1), or if we consider the fitness difference between L and S. (see Supplement). The actual realization of X T in each population is obtained by iteration described here. We assume two types of viruses (L type and S type; [11] ). The relative fitness of L type to S type is 1+s (s = 0 represents no selection). In addition, there is a source population, in which the frequency of the L type is X 0 . At generation t, there will be I t immigrants from the source population. I 0 is the founder population size. A parameter T sets the time limit of immigration. Thus, At generation t-1, the numbers of the L type and S type are L t-1 and S t-1 respectively. Also, N t-1 = L t-1 + S t-1 and X t-1 = L t-1 / N t-1 . After one generation, there will be I t (I t = I L + I S ; I L , I S are the numbers of L, S type, respectively) immigrants from the source population. In addition, N t-1 will increase to N t. Thus, at generation t, the numbers of the L and S type are where is the progeny number of the i-th individual of either type. The distribution of will be defined in the next section. If there is selection, the number of L type will change as follows: At generation t, the population size and the frequency of L type are Based on the definition above, we simulate the stochastic changes of the viral population until it reaches the 20th generation (i.e. t = 20). Since genetic drift is negligible when N t is large (e.g. >10 5 ), we simulate the trajectory by a deterministic model when N t >10 5 . To quantify the population differentiation, we calculate the pairwise Fst values [14, 18] , defined below. With X t and Y t for a pair of populations, where ̅ = ( + )/2. If Fst = 0, X t = Y t and the two populations are identical in gene frequency. If Fst = 1, the two populations are maximally differentiated with X t = 0 and Y t = 1, or vice versa. Defining the parameter set (I 0 , T, X 0 , s) and the distribution of k As stated, the conventional Wright-Fisher Model requires k (progeny number of an individual) to follow a Poisson distribution with V(k) = E(k). Here, we assume the spread of virus to be associated with the social network, which usually follows the power law [19, 20] . Specifically, we let k follow the Zipf's law (a discrete power-law distribution; [21] ): The estimate of the basic reproduction number (R 0 ) of SARS-CoV-2 ranges from 1.4 to 6.5 [22] [23] [24] . Here, we focus on the early phase of the viral population growth by using R 0 = 6.5. The relationship between E(k) and R 0 is as follows: where τ being the serial interval, G being the generation time. And τ is estimated to be ~5 days [23, 25] and and V(k) rather than the actual distribution. By setting V(k) so large, we ensure that the I 0 estimate would be on the high side (see Discussion). The estimation of I 0 as well as I t>0 should be done on multiple viral populations that are independent samples from the same source. If they are not fully independent, then our estimates of I t would be conservative (i.e., over-estimation) since any exchange between populations should reduce the divergence. Here, we generate two sets of data as shown in Table 1 In Set II, we assign gene frequencies to 10 populations in Table 1 using the reported frequencies as a guide (GISAID (https://www.gisaid.org/); see Supplement). These 10 populations, distributed among 4 continents, are as likely to be independently derived as we could ascertain. The choice is based on three criteria: 1) the geography of the countries/regions and the distance between them; 2) the timing of the documented onset of the epidemic; 3) the abundance of DNA sequences. We consult the frequencies in samples collected before late March 2020, corresponding roughly to G 8 in Fig. 1 . Due to the rapidly changing data reporting (GISAID [26] ), the frequency profile of Set II is plausibly realistic as reported in mid-April (see Supplement). Readers with access to the more up-to-date data can compare the new observations with the theoretical results to improve the estimation of I t . Comparisons between simulations and data -In Fig. 2 , the Fst distributions based on the datasets of Table 1 smaller range in Fig. 5 than that in Fig. 4 . In other words, with selection, the estimated I 0 should be even smaller than indicated above. The simulation results are based on the distribution of k that follows the power law (see Eqs. 6 and 7) with V(k) ~ 10E(k). Such a large V(k) means that the genetic drift would be very large, requiring large I t 's to reduce the drift. It is hence interesting that, even under such stringent conditions, I 0 is still < 10. In the Supplement, we show that the estimated value of I 0 would be substantially lower if we use the Poisson distribution of k, associated with the conventional Wright-Fisher model. With V(k) = E(K), I 0 would be 2-4. Hence, the conclusion presented in this section is robust. In the theory of genetic drift [5] , even a hundred infected travelers from a source viral population would give rise to a fairly uniform level of genetic polymorphism among bordered regions. In contrast, the reported data indicate substantial divergence among countries (GISAID (https://www.gisaid.org/); see Supplement). Dataset II of Table 1 is realistic in this respect. The divergent polymorphisms across countries depend mainly on a critical parameter -the size of the first cohort arriving in a country, I 0 , which is estimated to be < 10. The number may in fact be smaller than it seems since a long distance flight carrying one single infected but symptomless patient could infect this many people, all of whom being without symptom upon arrival [27] . On the robustness of the estimation The model also assumes that each population is an independent sample of the source population. Since all populations are likely to exchange some individuals due to travelling, the actual divergence among populations would be even smaller than simulated. In other words, to attain the observed level of divergence, I 0 would have to be even smaller than estimated. Considering all these variables, we believe the conclusion of I 0 < 10 to be robust. In the analysis of regional divergence, the results would depend strongly on the smaller I 0 of the two regions being compared. Hence, the assumption of the same I 0 among all regions would be a reasonable one. While our focus is on the divergence in the first few generations, we now briefly discuss the subsequent evolution after this initial critical period. The primary lineage delineation, the S/L polymorphism defined by two SNPs [11] , have many subtypes (see Supplement and Table S1 for details). For example, the western European countries including Italy, Switzerland, Germany and Belgium are predominantly of the L type with a similar abundance in the L2 subtype. In contrast, while Japan is also predominantly of the L type, it has mainly the L1 subtype. This contrast suggests that Japan may represent an independent sample from the western European samples, which have likely been spreading regionally after the initial seeding. Another example is the S1 and S2 subtypes, which differentiate between the samples from China and the west coasts of the US. These patterns suggest that, after the initial seeding, each major region or continent has been evolving along an independent path. Since the initial seeding may be extremely difficult to prevent, the onus is to suppress the regional spread. The analyses of the subtypes in Asia, Australia, and various parts of North America would offer additional details of the spread of the virus, as has been done recently [28] [29] [30] [31] . These details are beyond the scope of this study that focuses on the early stages of the viral spread. The analysis suggests that the COVID-19 epidemic in each region surveyed was likely started by a very small number of travelers [I 0 < 10]. With such a tiny trickle of human movement, it would have been very inefficient for any region to prevent infected individuals from exporting an epidemic to (or importing it from) other places. For that reason, the crucial stage of repressing an epidemic in any region should be the very first sign of local contagion. Finally, due to the "portability" of COVID-19, each epidemic, including the first one on record, could have easily been imported. Where then did all these epidemics begin? While the interest in the "origin" is intense, we suggest the question be broadened as "the origin and early evolution" of SARS-CoV-2. The latter implies a process whereas the former seems to mean a single time point. The process of early evolution may have stretched over different regions in a long time-span and involved multiple host species. Like many other evolutionary questions on origin, we suggest the question be phrased as the early evolution of SARS-CoV-2, rather than be about the "origin". The former implies a process whereas the latter seems to mean a single time point. This distinction is important as seen in the debates on the "origin" of dogs [32, 33] and new species in novel environments [34] . By compressing a process into a simple "origin", we may be asking a false question about, say, "the first dog" or "the first patient". Ross, macdonald, and a theory for the dynamics and control of mosquito-transmitted pathogens Population biology of emerging and re-emerging pathogens The causes and consequences of HIV evolution A New Formulation of Random Genetic Drift and Its Application to the Evolution of Cell Populations The mathematics of infectious diseases Modeling the epidemic dynamics and control of COVID-19 outbreak in China Reproduction numbers of infectious disease models Perspectives on the basic reproductive ratio Moral imperative for immediate release of 2019-nCoV sequence data On the origin and continuing evolution of SARS-CoV-2 The Average Number of Generations until Fixation of a Mutant Gene in a Finite Population The neutral theory of molecular evolution An introduction to population genetics theory Molecular evolution Random processes in genetics Virological assessment of hospitalized patients with COVID-2019 Principles of Population Genetics Origins of power-law degree distribution in the heterogeneity of human activity in social networks Human behavior and the principle of least effort The reproductive number of COVID-19 is higher compared to SARS coronavirus Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing Transmission interval estimates suggest pre-symptomatic spread of Global initiative on sharing all influenza data -from vision to reality Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2) Evolution and molecular characteristics of SARS-CoV-2 genome Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States Introductions and early spread of SARS-CoV-2 in the New York City Structural variation during dog domestication: insights from gray wolf and dhole genomes Structural variation provides novel insights into dog domestication Speciation with gene flow via cycles of isolation and migration: insights from multiple mangrove taxa We thank all those who have contributed sequences to the GISAID (Global Initiative on Sharing All Influenza Data) database (https://www.gisaid.org/). Acknowledgements and detailed information of the genome sequences we consulted for this study are given in Table S2 . The actual data fitted to our model were computationally generated (see the main text). We thank Drs. Yaping Zhang, Suhua Shi, Weiwei Zhai, Jianzhi George Zhang, Bingjie Chen, Zheng Hu and Jindong Zhao for comments on and suggestions for this manuscript. The authors declare no competing interest.