key: cord-0749316-vkgnwxzc authors: Han, Henry title: Estimate the incubation period of coronavirus 2019 (COVID-19) date: 2020-02-29 journal: nan DOI: 10.1101/2020.02.24.20027474 sha: c1ae608c7ffb926a0f50a6a34c0780983274ea74 doc_id: 749316 cord_uid: vkgnwxzc

Motivation: Wuhan pneumonia is an acute infectious disease caused by the 2019 novel coronavirus (COVID-19). It is being treated as a Class A infectious disease although it is classified as Class B under the Infectious Disease Prevention Act of China. Accurate estimation of the incubation period of the coronavirus is essential for its prevention and control. However, the exact incubation period remains unclear, though it is believed that symptoms of COVID-19 can appear as early as 2 days or as late as 14 days or more after exposure. Accurate calculation of the incubation period requires original chain-of-infection data, which may not be fully available in the Wuhan region. In this study, we aim to accurately calculate the incubation period of COVID-19 by taking advantage of well-documented and epidemiologically informative chain-of-infection data from outside the Wuhan region.

Methods: We collected officially reported COVID-19 data from 10 regions in China outside Hubei province. To calculate the incubation period accurately, we included only officially confirmed cases with a clear history of exposure and time of onset. We excluded cases without relevant epidemiological descriptions, cases who had worked or lived in Wuhan for a long time, and cases whose possible exposure time was hard to determine. We propose a Monte Carlo simulation approach to estimate the incubation period of COVID-19 and also employ nonparametric methods. In addition, we employ manifold learning and related statistical analysis to decipher the incubation relationships between different age and gender groups.
Result: The incubation period of COVID-19 did not follow general incubation distributions such as the lognormal, Weibull, and Gamma distributions. We estimated that the mean and median of its incubation period were 5.84 and 5.0 days via bootstrap and the proposed Monte Carlo simulations. We found that the incubation periods of the groups with age>=40 years and age<40 years demonstrated a statistically significant difference. The former group had a longer incubation period and a larger variance than the latter. This further suggests that different quarantine times should be applied to the two groups because of their different incubation periods. Our machine learning analysis also showed that the two groups were linearly separable. Together with the preceding statistical analysis, this suggests that age 40 can be a key age cutoff for the incubation of COVID-19. Our results further indicated that the difference in incubation between males and females was not statistically significant.

The novel coronavirus (COVID-19), which was found in Wuhan, China in December 2019, presents an acute public health threat to the whole world [1] [2]. The new virus is different from known coronaviruses such as SARS and MERS, though they share some respiratory illness symptoms such as fever, cough, and shortness of breath [2]. It is believed to have originated in animals but spreads from person to person. The spread of COVID-19 even shows that persons without any symptoms, or who test clinically negative, can still transmit it to others. As of Jan 30, 2020, there is no official vaccine or antiviral drug available to treat COVID-19 patients [3] [4]. The outbreak is forcing China and many other countries to adopt harsh protective policies. More than eight thousand infections had been reported in China and more than a dozen other countries by Jan 30, 2020. More than 15 cities, including Wuhan, have been quarantined to halt the spread of COVID-19. It is expected that millions of people will be under lockdown because of COVID-19.
WHO declared the coronavirus outbreak a global health emergency on Jan 30, 2020. Knowing the accurate incubation period of COVID-19 is essential for deciphering the dynamics of its spread. The incubation period is the time from infection to the onset of the disease. It provides the foundation for epidemiological prevention, clinical actions, and drug discovery. Different viruses have different incubation periods, which determine their different epidemiological dynamics. The incubation period of H7N9 (Human Avian Influenza A) is about 6.5 days, whereas the incubation period of SARS-CoV is typically 2 to 7 days [5] [6]. However, the exact incubation period of COVID-19 remains unclear, although WHO estimates that it is between 2 and 14 days after exposure [8]. It can be difficult to estimate the incubation period of COVID-19 from original chain-of-infection data, which may not be fully available in the Wuhan region or may lack meaningful exposure history. Furthermore, it is unknown whether the incubation time shows statistically significant differences with respect to age and gender.

All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. medRxiv preprint doi: https://doi.org/10.1101/2020.02.24.20027474

In this study, we aim to accurately estimate the incubation period of COVID-19 by taking advantage of datasets with a well-documented history of exposure. Our results show that the incubation mean and median of COVID-19 are 5.84 and 5.0 days respectively and that there is no statistically significant difference with respect to gender. However, the incubation periods of the groups with age>=40 years and age<40 years show a statistically significant difference.
Our machine learning analysis also shows that the two groups are linearly separable, with a clear boundary in knowledge discovery visualization.

We collected a dataset of 59 officially confirmed COVID-19 cases from 10 regions in China outside Hubei province, the assumed origin of the virus. The patient data were dated from Dec 29, 2019, to Feb 5, 2020. We included only officially confirmed cases with a clear history of exposure and time of onset. We excluded cases without relevant epidemiological descriptions, cases who had worked or lived in Wuhan for a long time, and cases whose possible exposure time was hard to determine. Data collected for this study included region, age, gender, exposure history, and illness onset. For cases whose incubation period lies in an interval $[t_1, t_2]$, we use the midpoint $t = \frac{t_1 + t_2}{2}$ to represent the incubation period. For example, Case no. 2 in our dataset went on a business trip to Wuhan on Jan 12, 2020, returned to Shaanxi on Jan 15, 2020, and developed fever symptoms on Jan 20, 2020. The incubation period therefore lies between 5 and 8 days and is calculated as $t = \frac{5 + 8}{2} = 6.5$ days.

We propose a Monte Carlo simulation approach that takes advantage of bootstrap techniques to estimate the incubation median and mean from the small sample of 59 cases. It is more data-driven than traditional parametric approaches to parameter estimation for small datasets. The proposed simulation assumes we have collected a small incubation dataset $X$. We generate a large incubation sample $X_i$ by concatenating $n = 1000$ randomly sampled incubation segments $X_{ij}$, each of which contains a number of entries in the interval $[l_1, l_2]$ drawn independently from the existing dataset $X$, i.e., $X_i = \bigcup_j X_{ij}$. Then the large incubation sample median is calculated as $m_i = \mathrm{median}(X_i) + \sigma \mathcal{N}(0,1)$, where $\mathcal{N}(0,1)$ is added Gaussian noise and $\sigma \in [0,1]$ is a variance control parameter in the simulation. This procedure is repeated $N$ times, and the population median is estimated as $\hat{m} = \frac{1}{N} \sum_{i=1}^{N} m_i$. The estimated standard deviation $\hat{s}_m$ is calculated as the standard deviation of the median sequence $m_1, m_2, \cdots, m_N$. In our simulation, we chose $N = 100000$, $l_1 = 1$, $l_2 = 7$, and $\sigma = 0.2$ days, and conducted the simulations on Google Colab with TPU acceleration [9] [10]. The confidence level probability is calculated as the ratio $k/N$, where $k$ counts the number of times that $m_i$ falls in the interval $[\hat{m} - 2\hat{s}_m, \hat{m} + 2\hat{s}_m]$ in the simulation. Similarly, we can estimate the population mean in the same way, where the large incubation sample mean is calculated as $\mu_i = \mathrm{mean}(X_i) + \sigma \mathcal{N}(0,1)$. The population mean is estimated as $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} \mu_i$, together with its estimated standard deviation $\hat{s}_\mu$. According to the central limit theorem, the population mean follows a normal distribution. The confidence interval probability is calculated by following the same procedure.

Our data consist of 34 male cases, 24 female cases, and 1 case of unidentified gender from ten regions in China. All 59 cases have complete epidemiological descriptions of their exposure history. A total of 57 cases have complete information on age and gender; one case from Beijing has unknown age. The mean and standard deviation of age are 41.9 and 13.2 years. The mode of this dataset is 4.0 with 14 supporting cases. The minimum and maximum ages are 10 and 70 respectively. Table 1 shows incubation statistics for five groups: Male, Female, Age>=40, Age<40, and All (which includes all cases).
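The Monte Carlo procedure described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function and parameter names (`simulate_median`, `n_rep`, `n_seg`, `l1`, `l2`, `sigma`) and the data in `HYPOTHETICAL_DAYS` are hypothetical, the segment interval is read as the range of segment lengths, and the repetition count is kept far below the 100000 used in the paper so that the sketch runs quickly.

```python
import random
import statistics

# Hypothetical incubation periods in days (NOT the study's 59 cases).
HYPOTHETICAL_DAYS = [2.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 6.5, 7.0, 8.0, 10.0, 12.5]


def midpoint_incubation(t1, t2):
    """Represent an interval-valued incubation period [t1, t2] by its midpoint."""
    return (t1 + t2) / 2


def simulate_median(data, n_rep=300, n_seg=200, l1=1, l2=7, sigma=0.2, seed=0):
    """Monte Carlo estimate of the population median from a small sample.

    Each repetition concatenates n_seg segments resampled with replacement
    from data. Segment lengths are drawn uniformly from [l1, l2], which is
    an assumed reading of the text. Gaussian noise sigma * N(0, 1) is added
    to the median of each concatenated sample.
    """
    rng = random.Random(seed)
    medians = []
    for _ in range(n_rep):
        large_sample = []
        for _ in range(n_seg):
            seg_len = rng.randint(l1, l2)
            large_sample.extend(rng.choices(data, k=seg_len))
        medians.append(statistics.median(large_sample) + sigma * rng.gauss(0, 1))
    m_hat = statistics.fmean(medians)   # estimated population median
    s_hat = statistics.stdev(medians)   # its estimated standard deviation
    # Fraction of simulated medians within two standard deviations of m_hat.
    k = sum(m_hat - 2 * s_hat <= m <= m_hat + 2 * s_hat for m in medians)
    return m_hat, s_hat, k / n_rep
```

Calling `simulate_median(HYPOTHETICAL_DAYS)` returns the estimated median, its standard deviation, and the coverage ratio; replacing `statistics.median` with `statistics.fmean` inside the loop gives the analogous estimate of the mean. `midpoint_incubation(5, 8)` reproduces the 6.5-day value of the Case no. 2 example.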
It indicates that the incubation period median and mean of patients aged 40 years or older are greater than those of patients younger than 40. Similar patterns are also observed for the male and female groups. That the mean is always larger than the median for each group suggests right-skewed incubation distributions.

The incubation of COVID-19 does not fit any of the widely used incubation distributions, such as the normal, lognormal, Gamma, and Weibull distributions, well [12]. We employ Shapiro-Wilk tests rather than Kolmogorov-Smirnov tests for normality because we have only 59 records and Shapiro-Wilk tests perform much better on small datasets, with sample sizes from 3 to 5000 [13]. Table 2 shows the p-values of the Shapiro-Wilk tests for the normal and lognormal distributions as well as the goodness-of-fit tests for the Gamma and Weibull distributions obtained with the R package 'goft' [14]. We can strongly reject the normal, lognormal, and Weibull distributions at the 0.05 significance level [11] [12]. Although we cannot reject the Gamma distribution given its borderline p-value (0.06086), it can be risky to use it to fit and estimate the distribution of the incubation period with such a small sample size. We further compute the maximum likelihood estimates (MLEs) of the gamma parameters, conduct the goodness-of-fit test, and obtain p-value=8.6807e-04.
As such, we rely only on nonparametric techniques in the data analysis rather than on any pre-assumed distribution. The Mann-Whitney rank test shows that there are no significant differences between the incubation of males and females. Pearson correlation coefficient analysis gives R = 0.244 with p-value 0.06758. It suggests that the incubation period is somewhat, though not strongly, correlated with age. To verify whether the incubation of the age>=40 group differs statistically from that of the age<40 group, Figure 3 compares the incubation periods of the two groups using different visualization tools. It indicates that the younger group tends to have a shorter incubation period. The variance of its incubation period also seems to be smaller. The Mann-Whitney rank test shows a statistically significant difference between the incubation of the age<40 and age>=40 groups under the null hypothesis that the medians of the incubation period of the two groups are the same. The p-value for the corresponding alternative hypothesis, that the age<40 group has a smaller incubation median, is 0.00474. It suggests that the age<40 group has a shorter incubation period than the age>=40 group. Similarly, the Siegel-Tukey test indicates that the younger group's incubation variance is smaller than the older group's, with p-value 0.0083. It may suggest that COVID-19 has a faster but relatively constant spread speed among people <40 years old compared with people >=40 years old.
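The one-sided Mann-Whitney comparison of the age<40 and age>=40 groups can be illustrated with a small pure-Python sketch. The function name `mann_whitney_less` is hypothetical, and the p-value uses the normal approximation with tie and continuity corrections, which may differ slightly from the exact procedure behind the p-values reported in this paper.

```python
import math


def mann_whitney_less(x, y):
    """One-sided Mann-Whitney U test (alternative: x tends to be smaller than y).

    Uses midranks for ties and the normal approximation with tie and
    continuity corrections. Assumes the pooled values are not all identical.
    Returns (U statistic for x, approximate one-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for idx in range(i, j + 1):
            ranks[idx] = midrank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u + 0.5) / math.sqrt(var_u)  # continuity-corrected
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
    return u_x, p_value
```

Given two lists of incubation periods, for example `younger_days` and `older_days` (hypothetical names), `mann_whitney_less(younger_days, older_days)` returns a small p-value when the younger group tends to have shorter incubation periods. A small U statistic for the first group is what drives the p-value down.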
Figure 4 illustrates biplots of the dataset, after removing two cases with missing items, obtained with PCA (principal component analysis), sparse PCA (sparse principal component analysis), t-SNE (t-distributed stochastic neighbor embedding), and LLE (locally linear embedding) [16] [17] [18] [19]. The data are partitioned into the age>=40 and age<40 groups in the visualization. t-SNE shows that only one case in the age>=40 group falls in the cluster of the age<40 group. PCA, sparse PCA, and LLE, however, all show that the incubation of the two groups is linearly separable, which means there exists an obvious linear boundary separating them in the subspaces generated by PCA, sparse PCA, or LLE. Such machine learning results indicate that the two groups are actually independent clusters spatially. However, we also find that the incubation data no longer demonstrate linear separability when we partition them into age>=50 and age<50 groups or age>=55 and age<55 groups. Together with the previous statistical analysis, this suggests that age 40 can be a key age cutoff for the incubation of COVID-19.

We further estimated the population mean and standard deviation, the median, and the 2.5th and 97.5th percentiles by using bootstrap with $10^6$ resamples. Table 3 ... [21]. It suggests that COVID-19 could have a faster spread speed than H7N9, but the same spread speed as SARS and MERS in terms of their incubation periods [22].
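The bootstrap estimation of the mean, median, and percentiles mentioned above can be sketched as follows. This is an illustrative percentile bootstrap under assumptions: the helper names `percentile` and `bootstrap_ci` are hypothetical, the default of 2000 resamples is chosen for speed rather than matching the $10^6$ resamples reported here, and the paper does not state which interval construction it used.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of pre-sorted data."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def bootstrap_ci(data, stat, n_boot=2000, level=95, seed=0):
    """Percentile bootstrap: point estimate of stat plus a confidence interval.

    Resamples data with replacement n_boot times, evaluates stat on each
    resample, and takes the central `level` percent of the replicates.
    """
    rng = random.Random(seed)
    reps = sorted(stat(rng.choices(data, k=len(data))) for _ in range(n_boot))
    tail = (100 - level) / 2
    return stat(data), (percentile(reps, tail), percentile(reps, 100 - tail))
```

For a list of incubation periods `days` (a hypothetical name), `bootstrap_ci(days, statistics.fmean)` gives the mean with its interval, and `bootstrap_ci(days, lambda s: percentile(sorted(s), 97.5))` does the same for the 97.5th percentile, the quantity most relevant to choosing a quarantine length.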
The existing spread of COVID-19 is faster than that of SARS partially because it has more complicated spread dynamics [2]. For example, those without clinical symptoms can still spread the virus even if they are 'officially negative' in the COVID-19 virus test [22]. We also investigate the incubation periods of 12 family cases and 47 non-family cases in the dataset. The family cases simply refer to patients who contracted COVID-19 because their family members had been infected. The Mann-Whitney rank test shows no significant difference between family and non-family patients in terms of the median incubation period. The Siegel-Tukey test on scale also verifies that the incubation scales are at the same level for family and non-family patients.

Our study indicates that the incubation periods of the age>=40 years and age<40 years groups are not only statistically significantly different but also linearly separable in machine learning. It may suggest that different treatments should be considered for the two groups. It would be more informative to estimate the incubation time of each group separately. That the estimated 97.5th percentile of COVID-19 incubation is 12.89 days (95% CI: (11.00, 16.13)) may suggest that a longer isolation or quarantine time (e.g. 17 days) can be better than the widely accepted 14 days. Furthermore, different quarantine times should be applied to the age>=40 years and age<40 years groups because of their different incubation periods. Generally speaking, a longer quarantine time can be considered for older patients (>=40 years) than for younger patients (<40 years).
Our ongoing work is to collect more qualified data to extend our existing results and to investigate the incubation of COVID-19 for different groups, in addition to comparing our incubation estimates with those of other studies [23].

A novel coronavirus genome identified in a cluster of pneumonia cases - Wuhan
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
The Extent of Transmission of Novel Coronavirus in Wuhan, China, 2020
Lin Yang and Daihai He (2020) Preliminary estimation of the basic reproduction number of novel coronavirus in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak. bioRxiv
Comparative epidemiology of human infections with avian influenza A H7N9 and H5N1 viruses in China: a population-based study of laboratory-confirmed cases
Estimation of the incubation period of influenza A (H1N1-2009) among imported cases: addressing censoring using outbreak data at the origin of importation
First travel-related case of 2019 novel coronavirus detected in United States
Kolmogorov-Smirnov one-sample test
Hiroshi Nishiura. Early efforts in modeling the incubation period of infectious diseases with an acute course of illness
A modification of the test of Shapiro and Wilk for normality
Statistical analysis of ordinal categorical status after therapies
Nonnegative principal component analysis for cancer molecular pattern discovery
Application of t-SNE to human genetic data
Structured Sparse Principal Component Analysis
Middle East respiratory syndrome coronavirus (MERS-CoV) entry inhibitors targeting spike protein
Estimating the Distribution of the Incubation Periods of Human Avian Influenza A(H7N9) Virus Infections
Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia
Clinical characteristics of 2019 novel coronavirus infection in China