Estimation of the incubation time distribution for COVID-19

Piet Groeneboom

September 26, 2020

We consider nonparametric estimation of the incubation time distribution of COVID-19. The Dutch Centre for Infectious Disease Control (Dutch: RIVM) analyzes in [1] a data set of 88 travelers who are assumed to have picked up the COVID-19 virus in Wuhan. The distribution of their incubation times is estimated there using certain simple parametric distributions, such as the Weibull, log-normal and gamma distributions.

If the only thing we know about the start of the incubation time is that it belongs to an interval [0, E_i], the log likelihood for one observation is:

log ∫_{t=0}^{E_i} g(S_i − t) dF_i(t).

Here E_i is the upper bound of the infection interval, for which we take (looking back) 0 as the left point for the i-th individual (see [2]), S_i ≥ E_i is the time at which the person becomes symptomatic (note that S_i = E_i is possible), g is the density of the incubation time, and F_i is the distribution function of the time of a possible contact with an infector. The exit times and times of becoming symptomatic of the 88 Wuhan travelers are shown in Table 1.

It is clear that, without further assumptions, g and F_i are not identifiable. To remedy this, we assume, as in [1] (see also [9]), that F_i is the uniform distribution on [0, E_i]. If we want to use maximum likelihood, we have to maximize

∑_{i=1}^n log { ∫_{t=0}^{E_i} g(S_i − t) dt / E_i },

and since the factors 1/E_i do not matter in the maximization problem, we end up with the problem of maximizing

∑_{i=1}^n log { G(S_i) − G(S_i − E_i) },   (1)

where G is the incubation time distribution function.

The algorithms we used for analyzing the data set can be found on [3]. We describe the data files given there. The original data file is data_Wuhan_tsv, which gives details on the persons in the sample and which can be found in [1].
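To make the criterion concrete, here is a minimal Python sketch that evaluates ∑_i log{G(S_i) − G(S_i − E_i)} for a candidate distribution function G. The author's actual scripts are in R and C++ (see [3]); the exponential G and the toy data below are purely illustrative.

```python
import math

def log_likelihood(G, data):
    """Evaluate sum_i log{G(S_i) - G(S_i - E_i)} for a candidate
    incubation time distribution function G, given (E_i, S_i) pairs."""
    total = 0.0
    for E, S in data:
        p = G(S) - G(S - E)
        if p <= 0.0:
            return float("-inf")  # candidate G puts no mass on this interval
        total += math.log(p)
    return total

# Illustrative candidate: exponential distribution function, two toy observations.
G_exp = lambda t: 1.0 - math.exp(-0.2 * t) if t > 0 else 0.0
print(log_likelihood(G_exp, [(5, 8), (3, 10)]))
```

Maximizing this quantity over all distribution functions G, or over a parametric family, is the common starting point for the parametric and nonparametric estimators discussed below.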
This was transformed into a data file transformed_data_Wuhan.txt, consisting of three columns, giving, respectively, the arrivals in and departures from Wuhan and the time the person became symptomatic. If the arrival time was unknown, this time was set to −18, which means 18 days before December 31, 2019, which is the zero on the time scale. For traveler number 67, who apparently had a connecting flight, the duration of stay in Wuhan was changed from 0 to 1 day.

This, in turn, was transformed into the input file inputdata_Wuhan.txt, where the time spent in Wuhan was shifted to make its left point equal to zero. It consists of two columns: the first column contains the data S_i − E_i (time of becoming symptomatic minus exit time from Wuhan) and the second S_i, the time of becoming symptomatic, where all times are shifted to have entrance time zero. If the person became symptomatic in Wuhan we put S_i − E_i = 0.

Assuming that the distribution of the possible time of infection is uniform on the exposure interval, and estimating the distribution function G by the Weibull distribution, parametrized as

F(x; a, b) = 1 − exp(−b x^a), x > 0,   (2)

we get as our maximum likelihood estimators of the parameters a and b:

â = 3.03514, b̂ = 0.002619.   (3)

The Weibull maximum likelihood estimate was computed by two methods. One is a very simple method using Weibull.cpp, which is used in analysis_EM.R and analysis_ICM.R, where the nonparametric estimate to be discussed in the next sections is also computed. This "pattern search" algorithm for finding the parameters of the Weibull distribution does not require computing the derivatives of the log likelihood; it is based on the Hooke-Jeeves algorithm. The other method can be found in R_Weibull_Estimation.R, where we use the R package lbfgs, and where the gradient (the derivatives of the log likelihood) has to be provided. The results of the two algorithms for the Weibull approach are remarkably similar.
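The derivative-free approach can be sketched in Python as follows. This is not the author's Weibull.cpp, but a bare-bones pattern search of the Hooke-Jeeves type applied to the interval-censored Weibull likelihood with the parametrization F(x) = 1 − exp(−b x^a); the toy data, starting values, and step sizes are illustrative assumptions.

```python
import math

def weibull_cdf(x, a, b):
    """Weibull distribution function F(x) = 1 - exp(-b * x^a), x > 0."""
    return 1.0 - math.exp(-b * x ** a) if x > 0 else 0.0

def neg_log_lik(params, data):
    """Minus the log likelihood sum_i log{F(S_i) - F(S_i - E_i)}."""
    a, b = params
    if a <= 0 or b <= 0:
        return float("inf")
    total = 0.0
    for E, S in data:
        p = weibull_cdf(S, a, b) - weibull_cdf(S - E, a, b)
        if p <= 0.0:
            return float("inf")
        total -= math.log(p)
    return total

def hooke_jeeves(f, x0, steps, tol=1e-6, shrink=0.5):
    """Bare-bones Hooke-Jeeves pattern search: try moves of +/- step in each
    coordinate, keep improvements, halve the steps when nothing improves."""
    x, fx = list(x0), f(x0)
    steps = list(steps)
    while max(steps) > tol:
        improved = False
        for i in range(len(x)):
            for d in (steps[i], -steps[i]):
                y = list(x)
                y[i] += d
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
                    break
        if not improved:
            steps = [s * shrink for s in steps]
    return x, fx

# Illustrative interval-censored data (E_i, S_i) -- not the Wuhan data.
data = [(5, 8), (3, 10), (2, 7), (4, 12), (6, 9), (1, 5), (7, 11)]
(a_hat, b_hat), nll = hooke_jeeves(lambda p: neg_log_lik(p, data),
                                   x0=[1.0, 0.01], steps=[0.5, 0.005])
print(a_hat, b_hat, nll)
```

No gradients are needed, which is the practical appeal of pattern search here; the lbfgs route trades this simplicity for faster convergence at the cost of hand-coding the derivatives.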
The values in (3) were produced by the R script in [3], using the Hooke-Jeeves algorithm. For a convergence proof of the Hooke-Jeeves algorithm and interesting further discussion of pattern search algorithms, see [8] and [11].

Letting p_j denote the probability mass of G on day j, the EM iterations for the MLE maximizing (1), without making this parametric restriction, are in this case given by:

p_j^{(k+1)} = p_j^{(k)} · n^{−1} ∑_{i=1}^n 1{S_i − E_i < j ≤ S_i} / ∑_l p_l^{(k)} 1{S_i − E_i < l ≤ S_i},   (4)

where the ratios are zero if the denominators are zero. The implementation of this algorithm for the present situation can be found in analysis_EM.R in [3]. The EM iterations were started with the discrete uniform distribution on the 43 points 1, …, 43, which corresponds to the range of values (days) in Table 1, but after 10,000 iterations the mass had withdrawn to the 7 points 3, …, 9, which leads to the discrete distribution function shown in Figure 1. A bar chart of the corresponding probability masses is shown in Figure 2. It is seen that this is a bimodal discrete probability distribution with modes at 4 and 9 days, respectively, with the highest value at the second mode. This discrete probability distribution is also given in Table 2.

The iteration steps (4) follow from the so-called self-consistency equations, which are derived by differentiating the criterion function; multiplying these relations by p_j and summing over j yields λ = 1 for the Lagrange multiplier, using the side condition (6) that the p_j sum to one. But the relations (7) only hold for the active (in this case 7) parameters p_j > 0 of the solution; in the iterations (4) the inactive parameters p_j will tend to zero. For more details, see, e.g., [7], Section 7.2.

Because of the monotonicity of the distribution function G, maximizing the log likelihood over all distribution functions G is an isotonic regression problem, which can be solved by specific isotonic methods. In the present case we can apply the iterative convex minorant algorithm, discussed in [7], Section 7.3.
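One step of the self-consistency iteration can be sketched in Python as follows (the author's implementation is analysis_EM.R in [3]; the intervals below are illustrative, not the Wuhan data). Each observation restricts the incubation time, in whole days, to the interval (S_i − E_i, S_i].

```python
def em_step(p, intervals):
    """One self-consistency (EM) step for probability masses p[j-1] on days
    j = 1..m, given observations saying the incubation time lies in (T, U]."""
    m, n = len(p), len(intervals)
    new = [0.0] * m
    for T, U in intervals:
        denom = sum(p[j] for j in range(m) if T < j + 1 <= U)
        if denom == 0.0:
            continue  # ratio is taken to be zero when the denominator is zero
        for j in range(m):
            if T < j + 1 <= U:
                new[j] += p[j] / denom
    return [v / n for v in new]

# Illustrative run: uniform start on days 1..10, three censoring intervals.
p = [0.1] * 10
intervals = [(0, 5), (2, 8), (3, 10)]
for _ in range(500):
    p = em_step(p, intervals)
print([round(v, 4) for v in p])
```

As in the text, mass on days outside the support of the solution is driven to zero by repeated multiplication with ratios below one, while the total mass is preserved at every step.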
The log likelihood is of the type

∑_i k_i log { G(U_i) − G(T_i) },

where k_i is the number of observations equal to (T_i, U_i), and where

(T_i, U_i) = (S_i − E_i, S_i), S_i = V_i + W_i,

where n = 88, and where V_i is the infection time, W_i the incubation time and, as before, E_i the exit time of the travelers from Wuhan; all observations are centred by subtracting the entrance time.

We first make the so-called preliminary reduction to reduce the problem to a maximization problem in the interior of a convex cone of the type {y = (y_1, …, y_m)^T : 0 < y_1 ≤ ⋯ ≤ y_m}. For the Wuhan data set it can be checked that, without loss of generality, G(i) = 0 for i ≤ 2 and G(i) = 1 for i ≥ 9, since in this case values strictly between 0 and 1 at these points can only make the likelihood smaller. If we make this preliminary reduction, the log likelihood for the ordered parameters y_i, representing the values of the distribution function G at the observation points, becomes:

f(y) = ∑_{0 ≤ i < j ≤ 7} N_ij log(y_j − y_i),   (8)

where y_i = G(i + 2), i = 0, …, 7, y_0 = 0, y_7 = 1, and where N_ij counts the observations (T_k, U_k) whose likelihood contribution reduces to log(y_j − y_i) after the preliminary reduction.

We have to maximize (8) under the restriction 0 < y_1 ≤ ⋯ ≤ y_6; by the preliminary reduction, we lost the additional condition y_6 < 1. Let y = (y_1, …, y_6)^T. The (Fenchel) necessary and sufficient conditions for the solution are:

∑_{j=i}^{6} ∂f/∂y_j(y) ≤ 0, i = 1, …, 6,   (12)

∑_{j=1}^{6} y_j ∂f/∂y_j(y) = 0,   (13)

where f is defined by (8). Since the values y_i are strictly between 0 and 1, (12) and (13) can only hold simultaneously if the cumulative sums in (12) vanish at the points where the solution increases, and we can therefore turn (13) into equality versions of the conditions in (12) at those points.

The resulting (nonparametric) MLE F̂_n is shown in Figure 1, together with the MLE assuming that G is a Weibull distribution. The EM algorithm and the iterative convex minorant (ICM) algorithm give exactly the same solution, but the ICM algorithm needs fewer iterations (106 in this case; the EM algorithm needs between 1,000 and 10,000 iterations). To compute the MLE via the iterative convex minorant algorithm, we have to construct so-called cusum (cumulative sum) diagrams.
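The workhorse of the ICM algorithm is taking left derivatives of the greatest convex minorant of a cusum diagram. A generic pool-adjacent-violators sketch in Python (not the author's analysis_ICM.R; the diagram in the example is made up):

```python
def gcm_left_slopes(points):
    """Left derivatives (slopes) of the greatest convex minorant of a
    cusum diagram, given as a list of (x, y) points starting at (0, 0).
    Returns one slope per original segment, obtained by pooling adjacent
    segments whose slopes violate convexity (pool-adjacent-violators)."""
    blocks = []  # each block: [dx, dy, number of pooled segments]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        blocks.append([x1 - x0, y1 - y0, 1])
        # pool while the previous block's slope >= the new block's slope
        while len(blocks) >= 2 and \
                blocks[-2][1] * blocks[-1][0] >= blocks[-1][1] * blocks[-2][0]:
            dx, dy, k = blocks.pop()
            blocks[-1][0] += dx
            blocks[-1][1] += dy
            blocks[-1][2] += k
    slopes = []
    for dx, dy, k in blocks:
        slopes.extend([dy / dx] * k)
    return slopes

# The convex minorant of (0,0), (1,2), (2,2), (3,5) pools the first two
# segments (slopes 2 and 0) into a single block of slope 1:
print(gcm_left_slopes([(0, 0), (1, 2), (2, 2), (3, 5)]))  # → [1.0, 1.0, 3.0]
```

The returned slope vector is automatically nondecreasing, which is exactly the monotonicity constraint on the vector y of values of the distribution function.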
The cusum diagram consists of the point (0, 0) and the points

( ∑_{j≤i} w_j , ∑_{j≤i} { w_j y_j + ∂f/∂y_j(y) } ), i = 1, …, 6,

where the w_j are positive weights determined by the current iterate y (for example minus the second derivatives, w_j = −∂²f/∂y_j²(y)). At each iteration step the left-derivative vector ȳ of the greatest convex minorant of the cusum diagram is computed on the basis of the current value y, and the stationary point of this iteration is the solution of the optimization problem. We perform a line search in case the full step to ȳ would not lead to improvement or would go out of bounds (which does not happen in the present case). For more theory, see [7].

As in [7], Section 1.2, we can compute the smoothed maximum likelihood estimator (SMLE) and also an estimate of the density. The SMLE is defined by

F̃_nh(t) = ∫ IK((t − x)/h) dF̂_n(x),   (16)

where h > 0 and IK is an integrated kernel,

IK(x) = ∫_{−∞}^{x} K(u) du.

Here K is a symmetric kernel with support [−1, 1], for example the triweight kernel

K(u) = (35/32)(1 − u²)³ 1_{[−1,1]}(u).

We estimate the density by

f̃_nh(t) = h^{−1} ∫ K((t − x)/h) dF̂_n(x).   (19)

For the present analysis we took h = 3.6 in (16) and h = 4.6 in (19); these bandwidths were chosen by a bootstrap method, explained in Section 3. The resulting estimates are shown in Figure 3.

Let the random variables E_i, with values on the integers ("days") in the interval [1, 42], represent the exit times ("length of stay in Wuhan"). A histogram of the 88 exit times E_i for the travelers from Wuhan is shown in Figure 4. Furthermore, let V_i denote the (unknown) infection time, which we take, conditionally on E_i, to be uniform on [0, E_i], and let W_i denote the (again unknown) incubation time. Our observations are the pairs (E_i, U_i), where U_i = V_i + W_i. To determine the bandwidth h of our density estimator f̃_nh, where F̂_n is the MLE of the distribution function F of the incubation time, we follow a method somewhat similar to the method used in [10].
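The SMLE and the density estimate are plain kernel smooths of the discrete MLE. A Python sketch for a discrete MLE with mass p_j at point x_j (the triweight kernel and its integral are as defined above; the mass points in the example are illustrative, not the estimates of Table 2):

```python
def K(u):
    """Triweight kernel (35/32)(1 - u^2)^3 on [-1, 1]."""
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if -1.0 <= u <= 1.0 else 0.0

def IK(u):
    """Integrated triweight kernel IK(u) = int_{-1}^{u} K(v) dv."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 + (35.0 / 32.0) * (u - u ** 3 + 0.6 * u ** 5 - u ** 7 / 7.0)

def smle(masses, t, h):
    """SMLE: F~_nh(t) = sum_j p_j IK((t - x_j)/h) for a discrete MLE."""
    return sum(p * IK((t - x) / h) for x, p in masses)

def density_estimate(masses, t, h):
    """Density estimate: f~_nh(t) = h^{-1} sum_j p_j K((t - x_j)/h)."""
    return sum(p * K((t - x) / h) for x, p in masses) / h

# Illustrative discrete MLE on days 3..9 (masses sum to 1).
masses = [(3, 0.05), (4, 0.25), (5, 0.10), (6, 0.10),
          (7, 0.10), (8, 0.10), (9, 0.30)]
print(smle(masses, 6.0, 3.6), density_estimate(masses, 6.0, 4.6))
```

Since the MLE is discrete, the integrals in (16) and (19) reduce to finite sums over the mass points, which is how the smooth curves in Figure 3 are obtained.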
We take B = 10,000 bootstrap samples of observations (E_i, U*_i), corresponding to the observations (E_i, U_i), where the E_i are the (actual) exit times of which Figure 4 shows the histogram, and where the U*_i are generated as the sums (rounded to the nearest integer) of a Uniform(0, E_i) random variable V*_i and a random variable W*_i, generated from the density f̃_{nh_0} by rejection sampling for a fixed h_0, for which we took h_0 = 4 in the present case. Note that we keep the E_i the same as in the original sample, somewhat analogously to the procedure followed in [10], which relieves us of the duty to estimate the exit time distribution. Next we computed

M̂SE_f(h) = B^{−1} ∑_{b=1}^{B} ∫ { f̃*_{nh,b}(t) − f̃_{nh_0}(t) }² dt,

where f̃*_{nh,b} is the density estimate computed from the b-th bootstrap sample. The resulting loss function M̂SE_f(h) is shown in Figure 5, which gave the minimizing bandwidth ĥ ≈ 4.6. Taking h_0 = 3 in our reference function f̃_{nh_0} gave the same minimizing value. The (approximate) independence of the starting value h_0 was also observed for the analogous bandwidth selection procedure in [10]. Similarly, we computed M̂SE_F(h) as a function of h by the same bootstrap procedure, where the SMLE F̃*_{nh} was computed for the bootstrap samples. The integrals were approximated by Riemann sums with step size 0.1 on the interval [0, 14]. The R scripts for this procedure can again be found on [3].

The method used here is called the "smoothed bootstrap". Another option is to use the ordinary bootstrap with smaller sample sizes (since otherwise the bias is not estimated in the right way by this method). This method was for example applied in [6] and implemented in the R package curstatCI ([5]) for the current status model. We tried this method out, but found its behavior unsatisfactory in the present case, possibly because of the small sample size.

We offered an alternative nonparametric approach to the estimation of the incubation time distribution, which was estimated by parametric methods in [1] for a data set of travelers from Wuhan.
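The bootstrap observations U*_i = round(V*_i + W*_i) rest on rejection sampling from the reference density. A generic Python sketch (the triangular density in the example is a stand-in for f̃_{nh_0}, which has bounded support; the function names are illustrative):

```python
import random

def rejection_sample(f, lo, hi, fmax, rng=random):
    """Draw one value from a density f supported on [lo, hi] with f <= fmax,
    by accept/reject against a uniform envelope over [lo, hi] x [0, fmax]."""
    while True:
        x = rng.uniform(lo, hi)
        if rng.uniform(0.0, fmax) <= f(x):
            return x

def bootstrap_observation(E, f, lo, hi, fmax, rng=random):
    """One bootstrap pair (E, U*): U* = round(V* + W*), with V* uniform
    on (0, E) and W* drawn from the density f by rejection sampling."""
    V = rng.uniform(0.0, E)
    W = rejection_sample(f, lo, hi, fmax, rng)
    return E, round(V + W)

# Illustrative check with a triangular density f(x) = 2x on [0, 1].
random.seed(1)
draws = [rejection_sample(lambda x: 2.0 * x, 0.0, 1.0, 2.0)
         for _ in range(2000)]
print(sum(draws) / len(draws))  # should be close to the true mean 2/3
```

Keeping the E_i fixed, one such pair is generated per traveler in each of the B bootstrap samples, after which the density and distribution function estimates are recomputed for each candidate h.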
In this way we do not have to choose a parametric distribution, like the Weibull, log-normal or gamma, as in [1], but compute a nonparametric maximum likelihood estimate instead, which does not need an arbitrary choice of parametric family at all. However, to give a smooth estimate of the distribution function and the (continuous) density, we have to choose a bandwidth parameter. For this choice a smoothed bootstrap approach was given, which makes the choice data-adaptive. The present paper can be considered the technical companion of the column [4], to which, however, the automatic bandwidth choice was added. All computations are given as R scripts in [3].

References

- Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China
- Estimation in emerging epidemics: biases and remedies
- The Netherlands in Times of Corona
- The nonparametric bootstrap for the current status model
- Nonparametric Estimation under Shape Constraints
- Optimization by direct search: new perspectives on some classical and modern methods
- Estimating incubation period distributions with coarse data
- Model based bootstrap methods for interval censored data
- On the convergence of pattern search algorithms