key: cord-0498616-lv3d1c95 authors: Gourieroux, Christian; Jasiak, Joann title: Time Varying Markov Process with Partially Observed Aggregate Data; An Application to Coronavirus date: 2020-05-09 journal: nan DOI: nan sha: 9ea60fa23541771c991f7dee4305795a868bf8d3 doc_id: 498616 cord_uid: lv3d1c95 A major difficulty in the analysis of propagation of the coronavirus is that many infected individuals show no symptoms of Covid-19. This implies a lack of information on the total counts of infected individuals and of recovered and immunized individuals. In this paper, we consider parametric time varying Markov processes of Coronavirus propagation and show how to estimate the model parameters and approximate the unobserved counts from daily numbers of infected and detected individuals and total daily death counts. This model-based approach is illustrated in an application to French data. The aim of this paper is to address the problem of partial observability, encountered recently in epidemiological research on Covid-19. More specifically, some individuals are infected and asymptomatic. Therefore, they remain undetected and not recorded 1 . As a consequence, the total count of recovered and immunized individuals is also unknown, as only the number of recovered and detected individuals is available. This problem of partial observability of counts renders difficult the estimation of an epidemiological SIRD (Susceptible, Infected, Recovered, Deceased) model, extended to disentangle the infected and undetected from the infected and detected individuals. Moreover, such substantial undocumented infection can facilitate the rapid propagation of the virus (Li et al.(2020) ). There are two solutions to this difficulty: The first one is to sample the population daily and perform serological tests on those samples in order to estimate the proportions of infected and undetected and of recovered individuals. The current epidemic shows that validating and producing reliable serological tests can take time. Moreover, regularly performed samplings can be costly especially in terms of time of health care providers. The second approach, proposed in this paper, is purely model-based. Loosely speaking, under the standard extended SIRD model, the evolution of death rates might be different, depending on whether all infected individuals are detected or not. This implied difference will allow us for a model-based estimation of the proportions of infected undetected individuals (resp. recovered immunized) [see, Verity et al.(2019) for pure model based estimation of coronavirus infection, Manski,Molinari (2020) for set estimation of the infection rate]. The paper discusses the general case of time varying Markov processes when aggregate counts are partially observed. Section 2 describes the latent model of qualitative individual histories. These histories follow a time varying Markov process with transition probabilities that can depend on latent counts and unknown parameters. The observations are functions of the frequencies of the different states (compartments in epidemiological terminology), although not all of those frequencies are observed, in general. More specifically, only some states can be observed and/or a sum of frequencies over subsets of states can be observed. Section 3 introduces the estimation method, which jointly estimates the unknown parameters and the unknown state probabilities. We derive the asymptotic properties of the estimators under identification. Identification, which is the main challenge of the proposed approach is the topic of Section 4. First, we discuss the identification in a homogeneous Markov, i.e. when the transition matrix is not time varying. Without additional restrictions on the transition probabilities, that model is not identifiable and the proposed approach cannot be used. However, it is not the case for a time varying Markov process that includes contagion effects and, and in particular for the SIR-type models used in epidemiology. The estimation approach is illustrated in Section 5 with a SIR type model for French data. Section 6 concludes. Some technical problems are discussed in the Appendices. We consider a large panel of individual histories Y i,t , i = 1, ..., N, t = 1, .., T ,where the latent variable is qualitative polytomous with J alternatives denoted by j = 1, .., J. Assumption A1: The individual histories are such that: i) The variables Y i,t , i = 1, .., N , at t fixed, have the same marginal distributions. This common marginal distribution is discrete and summarized by the J-dimensional vector p(t), with components: ii) The processes {Y i,t , t = 1, .., T }, i = 1, .., N are independent (heterogeneous) Markov processes with transitions from date t − 1 to date t summarized by a J × J transition matrix P [p(t − 1); θ], parametrized by θ. This matrix is such that each row sums up to 1. Thus, we consider a discrete time model applicable to data on a homogeneous population of risks. Let f (t) denote the cross sectional frequency, i.e. the sample counterpart of p(t). It follows from the standard limit theorems that: Proposition 1: Under Assumption A1 ,the f (t)'s are asymptotically normal for large N . This specification of the transition matrix includes the homogeneous Markov chain, when there is no effect of lagged p(t−1). It also includes the standard contagion models of SIR type used in epidemiology [see, McKendrick (1926) , Kermack, McKendrick (1927) for early articles on SI and SIR models in the literature, Hethcote (2000), Brauer et al.(2001) , Vinnicky, White (2010) for general presentations of epidemiological models, Allen (1994) for their discrete time counterparts, Gourieroux, Jasiak (2020) for an overview, and also examples given below]. As vectors p(t) change over time, stationarity is not assumed. In practice, the individual histories may not be observed, while cross-sectional frequencies are available. These can be the frequencies f (t), t = 1, ..., T , or aggregates of such frequencies. Assumption A2: The observations are:Â t = Af (t), t = 1, .., T , where A is a K × J state aggregation matrix, that is a matrix with rows containing zeros and ones. The aggregation matrix is known and of full rank K. Example 1: When A = Id, all f (t)'s are observed. This is the case considered in McRae (1977), Miller, Judge (2015) . Example 2: In a model of the coronavirus propagation, the following 5 individual states can be distinguished: 1 = S, for Susceptible, 2 = IU , for Infected and Undetected, 3 = ID for Infected and Detected, 4 = R for Recovered, and 5 = D for Deceased. The observed frequences are f 3 (t) and f 5 (t), but not the other frequencies. We have a 2 × 5 matrix A given by: that characterizes the selection of the frequencies. Example 3: In some other applications, matrix A truly aggregates the frequencies, as, for instance, in cascade processes and percolation theory applied to an epidemiological model 2 . Let us consider a country with two regions and a SI model distinguishing these regions. We get a 4 state model: 1=S1, susceptible in region 1, 2=S2, susceptible in region 2, 3=I1, infected in region 1, 4=I2, infected in region 2. A propagation model can be written at a disaggregate level to account for both propagation within and between regions. Thus, there is a competition between the contagions coming from regions 1 and 2. However, only aggregate data for the entire country may be available. Then, the aggregating matrix A is equal to: Although the process of aggregate counts: f 1 (t)+f 2 (t), f 3 (t)+f 4 (t) may not be Markov, in general, it is important to consider the special case when it is, and then explore the possibility of identifying the parameters of regional, i.e. disaggregated dynamics. This is exactly the objective of percolation theory [see, Garet, Marchand (2006) for a detailed analysis of competing contagion sources]. Under Assumption A1, we can use the Bayes formula to relate the marginal theoretical probabilities p(t) to the transition probabilities by: The nonlinear recursive equation (3.1) is the discrete time counterpart of the deterministic differential system commonly used in epidemiology [see, Gourieroux, Jasiak(2020) ]. These equations will be used as the estimating equations in the asymptotic least squares estimation method outlined below 3 . In our framework, the parameter of interest includes θ as well as the sequence of vectors p(t). They can be jointly estimated from the following optimization: where ||.|| denotes the Euclidean norm. Proposition 2: If the constrained optimization given above has a unique solution that is continuously differentiable with respect to Af (t) =Â t , t = 1, ..., T , then the estimator is asymptotically consistent, converges at rate 1/ √ N , and is asymptotically normally distributed. The expression of the asymptotic variance-covariance matrix is derived in Appendix 2. If A = Id, that is, if all frequencies are observed, we obtain the case analysed in McRae (1977) . In the general framework, this optimization is not only used to estimate parameter θ, but also to approximate the unobserved marginal probabilities. The condition given in Proposition 2 is an identification condition, which is discussed in detail in the next section. In this section we discuss the (asymptotic) identification corresponding to the objective function given in Section 3. This objective function has a simple form, as it is quadratic under linear constraints with respect to the sequence p(t). This allows us for an optimisation in two steps: first with respect to the p(t)'s, and next, with respect to θ after concentrating. This is the approach used below for identification 4 . By taking into account the fact that probabilities sum up to one, we can compare the number of moment conditions equal to (J − 1)(T − 1) with the number of parameters of interest that is (J − K − 1)T + dim θ, iff KT ≥ dim θ. Therefore the order condition is satisfied iff the number of days T is sufficiently large. Let us now consider the rank condition. For ease of exposition, we consider a homogenous Markov model with 3 states: J = 3. The parameter θ includes the elements of the transition matrix P , which has 6 independent components, given that each row of P sums up to 1. We assume that the observed marginal probabilities are p 3 (t), t = 1, .., T . The Bayes formula: leads to 2(T − 1) independent moment restrictions that are the estimating equations: or equivalently, To discuss identification, we look for the solutions in P and p(t), t = 1, ..., T of system (4.2) written for t = 2, ..., T . We have the following result: Proposition 3: Generically, i.e. up to a (Lebesgue) negligible set of parameter values,and if T ≥ 6, we have that: i) Parameter P is not identifiable, with an under-identification order equal to 3. ii) There exist 3 functions of P that are identifiable. These functions are independent of T . iii) These functions are over-identified with an over-identification order equal to T − 5. Proof: The proof is based on a concentration with respect to the values of p 2 (t). From the second equation of system (4.3), we see that p 2 (t − 1) is a linear affine function of p 3 (t), p 3 (t − 1), with coefficients that depend on P . These linear affine expressions can be substituted into the first equation of system (4.3) to show that the observed sequence p 3 (t) satisfies a linear affine recursion of order 2: with coefficients that depend on P . The results follow since: i) the functions a(P ), b(P ), c(P ) are identifiable; ii) the degree of under-identification of P is: 6-3=3; iii) the degree of over-identification of the identifiable parameters is: Appendix 3 provides the expressions of functions a(P ), b(P ), c(P ) and points out that Proposition 3 does not apply only under circumstances that are negligible. In particular, identification requires that observations p 3 (t) correspond to a nonstationary episode as shown in the remark below. Remark 1: Let π denote the stationary probability solution of the Markov chain, defined by: If the observed p 3 (t) = π 3 were associated to a stationary episode, the only identifiable function of parameters would be π 3 (P ) and the under-identification degree would be equal to 6-1=5. Therefore, observing the process during a nonstationary episode provides a gain of 2 identification degrees. Remark 2: If the Markov structure is recursive, that is, if matrix P is upper triangular, the under-identification degree becomes 3-3=0, and the parameter is generically identifiable. Proposition 3 shows that we can expect to identify the parameter of interest if we either consider a) a homogeneous Markov and constrain the parameters as illustrated in Remark 2 by an example of the recursive system,, or b) a non-homogeneous Markov discussed in the next subsection. Let us now consider an epidemiological model with J = 3 states to facilitate the comparison with Section 4.2. The states are: 1=S for susceptible, 2=I for infectious (individuals stay infectious, even if they recover), 3=D for deceased. The rows of the transition matrix are the following: (1−p 13 )logist(a 1 +a 2 p 2 (t−1)); p 13 row 2 = I : 0; 1 − p 23 ; p 23 ] is the logistic function, i.e. the inverse of the logit function. We obtain a triangular transition matrix with state D as an absorbing state. The contagion effect is characterized by parameter a 2 and follows a nonlinear logistic function. We also expect that mortality rate p 23 is strictly larger than mortality rate p 13 . There are 4 independent parameters in θ = [a 1 , a 2 , p 13 , p 23 ] . Proposition 4: The SID model given above is generically identifiable. Parameter θ is over-identified with an over-identification order equal to 5. Proof: The proof is similar to the proof of Proposition 3. The two independent moment conditions are: From the second equation, it follows that p 2 (t − 1) is a linear affine function of p 3 (t) and p 3 (t − 1). Next by substituting into the first equation, we find that the observed p 3 (t) satisfies a nonlinear recursive equation of order 2 of the type: If T is sufficiently large, this nonlinear observed dynamics allows us to identify 9 nonlinear functions of parameter θ. Thus parameter θ is identifiable with an over-identification order equal to 5. Q.E.D. From Remark 2, we expected that the triangular form of the transition matrix alone would facilitate the identification. However, the order of over-identification reveals the additional role of the contagion effect. The nonlinear dynamics induced by the logistic transformation also facilitates identification. This section illustrates the estimation approach and its performance in an epidemiological model. We consider a model with 5 states: 1=S, 2=IU, 3=ID, 4=R ,5=D, and the following rows of the transition matrix: where the π 1jt , j = 1, 2, 3 sum up to 1, and are proportional to: 1984) ]. The propagation parameters allow for different impacts of p 2 (t − 1) and p 3 (t − 1), since the detected individuals are expected to be self-isolated more often. There is no contagion effect from the recovered R, who are assumed no longer infectious 5 . The structure of zeros in the transition matrix indicates that one cannot recover without being infected, one cannot be infected twice (as expected for COVID 19) and death is considered as an absorbing state. This is a parametric model with 6+7=13 parameters, that are the 6 parameters a l , b l , c l , l = 1, 2 and 7 independent transition probabilities. Among the 5 series of frequencies f j (t) j = 1, .., 5 that sum up to 1 at each date, f 3 (t) and f 5 (t) of infected detected and of deceased, respectively, are assumed to be observed. The frequencies f 2 (t) and f 4 (t) are unobserved and will be considered as additional quantities of interest to be estimated jointly. They are crucial for a model-based reconstituting of counts of infected undetected and of recovered immunized individuals. As illustrated in Section 4.3, the triangular form of the transition matrix and the nonlinear doubly logistic contagion model provide generic identification. The model above can be used for simulating the Covid-19 propagation for given values of parameter θ and initial value p(1). These values are set as follows: The daily mortality rates are: We assume, there are about 3 times more transition to IU than to ID,i.e. Rate p 23 is fixed equal to p 12 = 1e − 06, which is the detection rate exp(a 2 ). Coefficient a 2 is strictly positive. This means that there can exist exogenous sources of infections for the population of interest, either from animals to humans, or more importantly from humans of another population to humans in the population of interest, due to tourism and migration. We consider an open economy from the epidemiological point of view 6 . We do not account for the increasing effort of performing daily tests during the epidemic (its outcome has not been substantial in France during this period, due to the pressure on the market of test components 7 ). Next, the parameters corresponding to the diagonal transition probabilities are computed from the unit mass restriction on each row. All probabilities of transitions out of the diagonal are all very small as a consequence of the daily frequency of our data. The initial marginal probabilities are set equal to: p 0 (t) = (1, 0, 0, 0, 0), that corresponds to an initial population with no prior infection from the coronavirus in this population. Thus, the first cluster will be due to travellers entering the country. Two types of dynamic analysis can be performed, depending whether the sequence of The two Infected states IU and ID are flows, as they are observed between the times of entry in, and exit from the state of infection. Moreover, the probability of exiting after 20 days is very close to 1. We usually expect a "phase" transition effect: For small t, these counts increase quickly as new infected individuals are cumulated without a sufficiently high number of exits to compensate for the arrivals. This explain an increase of the curves at the beginning of the period. After that initial period, the counts of exits tend to grow and offset the new arrivals so that the curves tend to flatten. More precisely, they continue to increase, due to the propagation effect, but at a very low rate. This is the so-called flattening of the curve. This theoretical evolution depends on the choice of parameter values, especially the propagation parameters. Given the selected parameter values that allow for exogenous sources of infection, the initial convex pattern in the counts of infected is not visible. Only the concave part of the curve, up to its flat part, is observed. One can perform similar dynamic analysis for other credible scenarios, which is called the sensitivity analysis. The Figures given above have been simulated with time independent propagation parameters. A self-isolation measure introduced at some point, would have changed subsequent evolution. There is a first a tendency to reach a flat part on the curve without self-isolation, and then to reach a lower flat part on the curve with self-isolation measures. Therefore over a longer period, the first flat part can appear as a smoothed peak. If selfisolation measures are lifted afterwards, a second peak of infections is expected, and so on, resulting in a sequence of stop and go [Ferguson et al.(2020) ]. This section presents the estimation of the extended SIRD model from data on Covid 19 propagation in France over the period of 22 days between 03/16 to 04/06, 2020. However we focus on data that are reliable for estimation purpose. The fully observed states are the states ID and D. State ID is assumed equivalent to hospitalization, as 8 The effect of Covid-19 on the total mortality rate is unclear. There is a negative effect of the virus, but also positive effect due to better protection against other viruses such as influenza, and reduced number of car accidents. 9 It is not the case for countries where the propagation is too recent, or isolation decided too late, or data unreliable at the beginning of the episode (Wuhan), or the isolation period too short (Denmark) or introduced in successive steps, or isolation decisions being very different depending on the regions of (Germany and the US). the counts of ("confirmed") detected, which are publicly available, are measured with error. This is due to the counts of detected individuals being derived from test results, while not all tests results may have been recorded, some people could have been tested multiple times, inflating the counts, or people have not been tested at random or without an adequate exogenous stratification, which creates selectivity bias 10 . In contrast, the hospitalization data are reliable and regularly updated. State D is assumed observed through total death counts. It includes death from Covid-19, which are reported online as D/H, i.e. death after hospitalization and known to underestimate the true D/H values true number of deaths due tu coronavirus, as these do not include all deaths from COVID-19 at home, or in long term care homes. The series to be estimated are infected undetected IU and recovered R. We use the available series of ("confirmed") detected and of recovered after hospitalization for comparison with the estimates. More specifically, we use the French data on the total daily number of deaths from the French National Statistical Institute INSEE( 2020) and the daily data on coronavirus pandemic from Sante Publique France (2020) reported at https://dashboard.covid19.data.gouv.fr/ and https://www.linternaute.com/actualite/guide-vie-quotidienne/2489651-covid-19-en-franceles-dernieres-statistiques-au-06-avril-2020/ 11 . The daily evolutions of total counts of hospitalized, detected, recovered and deceased individuals reported in those sources are displayed in Figure 3 . The panels display the series of "hospitalized", "confirmed" (i.e. detected), "returned from hospital" (i.e. recovered after hospitalization) in the top row and left bottom panels, respectively. In the bottom right panel, the dynamics of counts of total deceased (solid line) and deceased due to Covid-19 (dashed line) are distinguished. The estimation of the model introduced in Section 5.1 produced the following results. The estimated coefficients are a 1 = −8.6517, a 2 = −11.1481, b 1 = 0.0034, b 2 = 2.499e − 05, c 1 = 8.482e − 05, c 2 = 0.00028. The estimated coefficient of mortality rate p 15 is 3.1575e-05, which is close to the mortality rate in France of 3e-05 = 0.03/1000, used in the simulation study in Section 5.2. The estimated transition matrix is given below in Table 1: Table 1 The estimated counts exceed those reported by the sources. In particular, the observed and estimated counts on April 06, 2020, which is the last day of sample are as follows: The final observed count of ("confirmed") detected is equal to 78167 and is 1.2 times smaller that the estimated final count of infected and undetected (IU) equal to 94461. The observed final count of Recovered (after being hospitalized) equal to 17250 is 6.24 times smaller than the estimated final count of Recovered equal to 107640. Let us now present a scenario of a projected evolution, based on the estimated coefficients values and probabilities. Figure Figure 4 ]. iii) The treatment of missing data can likely be improved by introducing additional explanatory variables that are expected to impact the virus propagation. This approach is followed in Hortacsu et al. (2020) who use hospitalization data from various regions and interregional transportation data to forecast infection rates. iv) Other specifications of the propagation functions π t can also be considered and compared [see Wu et al. (2020) ]. Instead of characterizing the individual histories by the qualitative sequences Y it , a sequence of J-dimensional vectors Z it can be alternatively considered, where component j is the 0-1 indicator of Y it = j. Then we have: where P (t − 1) denotes the transition matrix from date t-1 to date t. By the iterated expectations theorem, we get: Let us now consider the covariance: (by the iterated expectation and using E(Z t ) = p(t)) (by taking into account the 0-1 components of Z) This is the expression of the autocovariance as a function of the p(t)'s and model parameters. p 2 (t) = p 12 + (p 22 − p 12 )p 2 (t − 1) + (p 32 − p 12 )p 3 (t − 1) and substitute into this equation the expression of p 2 (t) derived in part i). We get: It follows that: a(P ) = p 12 (p 23 − p 13 ) + p 13 (1 − p 22 + p 12 ), b(P ) = p 22 − p 12 + p 33 − p 13 , c(P ) = (p 22 − p 12 )(p 13 − p 33 ) + (p 23 − p 13 )(p 32 − p 12 ). To get a recursive equation of order 2, we need the second condition: condition 2: c(P ) = 0 To identify functions a, b, c from the observed p 3 (t), we need: condition 3: The matrix 3×(T −2) with columns (1, ..., 1) , (p 3 (T −1), p 3 (T −2), .., p 3 (2)) and (p 3 (T − 2), p 3 (T − 3), ..., p 3 (1)) is of full column rank. This implies, in particular, the order condition: T ≥ 5 in Proposition 3. The following condition 4 is needed for computing the exact under-identification order of P from functions a, b, c. condition 4: By taking into account the unit mass restrictions on the rows of P ,the Jacobian of (a, b, c) has rank 3. Note that condition 4 implies condition 2. Some Discrete-Time SI, SIR and SIS Epidemic Models Lippi (2020) :A Simple Planning Problem for COVID-19 Mathematical Models in Population Biology and Epidemiology Voluntary and Mandatory Social Distancy: Evidence on COVID 19 Exposure Rates from Chinese and Selected Countries Multinomial Goodness-of-Fit Tests Robust Extended Kalman Filtering Estimating the Number of Infections and the Impact of Non-Pharmaceutical Interventions on Covid-19 in 11 European Countries,Imperial College London Competition Between Growths Governed by Bernoulli Percolation Estimating Equations in the Presence of a Nuisance Parameter The Number of Individuals in a Cascade Process Analysis of Virus Propagation: Anatomy of Stochastic Epidemiological Models Percolation Processes: Lower Bounds for the Critical Probability The Mathematics of Infectious Diseases Estimating the Fraction of Unreported Infections in Epidemics with a Known Epicenter: An Application to COVID-19 A New Extension of the Kalman Filter to Nonlinear Systems,11th Int.Symp. on Aerospace/Defence,Sensing,Simulation and Controls INSEE A Contribution to the Mathematical Theory of Epidemics The convergence of the EKF Substantial Undocumented Infection Facilitates the Rapid Dissemination of Novel Coronavirus(SARs-CoV2) Molinari(2020) Estimating the COVID-19 Infection Rate:Anatomy of an Inference Problem Econometric Analysis of Qualitative Response Models Applications of Mathematics to Medical Problems Information Recovery in a Dynamic Statistical Markov Model Estimation of Time Varying Markov Processes with Aggregate Data Estimation of the Asymptomatic Ratio of Novel Coronavirus Infection(COVID-19) Filtering via Simulation: Auxiliary Particle Filters Donnees Hospitalieres Relatives a l'Epidemie Covid-19 Marginalized Particle Filters for Mixed Linear/Nonlinear State Space Models The Extended Kalman Filter as a Local Asymptotic Observer Estimates of the Severity of Coronavirus Disease 2019;A Model Based Analysis An Introduction to Infectious Disease Modelling Estimation of the transmission Risk of the 2019-nCoV and its Implication for Public Health Interventions The Unscented Kalman Filter for Nonlinear Estimation Generalized Logistic Growth Modelling of the Covid 19 Outbreak in 29 Provinces in China and the Rest of the World They are easily derived, given that the optimization in Proposition 2 is deterministic. Therefore, estimatorsp(1),p(2), ...,p(T ),θ are deterministic functions of observationŝ A t = Af (t), t = 1, .., T . If the transition matrix is twice continuously differentiable with respect to p(t − 1) and θ in a neighbourhood of the true values, these deterministic functions are continuously differentiable. Then, by using the asymptotic normality of f (t)'s (Proposition 1), we can apply the delta method to deduce the 1/ √ N rate of convergence of the estimators and their asymptotic variance-covariance matrix from the one of theWhen the number of observation dates and of missing counts is too large, the use of the delta method can be numerically cumbersome. It can be replaced by a bootstrap method (for which the regularity conditions of validity are satisfied in our framework, or by the approximated accuracies provided by an EKF, or UKF algorithm.Appendix 3 This appendix derives the equations used in the proof of Proposition 3. It provides the closed form expressions of functions a(P ), b(P ), c(P ), and outlines conditions 1 to 4 for the validity of Proposition 3.i) Let us first solve the second equation of system (4.3). We get:if the following condition is satisfied:condition 1: p 23 is different of p 13 .ii) Next, let us consider the first equation of system (4.3):