title: Under-reporting of COVID-19 in the Northern Health Authority region of British Columbia
authors: Parker, Matthew R. P.; Li, Yangming; Elliott, Lloyd T.; Ma, Junling; Cowen, Laura L. E.
date: 2021-11-01
journal: Can J Stat
DOI: 10.1002/cjs.11664

Asymptomatic and pauci-symptomatic presentations of COVID-19, along with restrictive testing protocols, result in undetected COVID-19 cases. Estimating undetected cases is crucial to understanding the true severity of the outbreak. We introduce a new hierarchical disease dynamics model based on the N-mixtures hidden population framework. The new models make use of three sets of disease count data per region: reported cases, recoveries and deaths. Treating the first two as under-counted through binomial thinning, we model the true population state at each time point by partitioning the diseased population into the active, recovered and died categories. Both domestic spread and imported cases are considered. These models are applied to estimate the level of under-reporting of COVID-19 in the Northern Health Authority region of British Columbia, Canada, during 30 weeks of the provincial recovery plan. Parameter covariates are easily implemented and used to improve model estimates. We compare two distinct methods of model-fitting for this case study: (1) maximum likelihood estimation, and (2) Bayesian Markov chain Monte Carlo. The two methods agreed exactly in their estimates of under-reporting rate. When accounting for changes in weekly testing volumes, we found under-reporting rates varying from 60.2% to 84.2%.

The ongoing COVID-19 pandemic has already led to around 147,000 confirmed cases and 1,700 deaths [Correction added on 11 November 2021, after first online publication on 1 November 2021: "17,000 deaths" was changed to "1,700 deaths"] in British Columbia (BC), and around 3.9 million deaths worldwide, as of the end of June 2021. Disease analytics allow us to estimate the extent of unreported cases. Case counts are available from many government sources. In British Columbia, Canada, the BC Centre for Disease Control releases periodic surveillance reports, which often include COVID-19 case counts for each of five Health Authority regions. However, cases of COVID-19 can go unreported because of controllable factors such as refusal to test or low volumes of virus testing, as well as uncontrollable factors such as asymptomatic or pauci-symptomatic cases, incorrect self-diagnosis or failure to disclose. This poses problems for disease control. For example, it leads to under-reported case counts and thus an underestimate of the severity of the pandemic. Undetected cases also drive community transmission, reducing the effectiveness of contact tracing, quarantine and isolation. In this article, we aim to estimate the true size of an epidemic, given the observed case counts and outcomes.

There is substantial evidence for the existence of unreported cases of COVID-19. For example, Buitrago-Garcia et al. (2020) performed a systematic review and found a 95% confidence interval of 17%-25% for the proportion of truly asymptomatic cases. Several seroprevalence studies have also been conducted to estimate the proportion of unreported cases (Song et al., 2020; Bendavid et al., 2021; Saeed et al., 2021). In particular, Skowronski et al.
(2020) found that in May 2020 the number of undetected cases in British Columbia was between 2.25 and 20.5 times greater (a 95% confidence interval) than the reported number of cases. Low ascertainment rates could potentially have been improved using more effective testing strategies (Lawless & Yan, 2021). However, the number of tests required to leave only a small percentage of cases unreported would be prohibitive.

Several inter-related models have been developed in recent years to address the problem of estimating abundance in the presence of under-reporting using only case-count data. For example, the INAR (integer autoregressive) hidden population model of Fernández-Fontelo et al. (2016) was used to estimate weekly cases of human papillomavirus in Girona. More recently, Moriña et al. (2021) used a Bayesian hierarchical model to estimate the under-reporting rates of COVID-19 in Spain and found that their results matched seroprevalence data (Spanish Ministry of Health, 2020). Fernández-Fontelo et al. (2020) developed a hidden INAR(1) model designed to analyze the COVID-19 pandemic using only COVID-19 case counts as observed data to inform model-fitting. Their model was used to estimate the under-reporting rates of COVID-19 in several small regions of Spain, and they used susceptible-infectious-removed (SIR) modelling to account for population dynamics. The underlying process was X_t = α ∘ X_{t−1} + W_t(a_t), where X_t is the actual number of new cases at time t, α is the binomial thinning probability, a_t is the number of new active cases modelled using the SIR model, and W_t is a Poisson process.

The N-mixture model is a hierarchical model that can be seen as a series of related models ordered by their conditional probability structure (see Kéry & Royle, 2015, Section 2.3). Often, these models are fitted to data using maximum likelihood estimation (MLE). The N-mixture models can also be viewed from the Bayesian modelling perspective as hierarchical Bayesian models. Kéry & Royle (2015) used a Bayesian N-mixture approach to analyze Swiss Great Tit data. They found that the posterior means from a Bayesian analysis (using vague priors) agreed well numerically with the ML estimates. Hidden INAR models can be viewed as a special case of N-mixture models with a single site, together with an additional mixture distribution for detection that allows for occasional perfect detection. N-mixture models have been used extensively since their inception (Royle, 2004) and have been extended to allow for open population dynamics (Dail & Madsen, 2011). The application of the open N-mixture modelling framework to disease analytics has been discussed by DiRenzo et al. (2019). Alternative methods such as capture-recapture (see, e.g., Xu et al., 2014; van Dam-Bates, Fyfe & Cowen, 2016) can produce more precise estimates of detection probability but require more extensive data-gathering (including unique identifiers for tracking, such as personal health numbers), which is not possible during the early stages of a public health crisis such as the COVID-19 pandemic.

Unlike classical MLE, Bayesian methodology can remove the task of integrating the latent abundance variables (N_t, for t ∈ {1, 2, …, T}) out of the model likelihood, avoiding the computational complexity associated with maximizing the likelihood function. Comparisons between Bayesian and ML N-mixture estimates were made by Toribio, Gray & Liang (2012); however, these comparisons were done only for closed population models.
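As a point of reference, the closed-population N-mixture data-generating process just described can be sketched in a few lines of R; the parameter values and dimensions below are illustrative only and are not taken from any analysis in this article.

# Closed-population N-mixture generative process (Royle, 2004):
# latent abundance per site, thinned by imperfect detection.
set.seed(1)
n_sites  <- 5      # number of sites (illustrative)
n_visits <- 4      # repeated counts per site (illustrative)
lambda   <- 20     # mean site abundance (illustrative)
p        <- 0.3    # detection probability (illustrative)

N <- rpois(n_sites, lambda)   # latent abundance N_i ~ Poisson(lambda)
n <- sapply(seq_len(n_visits),
            function(t) rbinom(n_sites, size = N, prob = p))
# n is an n_sites x n_visits matrix of under-counted observations n_it

Only the counts n are observed; the abundance N is latent, which is what makes both the hidden INAR and N-mixture frameworks "hidden population" models.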
In this article, we compare MLE and Bayesian Markov chain Monte Carlo (MCMC) model-fitting approaches for an open population model. We propose a novel model to estimate the levels of under-reporting in regions affected by COVID-19. The model is built on the open population N-mixtures framework (Royle, 2004; Dail & Madsen, 2011) as well as the hidden INAR framework (Fernández-Fontelo et al., 2016), with the population dynamics modified to allow for domestic spreading of the virus as well as importation of new cases from other regions. We also incorporate a multinomial component in the models to account for active cases, deaths and recoveries. These models are ideally suited to estimating detection rates when limited data are available. We applied the new model to data from the Northern Health Authority region of BC and improved the model by incorporating parameter covariates. We found in our applications that the MCMC and MLE approaches gave comparable results for estimating the probability of detection.

The classical N-mixture model developed by Royle (2004) allows estimation of the population abundance N (which is a latent variable in the model) using under-counted observations n, which are conditional on N through a detection thinning process. The abundance N_it at site i, time t, is modelled as N_it ∼ Poisson(λ). A detection thinning process is used to generate the observed counts n_it, which are modelled as n_it ∼ Binomial(N_it, p). Here, λ is the initial mean site abundance, and p is the detection probability. The model extensions of Dail & Madsen (2011) allow for population dynamics, removing the closed population assumption of the original N-mixture model. The standard dynamics assumption is N_{it+1} = S_it + G_it, where S_it models population survival and G_it models new population gains (immigration) between t and t + 1. The form of the N-mixture model affords many distributional choices, making it flexible for studying different sorts of populations.

To develop a disease analytics version of the N-mixture model, several modifications are necessary. We consider only the single-site case, and thus we drop the site subscript i; however, we note that under a conditionally-independent-sites assumption, it is easy to extend these models to multiple sites. We let T denote the number of sampling occasions: t ∈ {1, 2, …, T}. We make several additions and modifications to the open population N-mixture models. We specify the model in Equation (1), with the variables defined below, and refer to this specification as we detail the model throughout the remainder of this section.

Initial Abundance: N_1 ∼ Poisson(λ)
State Process: (A_t, R_t, D_t) ∼ Multinomial(N_t, (p_a, p_r, p_d)), for 1 ≤ t ≤ T − 1
Observed Active Cases: a_t = n_t + a_{t−1} − r_{t−1} − D_{t−1}, with a_0 = r_0 = D_0 = 0
Domestic Spread: S_t ∼ Poisson(ω N_{t−1}), for t > 1
Imported Cases: G_t ∼ Poisson(γ), for t > 1
Abundance Updates: N_t = A_{t−1} + S_t + G_t, for t > 1
Observation Process: n_t ∼ Binomial(N_t − (a_{t−1} − r_{t−1} − D_{t−1}), p), for 1 ≤ t ≤ T, and (a_t − r_t − D_t, r_t, D_t) ∼ Multinomial(a_t, (p_a, p_r, p_d)), for 1 ≤ t ≤ T − 1.     (1)

Similar to standard N-mixture models, we make the assumption that the initial (start of study) number of active cases is the unknown random variable N_1. We use a Poisson distribution with mean λ to model this initial population size (Eq. 1: Initial Abundance). This can be thought of as the random state of the dynamic system at the start of data collection. To model population changes over time, we consider the three possible future states of any individual from time t to time t + 1. An individual who is infected at time t can remain an active case, recover, or die by time t + 1. Thus we partition the total infected individuals N_t at time t into three categories relative to t + 1: cases who will remain active, A_t; cases who will recover, R_t; and cases who will die, D_t.
The partitioning is done using a multinomial distribution, with probability of mortality p_d, probability of recovery p_r, and probability of remaining an active case p_a = 1 − p_d − p_r (Eq. 1: State Process). Using this multinomial model assumes that (i) there are exactly three categories, (ii) all individuals are independent and have the same probabilities for each category, and (iii) the probabilities are constant over time. Assumption (i) seems reasonable (and if another category were to be considered, it could be added to the model without difficulty). Assumption (ii) is a large simplification, since many factors will influence an individual's probability of recovery and death (such as age, genetics, access to health care, etc.). This simplification is necessary owing to a lack of individual-level information, and could be alleviated somewhat using, for example, supplementary demographic data such as age along with stratified modelling techniques. Finally, assumption (iii) may be true over short periods; however, the probabilities of recovery and death may change over the course of a pandemic because of such factors as improved understanding of the disease and increased access to treatment. Fortunately, assumption (iii) can be relaxed by including time-varying parameter covariates for p_r and p_d. We note that because of the relationship p_a = 1 − p_d − p_r, including a covariate for p_d and not for p_r would imply that any change in p_d is reflected entirely in p_a. So in normal use cases, it would make sense to include the same covariate for p_r as for p_d, allowing some of the deaths to shift into the recovered category rather than into the active category. The parameter p_d could change over time owing to factors such as local hospitals reaching their capacity. We note that for our Northern Health Authority case study, we do not expect assumption (iii) to be a significant factor.

We have established a partitioning of the active cases into categories allowing for deaths and recoveries. We now consider three pathways for producing active cases over time. First, we have from our partitioning the set, A_t, of cases that will remain active from time t to time t + 1. Second, we consider simple importation of cases, G_t (Eq. 1: Imported Cases). We consider importation of cases to be caused by the immigration of external active cases from other regions. We model G_t as a Poisson random variable with mean γ. Thus γ is defined as the average number of new imported active cases arriving between time t and time t + 1. Third, we consider domestic spread via community contact, S_t (Eq. 1: Domestic Spread). Since the rate of domestic spread should be proportional to the current number of active cases (hence exponential growth in active cases is possible), we use the number of active cases at t − 1 to moderate the mean number of new infections due to domestic spread. We model S_t using a Poisson distribution with expected value ω N_{t−1}. Thus ω is defined as the average number of new infections per active case during one sampling period. To determine the number of active cases at time t (for t > 1), we sum the three sources of active cases: N_t = A_{t−1} + S_t + G_t (Eq. 1: Abundance Updates).

The observation process in Equation (1) has two components: a binomial thinning and a multinomial partitioning.
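Before turning to the observation layer, the latent dynamics just described can be made concrete with a short simulation. This is a sketch under assumed parameter values; λ, γ, ω, p_d and p_r below are illustrative, not estimates from this article.

# Sketch of the latent state process in Equation (1).
set.seed(1)
T_occ  <- 30        # number of weekly sampling occasions
lambda <- 15        # initial mean abundance (illustrative)
gamma  <- 1         # mean imported cases per week (illustrative)
omega  <- 0.9       # mean new infections per active case per week (illustrative)
p_d    <- 0.005     # probability of death (illustrative)
p_r    <- 0.6       # probability of recovery (illustrative)
p_a    <- 1 - p_d - p_r

N <- A <- R <- D <- numeric(T_occ)
N[1] <- rpois(1, lambda)                                      # Initial Abundance
for (t in 1:(T_occ - 1)) {
  part <- rmultinom(1, size = N[t], prob = c(p_a, p_r, p_d))  # State Process
  A[t] <- part[1]; R[t] <- part[2]; D[t] <- part[3]
  S <- rpois(1, omega * N[t])                                 # Domestic Spread
  G <- rpois(1, gamma)                                        # Imported Cases
  N[t + 1] <- A[t] + S + G                                    # Abundance Updates
}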
The observed data for this model are the sets of newly observed active cases {n_t} for t ∈ {1, 2, …, T}, newly recovered cases from among previously observed active cases {r_t} for t ∈ {1, 2, …, T − 1}, and newly deceased cases {D_t} for t ∈ {1, 2, …, T − 1}. The term "newly observed" means observed since the previous sampling occasion, so that in our case study n_{t+1} represents the total cases reported during week t + 1 that were observed after the cases reported during week t. The newly observed active cases n_t are assumed to be under-counted. We use binomial thinning to model the under-counting; however, unlike with N-mixtures, we need to subtract the currently active, previously observed cases a_{t−1} − r_{t−1} − D_{t−1} from the total N_t prior to thinning. This is because all active cases that have been observed are tracked through time until they either recover or die; while they remain active, they cannot be "re-observed." We calculate the observed active cases at time t > 0 as a_t = n_t + a_{t−1} − r_{t−1} − D_{t−1}, with a_0 = 0; a_t can be understood as "new active cases at time t" plus "previous active cases at time t − 1" minus "previous active cases which are no longer active at time t." The quantities a_t and A_t are subtly different: A_t counts the total cases remaining active from time t to time t + 1, while a_t counts the observed cases which are currently known to be active at time t.

The observed active cases a_t are partitioned using a multinomial similar to the one used to partition N_t. The primary difference is that all three categories are fully observed: r_t are the currently active observed cases who will recover between t and t + 1; D_t are the observed deaths between t and t + 1; and a_t − r_t − D_t are the observed active cases which remain active from t to t + 1. We use the same three probabilities as in the first multinomial. Thus, r_t is the observed subset of R_t, D_t is fully observed, and a_t − r_t − D_t is the observed subset of A_t. This adds the extra assumption that the observed active cases are equally likely to recover as the unobserved active cases. This is a critical assumption, as it allows identifiability of the unobserved quantities A_t and R_t through the observed data. Relaxing this assumption would require the addition of a known relation between the probability of recovery for observed individuals and that for unobserved individuals.

A plate diagram specifying our model is shown in Figure 1, illustrating the data/model interactions.

FIGURE 1: The data generating process assumed for our COVID-19 model. Top: the population dynamics are illustrated as a plate diagram (Koller & Friedman, 2009). The t = s labels indicate the variables defined at time t = s, while the a < t < b labels indicate processes which occur between observation time a and observation time b. Bottom left: the model detection mechanism, along with the recursive definition for the calculated quantity a_t. Bottom right: the shift in time for recoveries and deaths between the reported data used in practice and the model data matrix definition. Recoveries and deaths associated with active cases at time t are reported at the next reporting period, t + 1, rather than at time t (since they occur between reporting periods).

The model has six estimable parameters: λ, γ, ω, p, p_d and p_r.
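Continuing the simulation sketch above, the observation layer of Equation (1) can be generated as follows. The detection probability is again illustrative, and the small guard on the binomial size mirrors the rejection of inconsistent simulated data described later for the posterior predictive checks.

# Sketch of the observation process in Equation (1), continuing the
# state-process simulation above (N, p_a, p_r, p_d, T_occ already defined).
p <- 0.3                                   # detection probability (illustrative)
n <- r <- D_obs <- a <- numeric(T_occ)
prev_active <- 0                           # a_{t-1} - r_{t-1} - D_{t-1}
for (t in 1:T_occ) {
  # binomial thinning of the not-yet-observed active cases
  n[t] <- rbinom(1, size = max(0, N[t] - prev_active), prob = p)
  a[t] <- n[t] + prev_active               # observed active cases a_t
  if (t < T_occ) {
    # multinomial partition of the observed active cases
    obs <- rmultinom(1, size = a[t], prob = c(p_a, p_r, p_d))
    r[t] <- obs[2]; D_obs[t] <- obs[3]
    prev_active <- a[t] - r[t] - D_obs[t]
  }
}
# (n, r, D_obs) play the roles of the observed data {n_t}, {r_t}, {D_t}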
The original approaches to inference for N-mixture models used MLE (Royle, 2004; Dail & Madsen, 2011), in which the latent variables N_t are removed from the likelihood via integration over states (summation of the likelihood over N_t ∈ {n_t, n_t + 1, n_t + 2, …, K}, for some suitable upper bound K). The computation times involved in this method become problematic for very large K, since the summations dominate the computing time, which has complexity O(K^3). There are several advantages to using MLE: it is not dependent on correctly specified prior distributions (although this can be a disadvantage when reliable prior information exists), and it has excellent large-sample properties, such as consistency, asymptotic normality, and asymptotic efficiency (Bain & Engelhardt, 1992, p. 316). MLE also has the benefit of being deterministic (provided that the optimization method is deterministic, as is the case with the BFGS optimization algorithm; Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970).

We build the likelihood function from the model described in Section 2.1. In classical N-mixture models, integration is done over the possible states of the latent variables N_t. In our model, we have a second set of latent variables R_t, necessitating a second integration over states (summation over R_t ∈ {r_t, r_t + 1, r_t + 2, …, N_t − D_t}). This leads to the likelihood function shown in Equation (2). We note that the likelihood function is written as ℒ for brevity, and should be understood as shorthand for ℒ(λ, γ, ω, p, p_d, p_r | {n_t}, {D_t}, {r_t}, K). This model can also be extended to allow for under-counted deaths, in which case a third integration over states would be necessary, where d_t would be the observed deaths and D_t would be the latent variable for new deaths; in this case, the "Observed Active Cases" recursion from Equation (1) would use d_t in place of D_t. Here, P_{a,b} is the transition probability for transitioning from a active cases to b active cases.

We implemented this model using the R software (R Core Team, 2020). The likelihood function as specified in Equation (2) was programmed in R (archived code is available at http://dx.doi.org/10.5281/zenodo.5502191), and optimization was done using the BFGS optimization algorithm employed by the R function optim. The value of K, the upper bound on the summations, was chosen by optimizing the model likelihood with increasing values of K until convergence of the parameter estimates was observed, at K = 200. Computation time for these models, run on a 4.0 GHz AMD Ryzen 9 3900X with 24 logical processors, varies from several hours to several days depending on the model complexity (number and type of parameter covariates). The likelihood function may have large flat regions, especially when no additional covariates are used to inform model fitting. This is important for the estimability of the correlated parameters γ and ω, which both inform population growth.
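Before turning to the Bayesian implementation, the transition probability P_{a,b} appearing in Equation (2) can be evaluated by conditioning on how many of the a active cases remain active. The helper below is a sketch rather than the archived likelihood code; the parameter values in the example call are illustrative.

# P(N_t = b | N_{t-1} = a): marginally, the number of cases remaining active
# is Binomial(a, p_a), and the new cases S_t + G_t are Poisson(omega * a + gamma).
transition_prob <- function(a, b, p_a, omega, gamma) {
  c_vals <- 0:min(a, b)                       # cases that remain active
  sum(dbinom(c_vals, size = a, prob = p_a) *
        dpois(b - c_vals, lambda = omega * a + gamma))
}
transition_prob(a = 10, b = 12, p_a = 0.4, omega = 0.9, gamma = 1)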
The Bayesian approach to parameter estimation of our COVID-19 model differs from the standard MLE approach beyond the inclusion of priors. Specifically, summations over the latent variables are not required for parameter updates; instead, the values of the latent variables are re-sampled. As well, the transition probability calculation P_{a,b} from Equation (2) is replaced with the distributions for S_t and G_t. These changes necessitate the use of stochastic methods such as MCMC for parameter estimation.

If we let π(θ) be the prior distribution for a parameter θ, then the full joint distribution over the data and parameters is given in Equation (3), where the abundance updates are still given by N_t = A_{t−1} + S_t + G_t. Details regarding the dependence between the variables in these models are described in Figure 1. The Bayesian models were implemented using JAGS (Plummer, 2003) through the rjags package (Plummer, 2019) using the R software (R Core Team, 2020). Computation for these models took several hours when fitting the base model. The full joint distribution is given by the following:

π(λ) π(γ) π(ω) π(p) π(p_d) π(p_r) × Poisson(N_1; λ) × Π_{t=1}^{T−1} Multinomial(A_t, R_t, D_t; N_t, (p_a, p_r, p_d)) × Π_{t=2}^{T} Poisson(G_t; γ) Poisson(S_t; ω N_{t−1}) × Π_{t=1}^{T} Binomial(n_t; N_t − (a_{t−1} − r_{t−1} − D_{t−1}), p) × Π_{t=1}^{T−1} Multinomial(a_t − r_t − D_t, r_t, D_t; a_t, (p_a, p_r, p_d)).     (3)

Posterior predictive checking is useful for finding discrepancies between the observed data and the data that can be described by the fitted model. It is a popular Bayesian model-checking approach used by ecologists (see, e.g., Kéry & Schaub, 2011; Gelman et al., 2013). In this paradigm, model fitness is checked by simulating data generated under a fitted model and comparing the simulated data with the observed data. If the observed data are substantially different from the simulated data, then the model must be unable to effectively describe the observed data. Posterior predictive checks can be conducted by examining histograms of the simulated data and comparing them with the observed data, or by computing posterior predictive P-values (Gelman et al., 2013). Since our model assumes fully observed D_t, we generated the simulated observations of D_t using the state process of Equation (1), and used a binomial distribution rather than a multinomial to generate r_t; this was necessary to avoid dual specification of D_t. As well, we require a_t to be non-negative, and we accomplish this by rejecting simulated data for which a_t is negative.

To run the Bayesian models, the adaptation period was 1.4 million iterations. We discarded 1.2 million initial samples as the burn-in, and ran 200,000 additional iterations to obtain the posterior estimates. In order to avoid issues with parameter auto-correlation, we used thinning to keep 1 out of every 200 iterations.
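To make these sampler settings concrete, the corresponding rjags calls might look roughly as follows. The model file name, the data list and the number of chains are placeholders rather than details taken from the archived analysis code.

library(rjags)

# Placeholder data list; n, r and D stand for the observed weekly count vectors.
jags_data <- list(n = n, r = r, D = D, n_occasions = length(n))

m <- jags.model("covid_model.txt",            # placeholder model file
                data = jags_data,
                n.chains = 3,                 # number of chains (assumed)
                n.adapt = 1.4e6)              # adaptation period
update(m, n.iter = 1.2e6)                     # burn-in
post <- coda.samples(m,
                     variable.names = c("lambda", "gamma", "omega",
                                        "p", "p_d", "p_r"),
                     n.iter = 2e5,
                     thin = 200)              # keep 1 of every 200 draws
summary(post)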
The Northern Health Authority region of British Columbia, Canada, was chosen for this preliminary study because of its relatively small number of cases, allowing for faster model-fitting. Figure 2 shows the three sets of observed data, {n_t}, {r_t} and {D_t}, for t ∈ {1, 2, …, 30}. We gathered publicly available data from the BC CDC Surveillance Reports (BC Centre for Disease Control, 2020). For the 30 weeks starting 26 March 2020 and ending 15 October 2020, we used weekly counts aggregated on Thursdays. The 30-week time period was chosen because the data definitions in the public reports were relatively stable over this period, making the data comparable between weeks. We note that the reporting period shifted from Thursdays to Fridays after the 15 October 2020 reporting date. The start dates for each phase of the provincial recovery plan are shown in Table 1. The end of each phase of the provincial recovery plan brought changes or reductions in the provincial protective measures (such as the reopening of local businesses in Phase 2), which led to inhomogeneity in the dynamics parameters. To account for this inhomogeneity, we fitted the parameters dependent on time-varying covariates using link functions. A summary of the public health measures taken for each phase is shown in Table 1.

Additional data used as covariates include the number of new COVID-19 tests administered per week (Province of British Columbia, 2020), which is a strong indicator of the detection rate p, and Google regional mobility data (Google LLC, 2020), which is a potential indicator of the domestic spread ω. The phase boundaries were used to demarcate categorical covariates for each phase.

Publicly available data are often intensely aggregated, poorly reported (Barone, 2020), or susceptible to technical issues such as data loss (Fetzer & Graeber, 2020). In the case of the Surveillance Report data, there are several shortcomings to consider. Reporting dates may be delayed from the date of detection, or from the date of infection. Reporting delays are also not necessarily the same between case counts, recovery counts and death counts. The number of observed cases is dependent on the testing methodology (number of tests administered, effectiveness of tests, etc.), and this testing methodology can change over time. The data also suffer from ad hoc reporting times, so that counts are not always available for each day, and some days contain lump-sum data dumps. Owing to these particular data issues, we chose to use weekly aggregated counts rather than daily counts. These data limitations are concerning, and more reliable data could improve the accuracy of our results.

To compare the MLE and MCMC methods, we fitted base models with no covariates. Both the MLE and the MCMC base models were run on Compute Canada's WestGrid. The fitted model parameters are shown in Table 2. For both methods, p_d was estimated to be essentially zero, as the Northern Health Authority region of BC had no deaths until Phase 3b. Several parameter estimates are dissimilar between the MLE and MCMC methods. The estimates for γ and ω are significantly different, with no overlap in their uncertainty intervals. This is likely due to the similarity between models with small population growth (when both γ and ω are small, this may cause a non-identifiability issue for these parameters). However, the Bayesian estimate of the detection probability, p̂ = 0.30, is identical to the classical N-mixture estimate, p̂ = 0.30. Parameter errors are estimated using the estimated Hessian matrix for the MLE method, and using posterior credible intervals for the MCMC method.

The Bayesian approach requires specification of prior distributions for each parameter in the model. The prior distributions we used are summarized in Equation (4). For the parameter λ, we use a gamma distribution with mean 15. For the parameter γ, we chose a uniform distribution with lower bound 0 and upper bound 30. For the detection probability parameter p and the probability of death p_d, we used the Uniform(0, 1) distribution to place equal probability on all possible values. Since p_r and p_d are dependent, we use 1 − p_d as the upper bound for the uniform distribution of p_r. We chose a weakly informative prior for ω, a uniform distribution with lower bound 0 and upper bound 5.

Prior Distributions: λ ∼ Gamma(mean 15), γ ∼ Uniform(0, 30), ω ∼ Uniform(0, 5), p ∼ Uniform(0, 1), p_d ∼ Uniform(0, 1), p_r ∼ Uniform(0, 1 − p_d).     (4)

For the Bayesian approach we used the mean as the test quantity for posterior checking (see Figure 3). The red lines indicate the observed means, which are close to the middle of the simulated distributions, showing a good fit. The posterior predictive P-values for observed cases, recovered cases, and deaths are 0.37, 0.42 and 0.42, respectively. Each of the P-values is close to 0.5, which also indicates that the model fits well.
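For concreteness, the priors in Equation (4) can be written in JAGS syntax as below. The gamma shape and rate are back-calculated assuming a prior variance of 200 (the value used in the sensitivity analysis that follows), since the base prior's variance is not stated; the snippet is a fragment of a model block, not a complete model, and the node names are illustrative.

# Gamma prior with mean 15: rate = mean / variance, shape = mean * rate.
prior_mean <- 15
prior_var  <- 200                    # assumed; taken from the sensitivity analysis
rate  <- prior_mean / prior_var      # 0.075
shape <- prior_mean * rate           # 1.125

priors <- "
  lambda ~ dgamma(1.125, 0.075)   # mean 15 (shape/rate as computed above)
  gamma  ~ dunif(0, 30)
  omega  ~ dunif(0, 5)
  p      ~ dunif(0, 1)
  p_d    ~ dunif(0, 1)
  p_r    ~ dunif(0, 1 - p_d)
  p_a    <- 1 - p_r - p_d
"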
We performed a sensitivity analysis to determine the impact of the λ and γ prior distributions on the parameter estimates. Since the parameter λ informs the initial population size, it is crucial that its prior not be overly informative; therefore, we performed a sensitivity analysis for λ. We also performed a sensitivity analysis for γ, since the discrepancy between the MLE and MCMC estimates for γ led to the concern that its prior may be overly informative. Both prior distributions were found to have low impact on the parameter estimates. For the parameter λ, we set the mean of the gamma distribution to 5, 10 and 20 in the analysis. We set the variance of the prior distribution to 200 in order to allow large variation in λ. The medians and the corresponding 95% credible intervals for the parameter estimates are shown for each prior distribution in Table 3. For the parameter γ, we used a uniform distribution with minimum 0 and maximum varying from 5 to 30. The medians and the corresponding 95% credible intervals for the parameter estimates are shown for each prior distribution in Table 4. (Table notes: parameters are the initial mean abundance λ, mean imported cases γ, mean domestic spread ω, probability of detection p, probability of mortality p_d, and probability of recovery p_r; the parameter γ was given a uniform prior distribution; posterior medians and 95% credible intervals for the parameter estimates are shown.)

In order to ascertain the ability of the model to provide adequate parameter estimates, we performed a simulation study by setting ground-truth values for the model parameters. We set the parameter values close to the parameter estimates from the Northern Health Authority region Bayesian model, and we set the number of sampling occasions to be the same, i.e., 30. We generated random populations and observations based on the ground-truth parameter values and the model structure shown in Figure 1. The randomly generated observation data included observed cases, recoveries and deaths. We used the generated observations to fit the base model using JAGS (Plummer, 2003) via the rjags package (Plummer, 2019) and the prior distributions in Equation (4). We repeated the process 150 times, calculating the mean of the 150 posterior medians (Table 5). The coverage probability of the 150 credible intervals for the detection probability is close to 0.95, indicating good model performance. For the other parameters, the coverage probabilities are smaller; however, considering the small number of replicates, the coverage is reasonably close to 0.95. The simulation study was conducted on Compute Canada's WestGrid.

We incorporated parameter covariates into our model, and we considered several nested models, starting with a base model with no covariates. MLE was used to fit each of the nested models. Covariates were incorporated into the model using log transforms (to limit the parameter range to (0, ∞)) for the parameters λ, γ and ω, and logit transforms (to limit the parameter range to (0, 1)) for the parameters p, p_d and p_r. For example, consider the model with normalized weekly test volumes V_t as a covariate for the probability of detection p. This would be included in the model using the logit transform logit(p_t) = β_0 + β_1 V_t. In this way, β_0 is the baseline coefficient for p_t, β_1 is the effect on the probability of detection due to the number of tests administered, and p_t is constrained to take values between 0 and 1.
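As a small illustration of the logit-linked covariate just described, placeholder coefficients translate a normalized testing-volume series into weekly detection probabilities, and hence under-reporting rates 1 − p_t. The coefficients and testing volumes below are invented for illustration and are not the fitted values.

# Illustrative only: beta0, beta1 and the testing volumes are placeholders.
set.seed(1)
tests <- rpois(30, 900)                 # placeholder weekly testing volumes
V     <- as.numeric(scale(tests))       # normalized covariate V_t
beta0 <- -1.0                           # placeholder baseline coefficient
beta1 <-  0.5                           # placeholder testing-volume effect

p_t <- plogis(beta0 + beta1 * V)        # inverse logit: p_t in (0, 1)
under_reporting <- 1 - p_t              # weekly under-reporting rate
round(head(cbind(p_t, under_reporting)), 3)

Dividing the observed weekly counts by these detection probabilities gives the Horvitz-Thompson-type abundance estimates used in the case study below.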
We used Akaike's information criterion (AIC; Akaike, 1974) to compare the models (Table 6). We also considered the small-sample version of AIC, the AICc, which selected the same top model as AIC. The nested models were all run on a 4.0 GHz AMD Ryzen 9 3900X with 24 logical processors. We considered several possible sets of parameter covariates, indicated by the following shorthand: mob for Google mobility data as covariates for ω (+6 covariates); vol for testing volume as a covariate for p (+1 covariate); and pha for BC Recovery Plan phase as covariates for ω (+3 covariates). The best fitted model was (pha, vol), which included a total of 10 model parameters. Since the covariates for p and ω are time-varying, these parameters are themselves time-varying. For this reason, the parameter estimates for p_t and ω_t are summarized graphically for each week t in Figures 4 and 5. The remaining parameter estimates are summarized in Table 7 (the parameters shown there are the initial mean abundance λ, mean imported cases γ, probability of mortality p_d, and probability of recovery p_r; the time-varying parameters p and ω are not included). The estimated weekly probability of detection was used to estimate N_t through a Horvitz-Thompson-type estimator, N̂_t = n_t / p̂_t (Figure 6).

FIGURE 4: Estimated probability of detection p̂ from the best fitted model (pha, vol), as chosen by AIC. Top: weekly detection probability, along with weekly 95% confidence intervals. Bottom: the covariate weekly testing volume. Trends in weekly testing volume can be seen to match trends in the probability of detection, since they are related through logit(p_t) = β_0 + β_1 V_t, where V_t is the normalized weekly testing volume and the β_i are the covariate coefficients.

We note that all fitted models had γ̂ ≈ 0, which is on the boundary of the parameter space. This led to 95% confidence intervals of (0, ∞) and an inability to estimate uncertainty for γ.

We developed a novel model for disease analytics, accounting for population dynamics and under-detected cases, by incorporating three datasets that are commonly available publicly during a pandemic. The model also incorporates parameter covariates, which allow for time-varying parameters and parameter change points. We have estimated the level of under-detection of COVID-19 cases in the BC Northern Health Authority region, and found substantial evidence of undetected COVID-19 cases during the first 30 weeks of the pandemic in BC. We have compared two methods of implementing a base model with no covariates, and found that both the MLE and the MCMC approaches are able to describe the population dynamics and estimate levels of under-reporting. We then improved upon the parameter estimates by incorporating several parameter covariates in the MLE implementation. The best fitted model, (pha, vol), as chosen by AIC, has the probability of detection dependent on the volume of weekly tests administered, and the domestic spread dependent on changes in phase of the BC Recovery Plan. Choosing the best fitted model using AIC has the benefit of parsimony: we selected the model that best balances the number of model parameters against the variability explained in the observed data. Because of this parsimony, the AIC approach did not select the full model (mob, pha, vol) over the best fitted model.
This is not surprising, since a likely effect of the different BC Recovery Plan phases is to change population mobility. Thus, some of the information contained in the mobility data is likely already accounted for in the phase data. The estimates of the weekly probability of detection for the model (pha, vol) can be seen to closely follow the trend in weekly administered tests (Figure 4). We can see clearly that under-reporting was lowest at the end of the 30-week period (minimum of 60.2% at week 28) and highest between weeks 3 and 20 (maximum of 84.2% at week 4).

As illustrated in Figure 6, there is a period (weeks 13-17) with zero observed new active cases. This leads to estimates of N̂_t = 0 for that time period. However, it is important to note that it is unlikely that N_t is actually zero; rather, we had insufficient data for that period. This is a period with testing volume insufficient to adequately detect the smaller number of active cases. The bottom plot of Figure 4 shows the low testing volume from week 13 to week 16.

An additional goal of this study was to identify any change in ω over each of the phases of the BC Recovery Plan. We show in Figure 5 the results of estimating ω for the model (pha, vol). The four recovery plan phases are labelled in the plot, and the effect of each phase on domestic spread is clear. For Phase 1, ω̂_1 = 0.91, 95% CI: (0.80, 1.05); for Phase 2, ω̂_2 = 0.71, 95% CI: (0.50, 1.03); for Phase 3a, ω̂_3a = 1.28, 95% CI: (1.04, 1.58); and for Phase 3b, ω̂_3b = 0.80, 95% CI: (0.63, 1.00). Thus Phase 3a saw the largest average domestic spread of the four periods considered.

Regions with many COVID-19 cases may have experienced higher rates of under-detection than the Northern Health Authority region because of a lack of testing capacity. However, our results are consistent with a serological study (Skowronski et al., 2020) in the Lower Mainland of BC, which used two sampling snapshots (one in March and one in May 2020). Skowronski et al. (2020) found that for May 2020, the 95% confidence interval for total cases was between 2.25 and 20.5 times greater than the reported number of cases, while we found for May 2020 (weeks 6-10) a 95% confidence interval of 3.69-8.75 times greater than the reported number of cases. While these confidence intervals are consistent with each other, and we believe the under-detection rate could be similar across regions (when testing volume has been accounted for), further studies applying these methods to other regions would be necessary for confirmation.

Owing to the summation over states needed to remove the latent variables N_t and R_t from the likelihood, the likelihood function is computationally demanding, with computation times roughly proportional to K^3. This is one important reason to consider the Bayesian MCMC approach over the MLE approach. The Bayesian approach is more computationally tractable for large N_t, as the MCMC algorithm explores the space of possible latent variable states stochastically, without resorting to integrating over states. For comparison, in the Northern Health Authority case study, both the MLE approach and the MCMC approach took close to 6 h of computing time (both were computed using Compute Canada's WestGrid). However, for larger regions, the MLE approach may become intractable because of the dependence of the computing time on the population size. In choosing the priors for the MCMC approach, it would be beneficial to incorporate results from other comparable studies.
In particular, there is little prior knowledge informing the range of the spread rate ω and the mean importation γ. Improving the prior distributions would be helpful in reducing the posterior variance and improving the accuracy of estimates (Kéry & Royle, 2015, Section 2.5.3).

Currently, our model cannot account for the differences between symptomatic and asymptomatic cases. The data we used to inform the model involve aggregated case counts of primarily symptomatic cases. However, our model attempts to estimate the total number of COVID-19 cases (including both symptomatic and asymptomatic cases). This is possible because asymptomatic cases still contribute to the spread of the virus, increasing the number of symptomatic cases (Ganyani et al., 2020; Johansson et al., 2021). Thus the population growth parameters γ and ω are informed by both the symptomatic and the asymptomatic cases (and the probability of detection p will be deflated by the presence of asymptomatic cases). However, our model does not treat the two categories separately, thus making the implicit model assumption that both categories share identical population dynamics. In future work, this could be addressed by including additional information about asymptomatic cases, which to our knowledge is not widely available from regional public data. In the hospitalization records model of Pullano et al. (2021), researchers looked at under-detection of COVID-19 in France, and their approach included information on the numbers of symptomatic and asymptomatic cases observed through testing, allowing them to assess population dynamics for those case categories. Similar data collection could be done for other regions, and the information could be incorporated into our model using techniques similar to those of Pullano et al. (2021). If the proportion of asymptomatic cases were known over time, then that information could be used as a time covariate for the relevant parameters in our model.

Our model does not make use of any additional data to distinguish between domestic spread and importation. If we had access to reliable data on the proportion of observed cases that are due to domestic spread versus importation, then we could also incorporate these data as parameter covariates and further improve the estimates of the population dynamics parameters γ and ω. This could improve our model fit, as our current estimate of the apparent mean importation is γ̂ ≈ 0, which is unlikely to be the true mean importation. As an example, consider as auxiliary data the number of incoming travellers who test positive per week. These auxiliary data would give a lower bound on γ for each time point, leading to better estimates of both γ and ω. The auxiliary data would be incorporated in the MLE models by including an indicator variable in the likelihood function, and in the MCMC models as additional prior knowledge for γ.

An important limitation of our model is the requirement for a sufficient number of reporting periods T. Initially we looked at only the Phase 1 data for the Northern Health Authority region. However, since Phase 1 contained only 8 weeks of data, both the MLE and the MCMC models failed to converge. For the MLE model, increasing K caused the estimated abundance to increase without bound, while p̂ decreased asymptotically to zero. For the MCMC model, the failure was evident in that the prior distributions chosen for λ and p became overly informative.
Another shortcoming of the model, identified by an anonymous reviewer, is that the number of deaths D_t is assumed fully observed, while in the model specified in Equation (1) we make the simplifying assumption that p_d is the same in the two multinomial components of the model (the state process and the observation process). This implies that D_t is, marginally, both a Binomial(N_t, p_d) variable under the state process and a Binomial(a_t, p_d) variable under the observation process. However, this is inconsistent when D_t > 0, since N_t > a_t. This inconsistency will have little to no effect on our case study, since p̂_d is small. However, for larger values of p_d, this issue could be solved by adding an additional parameter α to the model, which would allow α p_d (the probability of death for observed cases) to be larger than p_d (the probability of death for total cases). In this case, in the observation process of Equation (1), we would need to replace p_d with α p_d. Several options exist for modelling α. It could be another free parameter to be estimated by the model-fitting process, or it could be taken to be the reciprocal of the probability of detection: α = 1/p. The latter case seems preferable, as it has the potential to better inform estimates of the probability of detection when p_d is large; and when p approaches 1, α p_d approaches p_d. This latter case leads to the following generative process:

Observed Active Cases: a_t = n_t + a_{t−1} − r_{t−1} − D_{t−1}, with a_0 = r_0 = D_0 = 0
Domestic Spread: S_t ∼ Poisson(ω N_{t−1}), for t > 1
Imported Cases: G_t ∼ Poisson(γ), for t > 1
Abundance Updates: N_t = A_{t−1} + S_t + G_t, for t > 1
Observation Process: n_t ∼ Binomial(N_t − (a_{t−1} − r_{t−1} − D_{t−1}), p), and (a_t − r_t − D_t, r_t, D_t) ∼ Multinomial(a_t, (p*_a, p_r, α p_d)),     (5)

where p*_a = 1 − α p_d − p_r is the probability of remaining active in the observed subset of cases. We note that this implies an additional constraint: p > p_d. However, the constraint is already accounted for by the assumption that deaths are fully observed. For future applications to larger populations, we recommend introducing the parameter α as specified in Equation (5). To illustrate the similarity between the α = 1 model and the α = 1/p model when the number of deaths is relatively small, we show in Table 8 the parameter estimates obtained from fitting both models using the Bayesian framework (the parameters compared are the initial mean abundance λ, mean imported cases γ, mean domestic spread ω, probability of detection p, probability of mortality p_d, and probability of recovery p_r). Parameter estimates are nearly identical between the two models; for the Northern Health Authority data, the difference in estimates is negligible.
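A minimal sketch of the modified observed-case partition in Equation (5) with α = 1/p is given below, using illustrative values; the stopifnot call reflects the implied requirement that p exceed p_d.

# Illustrative values only.
p   <- 0.30; p_d <- 0.01; p_r <- 0.60
alpha <- 1 / p                          # observed-case death probability is alpha * p_d
stopifnot(p > p_d)                      # constraint implied by alpha = 1/p
p_a_star <- 1 - alpha * p_d - p_r       # probability of remaining active (observed cases)

a_t <- 25                               # placeholder number of observed active cases
obs <- rmultinom(1, size = a_t, prob = c(p_a_star, p_r, alpha * p_d))
# obs components: remain active, recover, die (deaths fully observed)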
For future applications of this model to more recent data, in particular to the current situation with the COVID-19 pandemic, we also recommend including total administered vaccinations as a time-varying covariate for ω, since a larger proportion of vaccinated individuals in a population reduces the rate of spread of the disease (Haas et al., 2021).

In the N-mixtures disease analytics work of DiRenzo et al. (2019), data are required for modelling both infected and non-infected individuals, as sampling is assumed to occur at random in a mixed population. In contrast, we consider only infected individuals, as the publicly available COVID-19 count data do not contain any non-infected individuals. It is possible for us to ignore the uninfected population because of the implicit assumption that the uninfected population is large compared to the infected population (so that we do not account for population saturation effects, such as running out of susceptible members of the population to infect). While this implicit assumption is reasonable during the early stages of the pandemic, it can be removed by including time-varying covariates for ω, which would allow ω to decrease to zero as the population becomes saturated.

We have looked at a single Health Authority region in British Columbia, Canada. We will continue to apply these methods to each of the remaining four Health Authority regions so as to compare the results across the province. By extending these methods to multi-site modelling, we intend to evaluate the situation for the province of BC as a whole, treating each Health Authority region as an independent site. This will allow each site to borrow population dynamics information from the other regions.

References

Akaike, H. (1974). A new look at the statistical model identification.
Bain, L. J. & Engelhardt, M. (1992). Introduction to Probability and Mathematical Statistics.
Barone (2020). Chile counts coronavirus deaths as 'recovered'. New York Times.
BC Centre for Disease Control (2020). BC COVID-19 data.
Bendavid et al. (2021). COVID-19 antibody seroprevalence in
Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms: 2. The new algorithm.
Buitrago-Garcia et al. (2020). Occurrence and transmission potential of asymptomatic and presymptomatic SARS-CoV-2 infections: A living systematic review and meta-analysis.
Dail, D. & Madsen, L. (2011). Models for estimating abundance from repeated counts of an open metapopulation.
DiRenzo et al. (2019). Disease-structured N-mixture models: A practical guide to model disease dynamics using count data.
Fernández-Fontelo et al. (2016). Under-reported data analysis with INAR-hidden Markov chains.
Fernández-Fontelo et al. (2020). Estimating the real burden of disease under a pandemic situation: The SARS-CoV2 case.
Fetzer, T. & Graeber, T. (2020). Does contact tracing work? Quasi-experimental evidence from an Excel error in England.
Fletcher, R. (1970). A new approach to variable metric algorithms.
Ganyani et al. (2020). Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data.
Gelman et al. (2013). Bayesian Data Analysis.
Goldfarb, D. (1970). A family of variable-metric methods derived by variational means.
Government of British Columbia.
Haas et al. (2021). Impact and effectiveness of mRNA BNT162b2 vaccine against SARS-CoV-2 infections and COVID-19 cases, hospitalisations, and deaths following a nationwide vaccination campaign in Israel: An observational study using national surveillance data.
Johansson et al. (2021). SARS-CoV-2 transmission from people without COVID-19 symptoms.
Kéry, M. & Royle, J. A. (2015). Applied Hierarchical Modeling in Ecology: Analysis of Distribution
Kéry, M. & Schaub, M. (2011). Bayesian Population Analysis Using WinBUGS: A Hierarchical Perspective.
Koller, D. & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques.
Lawless, J. F. & Yan, P. (2021). On testing for infections during epidemics, with application to Covid-19 in Ontario.
Moriña et al. (2021). Cumulated burden of COVID-19 in Spain from a Bayesian perspective.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling.
Plummer, M. (2019). rjags: Bayesian Graphical Models using MCMC. R package version 4-10.
Province of British Columbia (2020). B.C. COVID-19 - Laboratory information.
Pullano et al. (2021). Underdetection of cases of COVID-19 in France threatens epidemic control. Nature, 590.
R Core Team (2020).
Royle, J. A. (2004). N-mixture models for estimating population size from spatially replicated counts.
Saeed et al. (2021). SARS-CoV-2 seroprevalence among blood donors after the first COVID-19 wave in Canada.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization.
Skowronski et al. (2020). Low SARS-CoV-2 sero-prevalence based on anonymized residual sero-survey before and after first wave measures in British Columbia.
Song et al. (2020). IgG seroprevalence of COVID-19 among individuals without a history of the coronavirus disease infection in Daegu.
Spanish Ministry of Health (2020). Estudio ene COVID-19: Segunda ronda estudio nacional de sero-epidemiología de la infección por SARS-CoV-2 en España.

Acknowledgements

This study was enabled in part by the support provided by WestGrid (www.westgrid.ca) and Compute Canada/Calcul Canada (www.computecanada.ca). We would like to acknowledge the Michael Smith Foundation for Health Research and the Victoria Hospitals Foundation for support through a COVID-19 Research Response grant, as well as a Canadian Statistical Sciences Institute Rapid Response Program-COVID-19 grant to LC that supported this research. This manuscript was greatly improved by comments and feedback from two anonymous reviewers, as well as from the journal editors.