key: cord-0198146-241z9tel authors: Izadi, Farzali title: Generalized additive models to capture the death rates in Canada COVID-19 date: 2020-07-05 journal: nan DOI: nan sha: 1b98802d2d80de88b413d0a07c199e3e3737632f doc_id: 198146 cord_uid: 241z9tel To capture the death rates and strong weekly, biweekly and probably monthly patterns in the Canada COVID-19, we utilize the generalized additive models in the absence of direct statistically based measurement of infection rates. By examining the death rates of Canada in general and Quebec, Ontario and Alberta in particular, one can easily figured out that there are substantial overdispersion relative to the Poisson so that the negative binomial distribution is an appropriate choice for the analysis. Generalized additive models (GAMs) are one of the main modeling tools for data analysis. GAMs can efficiently combine different types of fixed, random and smooth terms in the linear predictor of a regression model to account for different types of effects. GAMs are a semi-parametric extension of the generalized linear models (GLMs), used often for the case when there is no a priori reason for choosing a particular response function such as linear, quadratic, etc. and need the data to 'speak for themselves'. GAMs do this via the smoothing functions and take each predictor variable in the model and separate it into sections delimited by 'knots', and then fit polynomial functions to each section separately, with the constraint that there are no links at the knots - second derivatives of the separate functions are equal at the knots. In late December 2019, some severe pneumonia cases of unknown cause were reported in Wuhan, Hubei province, China. These cases epidemiologically linked to a seafood wholesale market in Wuhan, although many of the first 41 cases were later reported to have no known exposure to the market. In early January 2020, the World Health Organisation (WHO) named this novel coronavirus as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and the associated disease as COVID-19 [14] . Following this report, there has been a rapid increase in the number of cases as on 24th June 2020, there were over 9.2 million confirmed cases and almost 475,000 deaths worldwide, while the confirmed cases in Canada was 103,000, with 8,500 deaths mostly distributed in three provinces of Quebec, Ontario, and Alberta. The death rates for these cases which are available from https://www.canada.ca/ en/public-health/services/diseases/2019-novel-coronavirus-infection.html are shown in figure 1. From the box plot one can see that there are a few outliers in the Alberta data only, see figure 2. Also the trend, seasonal and random components of each case can be obtained by decomposition function of time series where in figure 3 we plotted only the Canada deaths. To see the normality of the data, we can plot the histogram as well as the density of the deaths, figure 4. The plots clearly show that there are substantial overdispersion relative to the Poisson for all cases except possibly for the Alberta data. In general, generalized additive models (GAMs), see e.g. S. Wood, 2017, [10] are a semi-parametric extension of the generalized linear models (GLMs), used often for the case when there is no a priori reason for choosing a particular response function (such as linear, quadratic, etc.) and need the data to speak for themselves. GAMs first introduced by Hastie and Tibshirani, 1986 [7] and Hastie and Tibshirani, 1990 [8] and are widely used in practice (Ruppert et al. 2003) [6] ; (Fahrmeir, L., Lang, S. 2001) [1] ; (Fahrmeir, et al. 2004 ) [2] (Fahrmeir et al. 2013) [3] . On the other hand, GLMs themselves are extension of the linear models (LMs). To understand the thing better, let us start with LMs. For a univariate response variable of multiple predictors one may simply write It is clear that the response variable y is normally distributed with mean µ, and variance σ 2 and the linearity of the model is apparent from the equation. One of the issues with this model is that, the assumptions about the data generating process are limited. One remedy for this is to consider other types of distribution, and include a link function g(.) relating the mean µ, i.e., (2. 2) E(y) = µ, g(µ) = αX. In fact the typical linear regression model is a generalized linear model with a Gaussian distribution and identity link function. To further illustrate the generalization, we may consider a distribution other than the Gaussian for example Poisson or a negative binomial distribution for a count data where the link function is natural log function. where D is any exponential family distribution. We can still generalize more to add nonlinear terms in the above equation namely where µ i = E(y i ) and y i ∼ an exponential or non-exponential family distribution and f j s are any univariate or multivariate functions of independent variables called smooth and nonparametric part of the model, which mean that the shape of predictor functions are fully characterized by the data as opposed to parametric terms that are defined by a set of parameters like the parameter vector α in the linear part in which both to be estimated. Having said that, the ordinary least square problem generalizes into where λ j , controlling the extent of penalization and establishes a trade-off between the goodness of fit and the smoothness of the model. On the other hand, these functions are represented using appropriate intermediate rank spline basis expansions of modest rank k j , as Substituting the function under the integral sign with the above equation we get where the right hand side is a quadratic form with respect to the known matrix S j . Collecting both the parametric and non-parametric coefficients into a double parameter (α, β), we obtain Writing S λ = n j=1 λ j S j we get the more compact form Letβ be the maximizer of L and H the negative Hessian of L atβ. From a Bayesian viewpointβ is a posterior mode for β. The Bayesian approach views the smooth functions as intrinsic Gaussian random fields with prior given by Writing the density in (2.8) as D g , and the joint density of y and β as D(y, β), the Laplace approximation to the marginal likelihood for the smoothing parameters λ and α is D(λ, α) = D(y, β)/D g (β, y). Finally, nested Newton iterations are used to find the values of log λ and α; maximizing D(λ, α ) and the correspondingβ. Let y k denotes the deaths reported on day k. We will study the Canada deaths in general and the provinces of Quebec, Ontario, and Alberta in particular cases. Due to overdispersion seen in the deaths, we assume that y k follows a negative binomial distribution with mean µ and variance µ + µ 2 /θ. Note that N B(θ, p), where p = µ/(µ + θ). Then let where g is a link function log in our discussion, f is a smooth function of time, t k , measured in days, f w is a zero mean cyclic smooth function of day of the week d k ∈ {1, 2, · · · , 7}, set up so that f (n) (14), f m is a zero mean cyclic smooth function of day of the month d k ∈ {1, 2, · · · , 30}, set up so that f For the deaths of Quebec, we see that the day, day of week is an appropriate choice for the predicted variables. As the results of figure 6 shows, all the variables are statistically significant with 0.75 as R-squared. In the case of the Ontario we see that also the day, day of week and day of month is an appropriate choice for the predicted variables. As the results of Figure 7 shows, all the variables are highly significant with 0.82 as R-squared. We figured out that the day, day of biweek as predictors for the Alberta case is an appropriate choice for the model. As the results of figure 8 shows, all the variables are statistically significant with 0.722 as R-squared. The next statistics is the results of the gam.check consists of Q-Q plot where for a good fit the data should lie on the red line. The second is the histogram of the residuals. In this plot, the histogram must be symmetric with respect to the line x = 0. Third plot is the Residuals vs. Linear predictors. This plot must be With gam model it is also straightforward to make inferences about when the peak in f occurs. To this end, it is enough to use the model matrix by removing cyclic part, the coefficients and variance-covariance matrix of the model to estimate the model functions and 95% confidence intervals. To find the day of occurrence of the peak for each corresponding underlying death rate function, f , simply we generates multivariate normal random deviates using 'rmvn' function from 'mgcv' package in which it takes 3 arguments as number of simulations, the coefficients and variancecovariance matrix of the model. The results for all 4 cases are shown on the figures 17 and 18 respectively. To obtain the model priors with auto-generated code and associated data is to simulate jagam from rjags library in GAMs. We also load glm to improve samplers for GLMs. This is useful for inference about models with complex random effects structure. The new mgcv function jagam is designed to be called in the same way that the modeling function gam would be called. That is, a model formula and family object specify the required model structure, while the required data are supplied in a data frame or list or on the search path. However, unlike gam, jagam does no model fitting. Rather it writes JAGS code to specify the model as a Bayesian graphical model for simulation with JAGS, and produces a list containing the data objects referred to in the JAGS code, suitable for passing to JAGS via the rjags function jags.model (Plummer 2016) [5] . To infer the sequence of past fatal infections one needs to produce the observed sequence of deaths. Verity et al. 2020 [9] show that the distribution of time from onset of symptoms to death for fatal cases can be modelled by a gamma density with mean 17.8 and variance 71.2 (s.d. 8.44 ). Let f v (t) be the function describing the variation in the number of eventually fatal cases over time. Let B be the lower triangular square matrix of order n, describing the onset-to-death gamma density, given above where n is the number of day of pandemic under consideration. Then we have B ij = γ(i − j + 1) if i ≥ j and 0 otherwise. Now Bf v = h, where h(k) is the expected number of deaths on day k. Then log(f v (k)) can be represented using an intermediate rank spline, again with a cubic spline penalty. We can then employ exactly the model of the previous section but setting f (k) = log(h(k)). The only difference is that we need to infer f v over a considerable period before the first death occurs where 15 days is clearly sufficient given the form of deaths data. After executing the jagam code for the data frame of deaths, day, day of week, day of biweek and day of month variables -note that we have used different combination of the above variables to see which one is appropriate of Canada deaths in general and Quebec, Ontario and Alberta in particular cases. Jagam code with Poisson family distribution produces a model containing all the priors. Then by adding the matrix B and a bit of extra regularization to the output we pass the model to jags.model function JAGS is Just Another Gibbs Sampler. It is a program for the analysis of Bayesian models using Markov Chain Monte Carlo (MCMC) simulation. Then by passing the result to the jags.samples function with the parameters of thin=300 and iteration=1000000 we get the values of θ, ρ, and the Monte carlo array b of 3 dimension with the first dimension equals to the number of parameters in the gam model. As in the gam model, we take the data in the first component of the jags model as the model.matrix X and the data in first 2 dimension of the b to simulate the fatal infection profiles f = exp(Xb) and get all the necessary statistics such as median with confidence intervals, the peak points of the median profile, the squared second difference of The red curve is the posterior median, the dashed curves delimit an 80% confidence interval and the dotted curves a 95% confidence interval. The scaled black bar chart shows the posterior distribution of the day of peak fatal infection. The dashed grey curve is proportional to the squared second difference of the median infection profile on the log scale which is proportional to the smoothing penalty. The green curve is proportional to the absolute gradient of the infection profile. Bayesian Inference for Generalized Additive Mixed Models Based on Markov Random Field Priors Penalized Structured Additive Regression for Space-TimeData: ABayesian Perspective Regression Models Nonparametric Regression and Generalized Linear Models rjags: Bayesian Graphical Models Using MCMC. R package version 4-6 Semiparametric Regression Generalized Additive Models (with discussion) Generalized AdditiveModels Estimates of the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious Diseases Generalized Additive Models: An Introduction with R (2 ed.) Smoothing parameter and model selection for general smooth models (with discussion) Just another Gibbs additive modeller: Interfacing JAGS and mgcv Simple models for COVID-19 death and fatal infection profiles World Health Organization. www.who.int/emergencies/diseases/novel-coronavirus-2019. Accessed Acknowledgement. I am very grateful to Simon S. Wood as his paper [13] was very inspirational to my work.