key: cord-145890-ab4o0xol authors: Giuliani, Diego; Dickson, Maria Michela; Espa, Giuseppe; Santi, Flavio title: Modelling and predicting the spatio-temporal spread of Coronavirus disease 2019 (COVID-19) in Italy date: 2020-03-14 journal: nan DOI: nan sha: doc_id: 145890 cord_uid: ab4o0xol Official freely available data about the number of infected at the finest possible level of spatial areal aggregation (Italian provinces) are used to model the spatio-temporal distribution of COVID-19 infections at local level. Data time horizon ranges from 26 February 20020, which is the date when the first case not directly connected with China has been discovered in northern Italy, to 18 March 2020. An endemic-epidemic multivariate time-series mixed-effects generalized linear model for areal disease counts has been implemented to understand and predict spatio-temporal diffusion of the phenomenon. Previous literature has shown that these class of models provide reliable predictions of infectious diseases in time and space. Three subcomponents characterize the estimated model. The first is related to the evolution of the disease over time; the second is characterized by transmission of the illness among inhabitants of the same province; the third remarks the effects of spatial neighbourhood and try to capture the contagion effects of nearby areas. Focusing on the aggregated time-series of the daily counts in Italy, the contribution of any of the three subcomponents do not dominate on the others and our predictions are excellent for the whole country, with an error of 3 per thousand compared to the late available data. At local level, instead, interesting distinct patterns emerge. In particular, the provinces first concerned by containment measures are those that are not affected by the effects of spatial neighbours. On the other hand, for the provinces the are currently strongly affected by contagions, the component accounting for the spatial interaction with surrounding areas is prevalent. Moreover, the proposed model provides good forecasts of the number of infections at local level while controlling for delayed reporting. Many scholars are trying to give a contribution to the problem, both in open online venues and on scientific publications, debating about reproductive number, mortality distributed for age and gender and, more in general, epidemic features, without neglecting to avail of mathematical and statistical models. The main contributions use deterministic epidemiological models (see Kucharski et al., 2020; Liu et al., 2020a; Sun et al., 2020, among others) , all focussing on the time evolution of the phenomena, especially with predictive purposes. The prediction of new infected, deceased and healed is certainly essential for health policy in order to estimate the capacity of a health system to cope with the stress caused by a pandemic. Nevertheless, in the recent literature the relevance of space in the diffusion of the COVID-19 is treated only marginally (Chinazzi et al., 2020; Li et al., 2020b) , although the importance of spatial and spatio-temporal autocorrelation in epidemiology was already highlighted in the seminal book by Cliff and Ord (1981) . The authors show the effectiveness of statistical methods in analyzing the incidence of some epidemics, such as e.g. measles in Cornwall, 1969 Cornwall, -1970 in London in 1849 and tuberculosis and bronchitis in Wales, 1959 Wales, -1963 In the last 20 years spatial epidemiology has been evolving rapidly and it has found a great response in medical applied research (for a recent review on methods and applications in the field, see Kirby et al., 2017) . The importance of the space in the study of disease transmission, like all natural phenomena, answers to the first law of geography (Tobler, 2020) , for which everything is related to everything else, but near things are more related than distant things". Reformulating this concept, it could be said that the phenomenon external to an area of interest affects what goes on inside. In light of this, we found imperative to include a component that can account for spatial dependence among areal units in the study of COVID-19 diffusion, in order to explain the strength and direction of the spreading on the territory as well as in time. This is particularly relevant in countries such as Italy or Germany in which the local health systems interested by the disease are strongly regionalized. The territorial specificity, which can result in more or less drastic containment measures, cannot be neglected in a model construction that has the ambition to explain how the contagion moves in space and time and to allow reliable spatio-temporal predictions. The purpose of the present paper is to model and predict the number of COVID-19 infections, drawing out the effects of its spatial diffusion. Forecasts about where and when the disease will occur may be of great usefulness for public decision-makers, as they give them the time to intervene on the local public health systems. The Italian Department of Civil Protection according with the Italian Ministry of Health, has started to release a daily bulletin about COVID-19 infections in Italy since 26 February 2020. By 3 March 2020 data were also released at NUTS-3 level (provinces), 1 having been previously released only at the highly aggregated NUTS-2 level. Data are available from the websites of the main Italian newspapers, such as, Il Sole 24 ore. 2 Therefore, immediately after the release, we were able to collect the infectious disease count time series at provincial level for the period 1 NUTS (Nomenclature of Territorial Units for Statistics) is a European Union geocode standard for referencing the subdivisions of countries for statistical purposes, established by Eurostat in agreement with each member state. The level of NUTS are four, nested into each other: NUTS-0 are the sovereign countries; NUTS-1 are the major socio-economic regions; NUTS-2 are basic regions for the application of regional policies (Italian regions); NUTS-3 are small regions for specific diagnoses (Italian provinces). 2 https://lab24.ilsole24ore.com/coronavirus. from 26 February 2020 to 18 March 2020 using two sources. Since spatial dependence may result stronger than it actually is at a coarse spatial resolution (Arbia et al., 1996) , we have wanted to use data at the provincial level of aggregation, which is the freely available finest possible level of aggregation. For the first six days, data were provided only to media outlets and published by newspapers in the form of interactive maps. For this period, we have used web scraping techniques to track raw data upon which maps have been constructed. For the remaining days, we simply downloaded data about provinces released by the Department of Civil Protection. Clearly, at this stage, data suffer from several problems of quality. The number of contagions is continuously updated and positive swabs need to be confirmed by the National Institute of Health, not to mention several cases of delayed reporting by the local authorities. Therefore, the results presented in the paper may benefit from availability of more accurate data. The overall temporal distribution of daily counts of COVID-19 infections is depicted in Figure 1 , 3 while the spatial distribution is showed in the map in Figure 2 . The two plots indicate that while the growth of cases over time in Italy has been globally exponential, it is characterized by strong local heterogeneity and also by a relevant spatial pattern in the form of a hot-spot in the northern part of the country (the plot of the time series disaggregated for all Italian provinces is reported in the Appendix). The evolution of the number of daily infections is studied by means of a statistical model which has been adapted from that proposed in Held et al. (2005) and Paul and Held (2011) , where both temporal dynamics of the infections and its geographical mechanism of spread are considered and estimated (see Adegboye and Adegboye, 2017; Cheng et al., 2016 , for an implementation of the model in epidemiology). The expected number of infections occurred in a province in the course of a day may be decomposed according to three additive components representing three different channels through which the contagions may increase. The first component of contagion is represented by the spread mechanism of the COVID-19 within each province, which depends on the cumulative incidence of the disease and determines the speed of diffusion of the virus in the future. This component determines the temporal dynamics of the contagion within each province and for this reason is referred to as epidemicwithin component (see Appendix). Another channel is attributable to contagions originated from other provinces, and in particular from provinces which are geographically close to each other. The origin of such spatial dimension of contagion can be found in the high mobility of people across provinces, and substantially contributes to explain the observed diffusion of the virus from a limited number of focuses to a wider area which, at the time we are writing, heavily affects the most part of northern Italy. In the following, this source of contagion is referred to as epidemic-between component, as it concerns the inter-province spread of COVID-19 (see Appendix). The last portion of cases should be attributed to province-specific conditions which determined the first centres of infections and the initial exposure to the risk of contagion. Such a local component of the overall daily province infections is referred to as endemic component, according to terminology introduced in Paul and Held (2011) , and which, in this context, does not implies any epidemiological qualification of the COVID-19 in the population of Italian provinces. We point out that in this paper both terms epidemic and endemic have been inherited from Paul and Held (2011), whereas terms within and between are introduced in order to distinguish between the temporal and spatial terms which Paul and Held (2011) jointly refers to as epidemic component. The statistical model is structured in three parts which contribute to the number of daily infections per province according to the three described components. Each part of the model considers also the presence of heterogeneity both in the temporal and in the geographical mechanisms of contagion, so that the different dynamics existing inside the provinces, both in terms of temporal and geographical mechanisms spread of COVID-19 are modelled. All analyses presented in this paper were performed using the R software (R Core Team, 2018). The estimated model provides useful insights about the evolution and spread of COVID-19 occurrences across the territory of Italy. The most striking evidence is that the epidemic potential across areas is, on average, very strong; however, it is also highly heterogeneous among provinces. To better assess the degree of spatial heterogeneity, Figure 3 shows three maps depicting the decomposition of the estimated expected number of infections into its three components -namely the within-epidemic, between-epidemic and endemic contributions -by provinces. For each province, the three fitted components are expressed as proportions of their sum. We can clearly see that only few provinces, the most affected by the disease until now, are mainly influenced by the local endogenous transmission of the contagion (map on the left). For a relatively higher number of provinces, mostly located in the north and center parts of the country, a relevant number of cases is instead explained by the transmission from neighboring provinces (map on the centre). For the majority of provinces, located mainly in the south, the contagions follow essentially the endemic trend (map on the right). The comparison among provinces in terms of the three investigated components is robust to spuriousness because the Italian population is entirely and homogeneously susceptible, that is immunologically "naive", to this new virus. Moreover, being vaccines still missing, there are no differences among provinces deriving from vaccination coverage. To gain further insights about the relative importance of the three components we take into consideration some paradigmatic provinces located in the part of the country with the highest number of cases, that is the northern areas (see Figure 4 ). In particular, we focus the attention on the northern part of Italy because after the illness started to occur in the country, it took several days to spread throughout the territory, thus leading to an unbalanced spatial distribution of infections. Figure 5 depicts the mean number of cases estimated by the model along with the observed number of cases for the paradigmatic provinces. First of all, the province of Lodi (Lombardy region), which is the first area that reported cases, sees a predominance of the component related to the internal diffusion of the disease. This evidence agrees with the quarantine to which some of its municipalities underwent since the beginning of the outbreak and with the hypothesis that the epidemic in Italy started right there. The province of Lodi shares the eastern border with the province of Cremona (Lombardy region), that in turn shares the northern border with the province of Bergamo (Lombardy region) as depicted in Figure 4 . The latter two are so far some among the most severely affected provinces in Italy. Their proximity with the province of Lodi makes them widely susceptible to the contagion effect between neighboring regions. This suggests that the contagion of infections among inhabitants occurred before containment measures were carried out and that the disease remained undetected before patient 1 was discovered. A similar situation arose for the province of Parma (Emilia-Romagna region), which share the norther border with the province of Cremona. Even here the between-epidemic component is relevant and it may be explained with the strong social and economic ties existing among the areas. The second in time discovered outbreak of COVID-19 in Italy was detected in a small town in the province of Padova (Veneto region). Even in this case, strong control measures to reduce transmission outside have been carried out, as evidenced by the almost total influence of the within-epidemic component. This shows that there have been no "return infections" from the surrounding areas but it does not mean that, before the quarantine, the infection could not have circulated in the neighboring provinces. In fact, the province of Venice (Veneto region), which is geographically, culturally and economically near to that of Padova demonstrates to be affected by these form of interactions. The cases of the provinces near the national borders are also of particular interest, such as the provinces of Turin (Piedmont region) and Trieste (Friuli Venezia Giulia region), adjoining with France and Slovenia, respectively. These provinces may be affected by the so-called boundary problem for which the analysis of a phenomenon may be biased by the presence of arbitrary administrative geographic borders. Thus, on one side we might lose information regarding what happens beyond the country border, due to different ways of collect information and monitor the diffusion of the disease. One the other side, we might not have a clear overview about cross-border phenomena as job trajectories. In addition, the two considered provinces seem to be not connected with the aforementioned outbreaks originated in Lombardy and Veneto regions. Due to these reasons, the provinces of Turin and Trieste show a limited effect of epidemic components and a prevalence of the endemic component, which is due only to the evolution of the disease over time. Looking at the national aggregated counts, the exponential evolution of COVID-19 contagions is clear. On the country as a whole, the main component seems to be the temporal trend but also the epidemic components have a not negligible role. The observed decomposition is probably mainly due to the central and southern provinces, which are characterized almost exclusively by the spread over time component. Moreover, to check the ability of the proposed model in explaining the spatio-temporal distribution of the COVID-19 occurrences in Italy, we examine how it is good at predicting the future daily counts of infections. Therefore, we re-estimate the model using the data between 26 February 2020 -17 March 2020 and we make prediction for 18 March 2020. The observed total number of cases at 18 March 2020, obtained by aggregating the province data, is 4204; the model provides a prediction of 4191 cases. Therefore, the model underestimates by only 14 units. Naturally, the advantage of the proposed model is that it provides predictions for each individual province (point predictions can be found in Tables 2 and 3 of the Appendix) but it is clear that the use of predictions at local level can have positive effects on the prediction of total number at country level. We report (in Figure 2 of the Appendix) the 80% prediction intervals of the provincial counts of infections at 18 March 2020. Given the brevity of the observed time series, to obtain a reasonable degree of precision we cannot require higher confidence levels. Despite the fact that the intervals are quite wide, the results are promising if compared with the known observed counts. We expect that the level of precision will improve as the data are updated and the observed time series become longer. In this article, we analyzed, by modelling, the trend of COVID-19 epidemics along time and space. The use of spatio-temporal models can greatly improve the estimation of number of infected and can help the public decision-makers to better plan health policy interventions. Italy has viewed a massive spread of the disease with peculiar patterns on the territory. Started from some provinces in the northern area, this serious illness descended down the Boot and is nowadays present in all 107 Italian provinces. Several discrepancies among provinces, especially between north and center-south of the country, persist but coming weeks seem to be decisive to understand the behavior of the contagion trend. Containment measures in Italy have followed an application in three steps: first, some municipalities in Lombardy and Veneto regions underwent to quarantine; second, the entire Lombardy and some provinces in other northern regions (Veneto, Piedmont, Emilia-Romagna and Marche) were isolated from the rest of the country; third, all Italian territories were subjected to a complete lockdown. Such stepwise approach in imposing hard control measures to the entire national territory might have conducted to a temporal shifting of the contagion dynamics. Emblematic is the case of the province of Lodi, the first area submitted to the quarantine. After an average increment of 85 cases when the illness began to be systematically discovered, in the last 5 days of the time series, which correspond to three weeks after the lockdown, this province seems to have settled the average number of newly discovered contagions equal to 42 cases (so the contagions have been halved). This fact suggests that policies of contagion containment seem to exert a mild success so far in the areas in which there was an effective enforcement of the control measures. Instead, in metropolitan areas where the population density and the more active social behavior make more difficult to comply properly to the recommendations, the whished effects are not yet taking place. This is the case of the province of Milan, which is experiencing an increasing number of cases in last days (from an average of 9 new cases detected of the beginning to an average of 273 new cases of last days). And then there is the case of center and southern provinces, which have viewed a delayed start of the epidemics but also the arrival, in the last two weeks, of flows of people escaped from northern regions undergone to lockdown. As an example, the province of Florence went from an average increment of 3 new cases of the first period to an average increment of 41 new cases of last 5 days, and, similarly, the province of Naples, has gone from an average increment of 4 new cases to an average increment of 29 new cases, confirming the hypothesis of downhill race of the disease along the peninsula. All this has led to the failure in applying homogeneous measures at national level, also due to regional autonomy in force in Italy and to important local differences, which may lead to delays or failures in containing the contagions from a global point of view. This evidence suggests that consider the peculiarities of local territories may be effective in planning specific strategies for enforcing the containment measures established at national level. These differences in the dynamics of the epidemics in Italy demonstrate the crucial importance of a strong national coordination level for the homogeneous enforcement of control measures, but also reveal how essential predictions at local level are. Since the epidemics started, a frequent question of the public decision-maker is about when the peak of contagions will manifest. Probably a more appropriate request should concern the emergence of different local peaks in different moments, which Italy should expect in next weeks. The dramatic events in northern provinces can serve as a test bench for the health system, offering an overview about what southern provinces might be confronted. Analyses and predictions both in space and time offer a decisive perspective about which areas may be more affected and when, given the time to the decision-makers to intervene on the local policies. DG and FS were responsible for statistical modeling, data analysis and writing of the manuscript. MMD and GE were responsible for result presentation, discussion and writing of the manuscript. All authors reviewed, revised and approved the final manuscript. • None of the authors of this paper has a financial or personal relationship with other people or organizations that could inappropriately influence or bias the content of the paper. • No competing interests are at stake and there is no conflict of interest with other people or organizations that could inappropriately influence or bias the content of the paper. The data about the spatio-temporal distribution of Coronavirus disease 2019 infections at province level consist in multivariate count time series whose spatial references are in the form of irregular spatial lattices. Therefore, the proper regression modelling framework for this empirical circumstance is the class of the so-called areal Generalized Linear Models (GLMs). By extending the seminal model originally introduced by Held et al. (2005) , Paul and Held (2011) proposed an endemic-epidemic multivariate time-series mixed-effects GLM for areal disease counts, which proved to provide good predictions of infectious diseases (see Adegboye and Adegboye, 2017; Cheng et al., 2016 , among the others). The main equation of the model describes the expected number of infections µ r,t in a region (province) r at time (day) t as follows: where Y r,t is the number of infections reported in the region r at time t, which follows a negative binomial distribution with region-level overdispersion parameter ψ r . If ψ r > 0 the conditional variance of Y r,t−1 is µ r,t (1 + ψ r µ r,t ), while if ψ r = 0 the negative binomial distribution reduces to a Poisson distribution with parameter µ r,t . The three terms on the right-hand side of Equation (1) correspond to the three components of the model: the epidemic-within, the epidemic-between, and the endemic. The first component models the contribution of temporal dynamics of contagions to the expected number of infections within region r. The term includes the number of infections observed in the previous day (time t − 1), which affect µ r,t depending on the value of the coefficient λ r > 0. As the notation suggests, λ r changes amongst the provinces because of a random effect which allows for heterogeneous behaviour in the dynamics of contagions. The epidemic-between component models the contagion between neighbouring provinces by including the average incidence of the infections Q r ,t−1 of provinces r which are neighbours of province r. In particular, the coefficients w r ,r in the summation r =r w r ,r Q r ,t−1 are positive if either province r and r share a border or province r and r share a border with the same province, whereas w r ,r is zero otherwise. The coefficient φ r determines the magnitude of the effect of inter-province spread of contagion, and changes amongst provinces according to the population as well as to unobserved heterogeneity in the diffusion process. The last component determine the province-specific contribution to the number of infections, once the temporal and spatial autoregressive effect are accounted for. The term e r is the population of province r, whereas term ν r,t consists of a national time trend component, and a province-specific effect depending on the share of population over 65, and on a random effect which catches the heterogeneity due to unobserved factors. Paul and Held (2011) suggested that the endemic and epidemic subcomponents can be modelled themselves through log-linear specifications: log(ν r,t ) = α (ν) where the α (·) r parameters are region-specific intercepts and z (·) r,t represent observed covariates that can affect both the endemic and epidemic occurrences of infections. The varying intercepts allow to control for unobserved heterogeneity in the disease incidence levels across regions due, for example, to under-reporting of actual infections (Paul and Held, 2011) . Given the regionally decentralized health system in Italy, non-negligible differences in case reporting of COVID-19 infections among Italian regions are very likely, which make the opportunity to have regionspecific intercepts very important. Following Paul and Held (2011) region-specific intercepts can be obtained through the inclusion of random effects. In particular, we assume here that . The Paul and Held (2011) model with normally distributed random effects can be estimated through penalized likelihood approaches that have been implemented in the R package surveillance (Meyer et al., 2017) . See Paul and Held (2011) for futher details. Given the brevity of the observed time series, the epidemic-within autoregressive parameter is assumed to be constant over time and, in absence of useful exogenous covariates, the model of Equation (2) takes the form log(λ r ) = α (λ) r , that is, the "internal" infectiousness depends only on a spatially varying intercept. Following Meyer and Held (2014) , the subcomponent model for the epidemic within autoregressive parameter, Equation (3), is here specified as 3 log e r , which accounts for the fact that the regions may have different propensities to be affected by the other neighbouring regions, and this may depend by their resident population share. The rationale is that the more populated regions tend to be more susceptible to transmission across regions. Since some first recent empirical evidences suggest that the number of COVID-19 infections seems to grow exponentially over time (Liu et al., 2020b; Maier and Brockmann, 2020) , the endemic component model assessing the temporal dynamic of disease incidence, Equation (4), is specified as a second-order polynomial log-linear regression: 4 3 log(a r ), where t = 1, 2, . . . is the time in days and a r is the proportion of inhabitants over 65 years old. In the global model of Equation (1) the endemic predictor ν r,t is multiplied by the offset e r , which in our case is the regional share of resident population. 4 We do not include higher-order terms to avoid the risk of introducing spurious endemic patterns and overfitting. Spatially correlated time series and ecological niche analysis of cutaneous leishmaniasis in afghanistan Effects of the maup on image classification Analysis of heterogeneous dengue transmission in guangdong in 2014 with multivariate time series model The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak Spatial processes: models and applications Clinical characteristics of coronavirus disease 2019 in China A statistical framework for the analysis of multivariate infectious disease surveillance counts Advances in spatial epidemiology and geographic information systems Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases Early transmission dynamics in wuhan, china, of novel coronavirusinfected pneumonia Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (COVID-19) Evolutionary history, potential intermediate animal host, and crossspecies analyses of SARSCoV2 The reproductive number of COVID-19 is higher compared to SARS coronavirus The reproductive number of COVID-19 is higher compared to SARS coronavirus Effective containment explains sub-exponential growth in confirmed cases of recent covid-19 outbreak in mainland china Power-law models for infectious disease spread Spatio-temporal analysis of epidemic phenomena using the r package surveillance Authors thank Umberto Agrimi of the Istituto Superiore della Sanit (Italian National Institute of Health, Rome), who read a preliminary version of this article and provided several valuable comments and suggestions.