key: cord-0641263-xerodzi8
authors: Bonifazi, Gianluca; Lista, Luca; Menasce, Dario; Mezzetto, Mauro; Oliva, Alberto; Pedrini, Daniele; Spighi, Roberto; Zoccoli, Antonio
title: A statistical analysis of death rates in Italy for the years 2015-2020 and a comparison with the casualties reported for the COVID-19 pandemic
date: 2021-02-02
journal: nan
DOI: nan
sha: a3e39929e70c220c04216627bbb42f535f5882bc
doc_id: 641263
cord_uid: xerodzi8

We analyze the data about casualties in Italy in the period 01/01/2015 to 30/09/2020 released by the Italian National Institute of Statistics (ISTAT). The data exhibit a clear sinusoidal behavior, whose fit allows for a robust subtraction of the baseline trend of casualties in Italy, with a surplus of mortality in correspondence to the flu epidemics in winter and to the hottest periods in summer. While these peaks are symmetric in shape, the peak in coincidence with the COVID-19 pandemics is asymmetric and more pronounced. We fit the former with a Gaussian function and the latter with a Gompertz function, in order to quantify number of casualties, the duration and the position of all causes of excess deaths. The overall quality of the fit to the data turns out to be very good. We discuss the trend of casualties in Italy by different classes of ages and for the different genders. We finally compare the data-subtracted casualties as reported by ISTAT with those reported by the Italian Department for Civil Protection (DPC) relative to the deaths directly attributed to COVID-19, and we discuss the differences.

: The distribution of deaths collected by ISTAT [1] from 7903 districts in Italy between the years from January 1 st 2015 to September 30 th 2020. These data include both sexes and all ages. An annual modulation of the counts is evident with maxima corresponding to winter seasons and minima to summer.

The data are collected by gender, age and location for each individual death. Figure 1 shows the number of deaths in all the categories in the considered period. What is already striking by a simple visual inspection of the distribution is a periodic seasonal variation that behaves approximately like a sinusoidal wave of constant amplitude on top of an equally constant offset value 1 . This feature remains partly confirmed also by disentangling the data using age as a selection criteria. Fig. 2 shows the distribution of deaths for people in three different age classes: in blue those below 50 years, in green those in range 50 to 79; in red those above 80 and in magenta the sum of all these three classes. It is evident from these distributions that people with an age below 50 die, to a good approximation, with a constant average probability in any given day of the year while those above that age tend to have a varying, periodic probability of death with a maximum in winter and a minimum in summer. The older the age, the larger the excess of death in particular periods of the year, appearing in the form of Gaussian-like excesses over the sinusoidal wave. Disentangling the data by gender, see Fig. 3 , there seems to be a slight prevalence of female deaths with respect to males, except for the COVID-19 peak, where the situation happens to be reversed. These are just raw values, though, not corrected to take into account the ratio between males and females in the Italian population. Later on we will quantify and appropriately weight these data.

We perform a global fit over the data, where in a single step we estimate the sinusoidal baseline of the distribution, the seasonal excesses and the 2020 peak in correspondence of the COVID-19 pandemic. This method differs from other methods often reported in literature, where the background is subtracted by computing the average number of counts in the same period of the past 5 years, as in [8] , or by fitting the sinusoid with the data during periods of time where the excesses are not visible, like in spring or autumn [7, 9] .

We used a χ 2 fit to interpolate the data with an appropriate function meant to model the data in order to determine the value of the unknown parameters of the model along with their uncertainties. The actual minimization is carried out by the MINUIT [10] package.

The overall fit function has been defined as the sum of individual components in the following form:

1 In the following we will discuss how we established that there is no significant slope of the average value of this wave. 

where s(t), g i (t) endĠ(t) are defined and described below. The s(t) function is meant to model the wave-like variation of deaths with seasons, the g i (t) function describes the excess peaks visible above the wave andĠ(t) represents the rightmost excess peak (spring 2020), which, unlike the others, is asymmetrical. The index i runs from 1 up to k, the number of excess peaks featured by the data distribution except the last one on the far right.

The general wave-like behavior of the data is modeled by a sinusoidal function of the form:

where t is the day number starting from t 0 = 1/1/2015. The parameter c(t) represents the slowly-varying offset from zero deaths, a the amplitude of the oscillation, T is the period of variation (the time delay between consecutive maxima) and finally ϕ the phase. We tried to model c(t) allowing for a linear dependence on t, as c(t) = c 0 + c 1 t, but the fit determines a slope c 1 compatible to zero within uncertainty. We therefore decided to maintain the c term independent of time in the final fit.

Each individual excess above this s(t) wave can be modeled by a Gaussian distribution of the canonical form:

The choice of a Gaussian function here is only justified by being the simplest symmetrical function to describe these excesses representing, at the same time the distribution of a random variable. Modeling the excess peaks in the described way has the advantage that the individual g i fit parameters correspond to a Gaussian with the background contribution already taken into account in the overall fit model. The N i parameter of each Gaussian, corresponds to the number of excess deaths with respect to the wave-like background, whose values are also determined optimally by the fit itself. An advantage of this approach is that in the case of adjacent, overlapping Gaussians (as can be seen in Fig. 4 in the case of the g 6 end g 7 peaks but also the g 11 and the big peak on the far right of the distribution), each individual area is computed correctly by taking into account the nearby contributing ones.

While the excess peaks look highly symmetrical around their maximum and can thus be reasonably well modeled with Gaussians, as described before, the peak of the spring 2020, associated with the COVID-19 pandemic, is clearly asymmetric. We have tried several possible parametrizations for that distribution, such as bifurcated Gaussians with a common peak, generalized logistics, or else, to reflect the asymmetry, but in the end we resolved to adopt the derivative of a Gompertz function [11] simply because it is customarily adopted by epidemiologists to describe epidemic evolution's over time and we therefore considered it more suitable to our purpose.

A Gompertz function is parametrized in the following way:

Equation 4 represents a cumulative distribution. Since our data represent instead daily counts, we used its derivative, given by:Ġ

where the parameter N G is the value of the integral of this function.

In Figure 4 and Tables 1 to 3 we report the results of a fit to the whole data sample, comprising both genders for all ages in the six years from 2015 to 2020. The column labeled 'µ i ', in Table 1 , indicates the day when the maximum of an excess has been reached while those labeled 'µ i ± 2σ i ' indicate, respectively, the day of onset and demise from the average background, a time interval in which occur 95% of the death cases (expressed with calendar dates). The column labeled 'Duration' is the time difference between onset and demise (namely 4σ, expressed as number days). Table 1 . Table 2 : Results of the fit to the whole data set (no selection applied) for the baseline sinusoidal wave, as modeled by eq. 2 .

The pulls, p i , are defined as:

where d i is the number of death counts in a given day i and i the corresponding amount of statistical fluctuation. The data, being outcomes of counting values, are assumed to follow a Poisson distribution, hence i = √ d i . The χ 2 /n DOF of the fit turns out to be 3.271. We report the distribution of the pulls in Fig. 5 fitted with a Gaussian function. The mean value of the fit is −0.01 ± 0.04, compatible with zero, while the standard deviation of the Gaussian fit turns out to be 1.75 ± 0.04, confirming the significant underestimate of the uncertainties. This deviation from unity, of about 75%, gives an approximate amount of the increase that should be applied to the data errors.

The area of each Gaussian function i is given by the fit parameter N i defined in eq. 3, while the area of the Gompertz derivative is the fit parameter N G in eq. 4.

From Peak To Duration (days) 54 387 ± 557 20/2/2020 24/3/2020 11/5/2020 81 Table 3 : Results of the fit to the whole data set (no filters applied) for the Gompertz derivative function. The meaning of the columns labeled From, Peak and To is explained in the text.

The width of the Gompertz is computed from the first day in which the integral of the function exceeds 2.5% of the total to the day in which the integral reaches 97.5% of the total. These two days are reported in Table 3 under the columns labeled From and To. Table 2 , it turns out that the peak of the sinusoid (the maximum number of deaths) falls on January 31 st of every year.

These results highlight an interesting feature of the COVID-19 deaths excess. As already noted, almost every winter there is an excess of deaths with respect to the baseline, with the notable exception of the years 2015-2016 (a period with a particularly balm winter[18], with a relatively small value of 4 455 excess of casualties). The peak in the spring of 2020, instead, shows characteristics markedly different from the winter excesses of previous year in terms of amplitude, width and day of the year when the maximum is reached. In the following we will mention the possible implications of these differences.

We have also disentangled the data by age and gender and fit the distributions in these different categories to obtain accurate numerical values. Figure 6 : Casualties of people with age in the range 50 ≤ age ≤ 59 with a superimposed fit based on function 6 (blue points) while gray points are the category 0 < age ≤ 49. Numerical results are listed in Table 4 and 5. The plots at the bottom show the pulls of the two samples.

We start with a cumulative plot for all people aged between 50 (included) and 60 years (excluded) who died between 2015 and 2020, shown in Fig. 6 . This plot shows that the average value of daily deaths for people in this age range is about 70 casualties/day. In order to get a fit comparable with the one in Fig. 4 , we are forced to adopt a somewhat more stringent fit strategy. The wave parameter corresponding to the phase has been fixed to the value Table 4 : Results of the fit to the data set of people aged between 0 and 60 (excluded). These values correspond to the fit depicted in Fig. 6 established for the full data sample (the other three are left free to float in the s(t) function). This guarantees that maxima are reached in the winter and minima in the summer and no spurious time translation is introduced by the fit procedure when a local minima can eventually be found. Also the peak position and the width of the 13 Gaussians have been fixed to the values established for the whole data sample while the Gompertz parameters are all left free. The corresponding fit results are listed in Tables 4 and 5 .

Yield From Peak To Duration (days) 1 373 ± 87 28/2/2020 24/3/2020 1/5/2020 63 Table 5 : Results of the fit to the whole data set (no filters applied) for the Gompertz derivative function. These values correspond to the fit depicted in Fig. 6 .

The picture shows two categories of age at the same time: those in the range 0 − 49 (in gray) do not show any sign of seasonal variation around the mean value of ∼ 32/day (they were fit with a simple constant term). A sinusoidal variation begins to be noticeable only in the range 50 − 59 (blue dots), along with the presence of the corresponding death excesses indicating a continuous increase in magnitude with age starting around 50. The results are affected by larger uncertainties with respect to the full sample of Fig. 4 , reflecting the smaller size of population in this range.

The excess peaks and the sinusoid amplitude become more evident in a sample of even higher ages, namely 60 ≤ age < 80. The average number of deaths in this category is much larger, due to an enhanced health fragility for people of progressively higher age, as seen in Fig. 7 . Figure 7 : Casualties of people with 60 ≤ age < 80 (with a superimposed fit based on eq.1). Numerical results are listed in Table 6 and 7 The fit is again pretty similar, in shape but not in amplitude of course, to the full sample shown in Fig. 4 . The pulls feature a mean value compatible with zero also in this case. The fit strategy is the same as the one described before for Fig. 6 . Values obtained in this case are listed in Tables 6 and 7. The average death rate in this category is ∼ 72/day. Increasing the age threshold further up, by collecting deaths of people aged ≥ 80, we get a sample with very pronounced peaks, see Fig. 8 . The average death rate in this last category reaches the high value of ∼ 1070/day. Another information can be extracted from the data is the relative amount of deaths between genders. Fig 9 show the distribution of males and females (summed over all ages) superimposed with the relative fits. In this case, since the two samples have a rather large statistical amount, both fits have been performed with all parameters free to vary. These numbers need to be corrected by the relative number of males and females in the Italian population. The fraction of males in 2020 was 48.7% while females were 51.3% [12]: we compute a mortality factor (for each gender) by normalizing the yields to 29 050 096 and 30 591 392 (the respective number of males and females of the total Italian population by January 1 st 2020). The resulting values (multiplied by 100000) are listed in Table 10 under the columns Mortality. While the absolute number of female deaths is higher than the males one every year, the opposite seems true for the 2020 peak. After re-weighting this small discrepancy between genders remains basically Table 8 true for all peaks except the 2020 one, where the mortality turns out to be larger for males than for females. The fraction of casualties for the two genders turn out to be about the same, at the level of one standard deviation in all the years till 2019 included. Table 10 : Results of the fit to the data set divided in a sample of males and another of females (of all ages) in Fig. 9 .

The meaning of the Mortality column is described in the text. Table 11 : Results of the fit to the data set (for the Gompertz peak only) divided in a sample of males and another of females (of all ages) in Fig. 9 . The meaning of the Mortality column is described in the text.

The data set provided by ISTAT [1] and used for the present analysis is not the only one publicly available: another is provided by the Dipartimento della Protezione Civile (DPC) [5] , which provides a somewhat different kind of information regarding the number of deaths in the context of the COVID-19 pandemic. In particular, the data record, which begins February 24 th 2020, contains the number of daily deaths directly attributed to the current pandemic.

A plot of the data from these two disparate sources is shown in Fig. 10 . The magenta points (and the accompanying fit result of a Gompertz function in red) correspond to the ISTAT data sample: these data are a subset of those displayed in Fig. 4 , specifically those between the dates of February 24 th and September 30 th 2020, with the entries in each bin replaced with the difference between the actual counts and the contribution due to the underlying wave. This subtraction of the background of the data allows for a direct comparison between the ISTAT and DPC data, the latter doesn't requires a subtraction procedure being unaffected by a background.

The DPC sample is shown as blue dots (with the corresponding Gompertz fit superimposed in green). A clear peak is visible around spring 2020 together with a second one during fall 2020, corresponding, respectively, to the first and the second wave of the 2020 pandemic. It is worthwhile to note that the DPC data reports the day when the death was finally registered, unlike the case of the ISTAT data, which records the actual day of death, thus introducing a potential delay of a few days between the two samples, visible as a translation of the green line with respect to the red one.

The DPC data show a spike corresponding to August 15 th , due to the fact that a certain number of deaths were not correctly reported in the preceding weeks and were recovered assigning that day as the actual death date. In order to compare the yield returned by the fit to the value provided by the ISTAT data we had to exclude the contributions from the second pandemic peak: we decided to introduce a cutoff value while computing the sum of entries of the DPC sample in correspondence to August 16 th , a day when the minimum number of casualties was reached between the two pandemic waves, therefore including also the spike. The cutoff date is shown in Fig. 10 as a vertical green arrow.

The sum in that period (the blue dots) results to be 35 468.

On the other hand, the yield obtained for the ISTAT sample is the one reported in Table 3 , namely 54 387 ± 557, resulting from the integral of the Gompertz peak (the yields of two peaks at around July and August are therefore not included). The difference in the number of deaths from these two samples amounts to 18919±557. This strikingly large difference could be due to several different reasons, such as an excessive pressure on the Italian health system in the early stages of the pandemic which prevented a certain number of patients with diseases other than COVID-19 to be safely treated in hospitals and emergency rooms. We have no elements in the data that can allow us to discern the different contributions to this discrepancy.

The rich data sample provided by ISTAT allows for various additional visualizations. In Fig. 11 we display data for ages in groups of 4 years to visualize the increase of death probability with age: it becomes more evident what was already shown in Fig. 2 , namely the fact the young people tend to die with a rather flat probability along each year, while people of a progressively higher age tend to suffer more from illnesses in specific seasonal periods. Each bin in this plot constitutes the number of deaths in six contiguous days. In Fig.12 we present a scatter plot of death rates as a function of the day of the year (for the six years from 2015 to 2020) versus the age category. This graphical representation clearly illustrates the higher probability of death for the age category 70-95 with respect to the others.

Each value in the ISTAT data sample comes with a geographical tagging marker, allowing for a categorization of the number of deaths in different parts of Italy. Fig. 13 shows the fits for each of the four zones in Italy, namely North, Center, South and Islands 2 . The fits in Figure 11 : ISTAT data set with disentangled age categories. The data are binned in groups of six days each for an enhanced visualization clarity. Figure 12 : Scatter plot of the class age versus day of death. The dark red spots show the age and the date corresponding to highest values of deaths incidence.

these plots correspond to minimization procedures with all parameters free. Table 12 reports the value of the Gaussian and Gompertz integral for these different regions. In order to compare these values between different zones, in Table 13 we report the same values but normalized to the relative amount of registered inhabitants [17] .

A visual inspection of Fig. 13 shows the magnitude of the peak in the winter/spring of 2020 for the North of Italy which is not matched by a comparably populated peak for the Center, South and Islands. Table 13 confirms this impression numerically: while values of each column, for a given row (normalized by population), are comparable between zones, the value of the Gompertz peak in the North remains much bigger (actually by a factor from 10 to over 20). Table 12 to the number of inhabitants in those same regions.

The data provided by ISTAT allow for a detailed quantitative estimate of the number of deaths excesses with respect to an average baseline. This baseline is represented by a wave-like variation of the number of deaths which turns out to be almost perfectly in phase with the yearly seasonal cycle. In this paper we present a study of these excesses evaluated by a statistical interpolation of the data based on a χ 2 minimization method using a function which is the sum of a sinusoidal wave (to represent the repeated, seasonal variation of deaths) a number of Gaussian distributions to represent the excesses above the sinusoid and, finally, a Gompertz derivative to model the asymmetric peak of spring 2020. In this study we discussed the methodology adopted for the interpolations and analyze different samples by disentangling genders, ages and locations. A comparison has also been carried out between the number of deaths provided by ISTAT from the National Death Registry in the period corresponding to the first wave of the pandemic and the numbers provided by DPC in the same period for the deaths directly attributed to COVID-19. We found a rather large discrepancy, amounting to ∼ 19000 deaths over a total of ∼ 54000.

Public data on deaths in the Italian municipalities for the years 2015-2020, provided by ISTAT

The excess winter deaths measure: why its use is misleading for public health understanding of cold-related health impacts

Estimating the number of excess deaths attributable to heat in 297 United States counties

Monitoring the Covid-19 epidemics in Italy from mortality data

Investigating the impact of influenza on excess mortality in all ages in Italy during recent seasons (2013/14-2016/17 seasons)

MINUIT: a system for function minimization and analysis of the parameter errors and corrections

XXIV. On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies

A simplified estimate of the Effective Reproduction Number R t using its relation with the doubling time and application to Italian COVID-19 data

Universality in COVID-19 spread in view of the Gompertz function

The present work has been done in the context of the INFN CovidStat project that produces an analysis of the public Italian COVID-19 data. The results of the analysis are published and updated dahttps://www.overleaf.com/project/5fe1acd9a42 on the website covid19.infn.it/. We are grateful to Mauro Albani, Marco Battaglini, Gianni Corsetti and Sabina Prati of ISTAT for useful insights and discussions. We wish also to thank Daniele Del Re and Paolo Meridiani for useful discussions. The project has been supported in various ways by a number of people from different INFN Units. In particular, we wish to thank, in alphabetic order: Stefano Antonelli (CNAF), Fabio Bredo (Padova Unit), Luca Carbone (Milano-Bicocca Unit), Francesca Cuicchio (Communication Office), Mauro Dinardo (Milano-Bicocca Unit), Paolo Dini (Milano-Bicocca Unit), Rosario Esposito (Naples Unit), Stefano Longo (CNAF), and Stefano Zani (CNAF). We also wish to thank Prof. Domenico Ursino (Università Politecnica delle Marche) for his supportive contribution.