key: cord-0763953-qoxfaucv
authors: Carrión-García, Andrés; Jabaloyes, José; Grisales, Angela
title: The growth of COVID-19 in Spain. A view based on time-series forecasting methods
date: 2021-05-21
journal: Data Science for COVID-19
DOI: 10.1016/b978-0-12-824536-1.00020-4
sha: dd96eaee08c326d2e27f42acf7116c3bd2a9a7ec
doc_id: 763953
cord_uid: qoxfaucv

The current COVID-19 crisis has required the scientific community to use different approaches to analyze and understand the pattern of the pandemic's evolution. Spain is one of the most affected countries in Europe. It has suffered heavily, with thousands of deaths, and a highly stressful situation that has affected the country's health system and all facets of day-to-day life. The period considered in the analysis runs from March 4, the date of the first official COVID-19 death, to April 7, which can be considered as the end of the pandemic's growth phase. The approach used in the study is time-series analysis, a statistical tool that explores the underlying inertia in the development of social, physical, or natural phenomena to predict their future evolution. ARIMA models have been applied. Accurate prediction models have been obtained, and the relationship between some of the time series used has been identified.

time-series analyses. The first category include [20] methods such as the abovementioned SIR model, based on considering scientific (in this case, epidemiological) knowledge to build a model that is adjusted to real data by optimizing its parameters. The adequateness of SIR model has been discussed by Postnikov [21] .

The second category includes a wide variety of models that take advantage of what we can call inertia in the evolution of time series [22e24] . This inertia is the result of the effect on the measured variable of underlying factors that affect the evolution of the corresponding natural phenomenon. These models do not try to understand this effect, but instead aim to recognize regular patterns in the phenomenon's evolution and use them to forecast it.

The models used here are a well-known stochastic family of models known as ARIMA models. The acronym ARIMA corresponds to Auto Regressive (AR) Integrated (I) Moving Average (MA). The technique applied was initially developed by Box and Jenkins and is often referred to as the Box-Jenkins Analysis (Box et al. [22] , Chapters 3 and 4).

ARIMA models are extremely flexible and can adequately model a wide range of timeseries behaviors with just a few parameters [24] .

To present the model, it is necessary to define the backshift operator B. This is a symbolic operator that reduces the time index of an element in the series:

The ARIMA model is the combination of three components [22, 23] :

-The AR part: z t ¼ 4 1 z tÀ1 þ 4 2 z tÀ2 þ . þ 4 p z tÀp þ a t (35. 1) Using the backshift operator in this expression and moving all terms in z to left side, it can be written as:

where 4 p (B) is a polynomial in B of order p, called autoregressive polynomial. It represents the influence of the past p values of the series on its present value (Eq. 35.1) (thus it's named auto-regressive). The random noise element a t is considered to be a white noise process. -The MA part of the model: z t ¼ a t À q 1 a tÀ1 À q 2 a tÀ2 À . À q q a tÀq (35.2) Again, using the backshift operator in this expression, it can be written as:

where now, q q (B) is a polynomial in B of order q (MA polynomial), representing the impact in the present time-series value of the last q random inputs (Eq. 35.2) represented by a t . -And finally, the Integrated part (I), corresponding to the differences required to stabilize the series (converting the original series into a stationary one)

is called the difference operator by its effect, as it generated the difference between one element of the time series and its previous one.

When all these three elements are put together to obtain an ARIMA model, the abbreviated expression is:

Or more detailed:

And this model is referenced as an ARIMA(p,d,q) model. Orders p, d, and q may have any integer value, but are only greater than 4 or 5 in a few cases, and they are usually equal to or lower than 2.

There is an extension of ARIMA models that includes the option of cyclical evolution patterns. In this case, completely parallel models can be defined with the specificity of using a new seasonal backshift operator B s , instead of the B backshift operator, where s is the length of the seasonal period (i.e., 7 for daily data with weekly seasonality, 12 for monthly data with yearly seasonality): B s z t ¼ z tÀs In addition, a new seasonal difference operator is set out as:

Similar autoregressive and moving average polynomials, now in B s , seasonal backshift operator, can be defined as F P (B s ) and Q Q (B s ), respectively.

The final model, including both seasonal and nonseasonal patterns is represented as the ARIMA(p,d,q) (P,D,Q) s model:

The forecasting expressions of ARIMA models may have three alternative forms [23] . The first one is simply the development of Eq. (35.3). For instance, for an ARIMA(1,1,1) model, the expression will be:

In this expression, z i (i < t) are the past values of the time series, and a i (i < t) are the past model residuals. For projecting the forecasts, values of z i (i > t) are substituted by the forecasted values, and a i (i > t) are replaced by zero.

The second and third forecasts expressions are the result of moving all polynomials in B to the same term of the equation. If polynomials are moved to the left side, we refer to this expression as the p weights form:

And as the result of dividing polynomials is another polynomial, in general of infinite order

And from here:

The decision of truncating this expression can be based on the decreasing values of successive p coefficients.

Similarly, if polynomials are moved to the right side, we refer to this expression as the J weights form:

And we can obtain z t ¼ a t À J 1 a tÀ1 À J 2 a tÀ2 À J 3 a tÀ3 À . À J i a tÀi À .

Also here, the decision of truncating this expression can be based on the decreasing values of successive J coefficients.

The cross-correlation function (CCF) is a way of analyzing the similarity between two time series as a function of the displacement in time of one related to the other. The CCF is built with the cross-correlation coefficients, that is, the correlation coefficients between the elements in the two series analyzed, at different time displacements [25, 26] .

If we consider two time series z t and x t , the cross-correlation coefficient for delay (time displacement or lag) L, represented as r(L), is

The CCF is a graphical representation of the cross-correlation coefficients for delays in an interval [ÀK,þK]. By varying L above and below zero, we can obtain an image of the possible interaction between the two series, identifying the time displacements at which this influence is higher. For example, if only one significant coefficient appears at L ¼ 0, this means that the interaction between the two series is instantaneous. If a significant cross-correlation coefficient appears in L > 0, this means that changes in x t affect z t with a delay of L periods. If a significant cross-correlation coefficient appears in L < 0, this means that changes in z t affect x t with a delay of L periods.

We used the public data provided by the Spanish Ministry of Health for our study, accessible daily via web [27] and tabulated in Ref. [28] . The dataset selected included six series, corresponding to the total number of infected people detected, the cumulative figures for recovered patients, the cumulative figures for deaths, the number of patients in ICUs, and the number of hospitalized patients. In the period under study, the criteria for gathering data were maintained constant.

In Fig. 35 .1, we can appreciate the evolution of the different variables included in our dataset. The different curves show the cumulative values evolution. We can appreciate changes in the speed of growth, but it is better for recognize the evolution to use daily values instead of cumulative figures ( Fig. 35.2 ).

The number of infected people was highly dependent on the number of tests performed and was therefore discarded for our main forecasting study. The most consistent series out of the given data was the cumulative death figures, and this became the main target of this study. The number of patients in ICUs was also studied, given its key relevance to show the stress placed on the Spanish healthcare system. The series with the number of recovered patients was studied in its relationship with the infected series, and only the number of hospitalized patients was not considered in our analysis. Chapter 35 The growth of COVID-19 in Spain. A view based on time-series 649 -The third period, starting around April 4 means the beginning of the decreasing zone of the disease evolution, when the impact of Government actions finally changes the shape of the pandemic curve.

The shape of the first of these three periods, with nearly exponential growth, can be appreciated with more detail in Figs. 35. 3 and 35.4. Data are included in Annex A, at the end of this chapter. We have included not only the series used in this chapter but also the complete file for the period considered.

For the analysis, we considered the daily death figures. Though data about COVID-19 became available on February 20, the first deaths were recorded on March 4. Our analysis started on March 25, as a certain amount of data was required for significant analysis.

As literature indicates that epidemic processes frequently follow an exponential pattern in their growth phase (Chowell et al. [29, 30] ), we decided to analyze not the original data but to study their logarithms instead. The structure of the adjusted model and forecast was in line with epidemiological literature. The model was re-evaluated with each new piece of data, starting from zero, that is, starting from the model identification. This is the reason for the changes in the model used, given that these were the best models for each day.

The model was consistently an ARIMA model with two differences, autoregressive and a MA order of 1 or 2, thus varying between ARIMA(2,2,1) and ARIMA(1,2,2), except on March 30 where an ARIMA(0,2,5) fitted slightly better than the usual models.

In the first part of the period, the main aim was to forecast the peak in the death curve.

The Spanish Government enforced restrictions on people's movements on March 14 and given an incubation period of 7e14 days for the virus, the impact of these restrictions only became apparent one to two weeks later. Thus, the evolution before March 22 may be considered as the evolution pattern previous to restrictive measures.

The date for the maximum number of deaths and the forecast value for these deaths began to be brought forward as of March 25, 10 days after the confinement measures were brought in and when the death toll series has a length of at least 20 time periods (first dead was registered in March 4), a minimum to being able to perform a reasonably consistent analysis. On March 25, after an especially bad figure of persons death this day, the peak was expected on April 14 with a daily death toll of more than 5000. By the end of March, this date had been brought forward to April 3, with a death toll of about 1000. Finally, the maximum number of deaths occurred on April 2, with the terrible figure of These values require some comment, given that the forecasting horizon to identify this peak is higher than the reasonable forecasting horizon. Predicting the evolution 10 days ahead (like on March 19) or even 24 days ahead (like on March 21) was more a mere mathematical projection than a reliable forecast. This was especially true for the first part of the series, as the quality of the forecast depends on the length of the time series. The identification of the curve's peak after March 25, with horizons of less than 10 days, began to offer more reliable results, as the series became longer, and the forecast horizon grew shorter.

Once the confinement started to generate the desired effect, the model stabilized in an ARIMA(2,2,1)/ARIMA(1,2,2) for at least 10 days, before starting to change again in the downward part of the death toll curve.

The shape of the forecast evolution is shown in Fig. 35 .5, corresponding to different dates between March 20 and April 7 (March 20, 22, 26, 28, and 31, and April 2, 4, and 7). First figures correspond to the period before the death curve peaked and the last ones after this maximum figure.

Centering the analysis on short term forecasts, those more reliable, we have reviewed the accuracy of three days ahead values. One day ahead forecasts always included the real value in the 95% confidence interval. In fact, the mean number of absolute sigma (as a distance between real data and forecasted value) was 0.35 and always less than 1. The same was valid for two days ahead forecasts (mean of absolute sigma between forecast and actual value of 0.32, always less than 1) and for three days ahead forecasts (mean of absolute sigma between forecast and actual value of 0.36, always less than 1). Model parameters values changed daily, sometimes by the change in the ARIMA orders but also when orders maintain the same values, the incorporation of new data to the series produces different parameters' estimations.

As an example, two daily forecasts functions will be presented, for March 20 and for April 2. In March 20 (see Table 35 .1), selected model was an ARIMA(1,2,2), and thus the polynomial expression was:

Once developed, the forecasting direct expression is: z t ¼ ð2 þ 4 1 Þz tÀ1 À ð1 þ 24 1 Þz tÀ2 þ 4 1 z tÀ3 þ a t À q 1 a tÀ1 À q 2 a tÀ2 þ Constant And using the parameter estimates (4 1 ¼ À0.58698; q 1 ¼ 1.84775; q 2 ¼ À0.88044, constant ¼ À0.02104 (not significant)) z t ¼ 1.41302 z tÀ1 þ 0.17396 z tÀ2 À 0.058698 z tÀ3 þa t À 1.84775a tÀ1 þ 0.88044 a tÀ2 À 0.02104

In April 2, the best model was an ARIMA(2,2,1), and the corresponding expressions are:

And taking into account the parameter estimates (4 1 ¼ À1.11585; 4 2 ¼ À0.473501; q 1 ¼ 1.00713, constant ¼ À0.0382876 (not significant)) z t ¼ 0.88415 z tÀ1 À 2.705201 z tÀ2 À 0.168848 z tÀ3 À 0.473501 z tÀ4 þa t À 1:00713 a tÀ1 À 0.0382876

Patients with more serious problems and more complicated prognoses were taken into the ICUs, which became the real bottleneck, saturating hospitals in Spain, and in all countries [31, 32] . There is obviously a link between the series of deaths and that of ICU admissions. ARIMA analysis brings us a tool to verify this relationship and to understand how ICU figures affect death toll values. To do this, we analyzed the CCF between the ARIMA residuals from the death toll adjustment and the ICU admissions series. The CCF enables us to detect the possible delay of the impact of one series on the other. In our case, the cause-effect relationship is from ICUs to deaths, and in the CCF computation, we should detect important values for some cross-correlation coefficients in negative delays. Fig. 35.6 shows the CCF obtained on April 7. We selected the last day of the period under study to perform our analysis so as to take advantage of the maximum amount of data, but in previous dates, the shape was similar.

As Fig. 35.6 shows, there is a clear maximum in CC coefficients for delay À8. Symmetric coefficients around delay À8 are called usually "satellites" and are an analytical effect of the basic relation in the main delay.

The meaning of this pattern has an important interpretation: ICU admissions had an impact on the number of deaths with a delay of eight days. More precisely, the part of the evolution of the death toll values that cannot be explained by its own history, is, at least in part, explained by the number of ICU admissions eight days before.

There should be a relationship between the figures of infected people and those of recovered people. Causal models as SIR or SEIR formulates this relationship in their defining expressions [33] . We will evaluate this using CCF from the ARIMA perspective. If we compare cumulative curves, the relationship will show a high correlation coefficient, but this can be a spurious value (when computing the relationship between any two growing series, the result will be a high-correlation coefficient). Nevertheless, by logical and medical reason, a causal relation should exist and requires some analysis.

We have used the series of daily new infected persons and daily recovered persons. CCF has been used to evaluate how both series interact. To do this, first an automated forecasting model selection has been made for the evolution of daily recovered series, resulting in a weak structure known as random walk, or in terms of ARIMA models an ARIMA(0,1,0) model. CCF has been obtained then for the residuals of this model vs the series of daily infected. We have used the same time window, and the result shows that today's recovered values have a significative dependence on values of infected cases with delays 2 and 8 (with a 95% level of confidence). Fig. 35 .7 shows this CCF. 

The period considered for this study is especially relevant because it corresponds to the period in which the Spanish healthcare system moved from a situation of normality to one of maximum stress, on the verge of saturation and sometimes even beyond this boundary. After the peak in the pandemic curve, the healthcare system still had a lot of work, but the stress started to decrease, and the situation began to return to normal.

As we have seen, ARIMA models were able to adjust the evolution pattern of the COVID-19 growth phase for the Spanish case. Models with usually no more than three parameters were capable of generating accurate forecasts with a mean error of about 0.33 sigma, for one, two and three days ahead forecasts.

The use of the CCF enabled us to detect an interesting relationship between the number of ICU admissions and the number of deaths eight days later, identifying maybe a critical phase in the time spent in ICUs.

Also using CCF has showed that there is a significant relationship between recovered persons and infected cases with 2 and 8 days of delay, maybe linked with less severe cases and medium sever cases, as for severe cases the length of stay in hospital is higher, ever more than one month. These severe cases are not detected by the CCF because the length of the series involved in the analysis.

The analysis has continued after the period under study in this paper. The situation has changed in the downward phase of the pandemic's evolution in Spain. On one hand, there have been certain issues with data gathering, as data are recorded by the regional governments and then sent to the central government, and this process has suffered some discrepancies. In addition, the structure of the model has changed and will be reviewed in later research papers. 

A novel coronavirus from patients with pneumonia in China

The SARS-CoV-2 outbreak: what we know

Insight into 2019 novel coronavirus d an updated interim review and lessons from SARS-CoV and MERS-CoV

Preliminary bioinformatics studies on the design of a synthetic vaccine and a preventative peptidomimetic antagonist against the SARS-CoV-2 (2019-nCoV, COVID-19) coronavirus

The COVID-19 vaccine development landscape

From SARS-CoV to SARS-CoV-2: safety and broad-spectrum are important for coronavirus vaccine development

A search for medications to treat COVID-19 via in silico molecular docking models of the SARS-CoV-2 spike glycoprotein and 3CL protease

Ongoing clinical trials for the management of the COVID-19 pandemic

Breakthrough: chloroquine phosphate has shown apparent efficacy in treatment of COVID-19 associated pneumonia in clinical studies

Features, Evaluation and Treatment Coronavirus (COVID-19)

Recommendations on the clinical management of the COVID-19 infection by the «new coronavirus» SARS-CoV2

Candidate drugs against SARS-CoV-2 and COVID-19

A spatio-temporal analysis for exploring the effect of temperature on COVID-19 early evolution in Spain

SutteARIMA: short-term forecasting method, a case: covid-19 and stock market in Spain

Monitoring transmissibility and mortality of COVID-19 in Europe

Application of the ARIMA model on the COVID-2019 epidemic dataset

COVID-19): ARIMA Based Time-Series Analysis to Forecast Near Future

A review of modern technologies for tackling COVID-19 pandemic

Evaluating forecasting methods

Causal diagrams in systems epidemiology

Estimation of COVID-19 dynamics "on a back-of-envelope": does the simplest SIR model provide quantitative parameters and predictions?

Time Series Analysis: Forecasting and Control

Introduction to Time Series and Forecasting

Time Series Analysis with Applications in R

Checking the independence of two covariance-stationary time series: a univariate residual cross-correlation approach

Time series regression studies in environmental epidemiology

The western Africa ebola virus disease epidemic exhibits both global exponential and local polynomial growth rates

Mathematical models to characterize early epidemic growth: a review

Coronavirus Disease 2019 (COVID-19) Pandemic: Increased Transmission in the EU/EEA and the UK e Seventh Update

Intensive care management of coronavirus disease 2019 (COVID-19): challenges and recommendations

Numerical study of SARS epidemic model with the inclusion of diffusion in the system