1 Introduction

The COVID-19 is an infectious disease that manifests itself in humans infected by the SARS-CoV-2 virus. The acronym SARS stands for Severe Acute Respiratory Syndrome, which belongs to the family of Coronaviruses who have been infecting humans for a long time. On December 31, 2019, the World Health Organization (WHO) alerted to a new strain (type) of coronavirus. On January 7, 2020, Chinese authorities confirmed the identification of a new type of coronavirus, that was initially named 2019-nCoV, and on February 11, 2020, it was given the name SARS-CoV-2 [15, 28].

Those infected transmit the virus through the mouth or nose, by expelling small particles when they cough, sneeze, talk, sing or breathe. Infection occurs by inhaling these particles through the airways or by touching a contaminated surface and then coming into contact with the eyes, nose or mouth [26]. Due to the nature of the virus, it spreads more easily indoors and in crowds. Those who become ill as a result of COVID-19 experience symptoms ranging from mild and moderate, where the patient recovers without special treatment, to severe, where the patient needs specialized medical care [2].

To this date (May 06, 2023), the virus that causes COVID-19 has infected more than 270 million people and caused more than 5.3 million deaths worldwide [5, 19]. In Brazil, the numbers reach more than 37 million infected and 701 thousand dead [1]. So far, the world has paid a heavy price for this pandemic in lost human lives, economic repercussions and increased poverty. The COVID-19 pandemic is still ongoing, and the virus continues to mutate, generating variants.

This work presents time series forecasting models to predict the number of COVID-19 cases in Brazilian cities [4], covering the period from February 2020 to December 2022. We aim to improve the accuracy of the predictions by incorporating Graph Convolutional Networks (GCN) models, specifically GCRN and GCLSTM. These models leverage a mobility network that captures the commuting patterns between cities, allowing us to account for the interdependencies among cities’ time series and the influence of human mobility on disease transmission [7,8,9, 20].

The performance of our proposed GCN models is compared to two well-known forecasting methods, LSTM and Prophet. Through extensive experiments and evaluation, we obtained compelling results that highlight the effectiveness of the GCN models approach in capturing the complex dynamics of COVID-19 transmission and improving the accuracy of predictions.

This research contributes to the field of COVID-19 forecasting by demonstrating the value of incorporating mobility networks and considering the impact of human movement on disease spread. Our findings provide insights into the interplay between mobility patterns and the spread of COVID-19, and offer valuable implications for developing effective strategies to mitigate the impact of the disease.

2 Related Works

Xi et al. (2022) [27] simulate the propagation curve of COVID-19 in a university, based on the interaction between students through Graph Neural Network (GNN) and a SIR model (Susceptible - Infected - Recovered) [23]. They simulated three distancing strategies designed to keep the infection curve flat and help make the spread of COVID-19 controllable. Students and professors were considered as nodes and places or events as layers (face-to-face lectures, internal social activities, campus cafeterias, etc.). The GNN was used to validate and verify the effectiveness through the simulation of the infection curve, where the infected were randomly chosen. They provide a visualization of the dissemination process of COVID-19 and the comparison between the results of the forecast of dissemination for each of the control strategies.

With the aim of predicting the dynamics of the spread of the pandemic in the USA, the work carried out by Davahli et al. (2021) [6] developed two predictive models using GNN, fed by real data, obtaining results from the forecast of spread and a comparison between the models. The first model developed was based on graph theory neural networks (GTNN) and the second was based on neighborhood neural networks (NGNN). Each US state is considered a graph node, while the edges of the GTNN model document the functional connectivity between states, those of the NGNN model connect neighboring states with each other. The performance of these models was evaluated with existing real dataset that reflect conditions from January 22 to November 26, 2020 (prior to the start of COVID-19 vaccination in the United States). The results indicate that the GTNN model outperforms the NGNN model.

Malki et al. (2020) [17] estimate the spread of COVID-19 infection in several countries, using a supervised machine learning model developed based on a decision tree algorithm and linear regression, for time series forecasting. They show an \(R^2\) of 0.99 in the prediction of confirmed cases at the global level.

Vaishya et al. (2020) [25] deal with the application of artificial intelligence (AI) in combating the COVID-19 pandemic. The authors presented several possibilities for the use of AI in different areas, such as diagnosis, triage, patient monitoring, forecasting disease trends, drug and vaccine development, and even to ensure compliance with social distancing measures. They emphasize the importance of international collaboration to ensure access to the data needed to train AI algorithms and highlight that technology can play a key role in the fight against the pandemic.

The related works presented in this section are all focused on predicting the spread of COVID-19 using different approaches. Xie et al. (2022) used a Graph Neural Network (GNN) and SIR model to simulate the propagation curve of COVID-19 in a university, while Davahli et al. (2021) used GNN to predict the dynamics of the spread of the pandemic in the USA. Malki et al. (2020) used a decision tree algorithm and linear regression for time series forecasting, and Vaishya et al. (2020) discussed the application of AI in combating the pandemic. In contrast, our study focuses on incorporating a mobility network that describes the commuting patterns between Brazilian cities into the forecasting task, leveraging Graph Convolutional Network (GCN) models. The main hypothesis is that this approach could improve the forecasting task by accounting for the dependencies among cities’ time series. Therefore, our work contributes to the literature by emphasizing the importance of incorporating human mobility data into the forecasting task and highlighting the potential of GCN models for this purpose.

3 Methodology

3.1 Mathematical Preliminaries

\(G=(V,E)\) is a graph, in which V is the set of nodes and E the set of edges, resulting in \(N=|V|\) vertices and |E| edges. The binary (or weighted) adjacency matrix \(A \in \mathbb {R}^{N \times N}\) depicts the graph connections; \(D_{ii} = \sum _j A_{ij}\) is the degree matrix, with zeros outside the main diagonal; and \(X \in \mathbb {R}^{N \times C}\) is the feature vector of each vertex, where C is the number of vector dimensions.

3.2 GCN

The Graph Neural Networks (GNN) models allow us to extend the methods of neural networks, in order to process the data represented in the graph domain [21]. GNN models can process most graphs being some of them acyclic, cyclic, directed and undirected, or implement a particular function. The GCN proposed by [18] applies a convolution operation to each vertex of the graph. The approach has three steps: (i) Select a vertex, (ii) build a subgraph of its neighborhood and finally (iii) normalize the selected subgraph with the edge weights of neighboring vertices. Repeat the process for each vertex \(h^{k}\), building the matrix \(H^{k}\) that contains all the vertices in the layer k (refer to Fig. 2 and Eq. (1)).

The subgraph is built using the adjacency matrix A and the diagonal degree matrix of the vertices D. The size of the subgraph is delimited by the parameter K (see Fig. 1). When \(K=1\) the vertex is not subjected to the influence of neighboring vertices. In the case of \(K=2\), first-degree vertices (direct neighbors) are considered, directly connected to the active vertex. For a \(K=3\), second-degree vertices are also considered, connected to first-degree vertices. The subgraph is normalized by the weight matrix.

Fig. 1.
figure 1

Nearest neighbors parameter (K) in a Graph. In the purple color, \(K=1\) represents nodes without any neighbors. In the orange color, \(K=2\) represents nodes with first-degree neighbors, and in the red color, \(K=3\) represents nodes with second-degree neighbors. (Color figure online)

When considering the subgraph of a node, we calculate the average feature vector by incorporating information from its neighbors. This process assigns greater importance to the less connected neighbors (lower degree) while also giving higher weight to the node’s connections. Let \(\widehat{A} = A + I_N\), A being the weighted adjacency matrix of the subgraph, \(I_N\) being the identity matrix of order N, and \(\widehat{D}\) is the diagonal degree matrix of \(\widehat{A}\). The output of the convolution layer \(k+1\) is as follows:

$$\begin{aligned} H^{k+1} = \sigma (\widehat{D}^{-1/2}\widehat{A}\widehat{D}^{-1/2}H^{k}W^{k+1}), \end{aligned}$$
(1)

where W is the matrix of trainable parameters, and \(\sigma \) is the activation function.

In summary, each vertex in the graph is assigned a feature vector and convolution is applied at each node to compute a new feature vector that considers the neighborhood of the node.

3.3 GCN Models for Time Series Prediction

The Graph Convolutional Recurrent Networks (GCRN) [22] and Graph Convolutional LSTM (GCLSTM) [3] models are architectures based on GCN models that are designed with graph processing and machine learning techniques for the purpose of prediction of data in time series. The convolutions in both architectures differ from a conventional convolution using the relationship between neighbors at each vertex of the graph.

To predict a future sequence, of length F, we use a sequence, of length L (lags), of observed data at previous timestamps:

$$\begin{aligned} \hat{x}_{t+1}, ... , \hat{x}_{t+F} = \genfrac{}{}{0.0pt}{}{arg max}{x_{t+1}, ... , x_{t+F} } P(x_{t+1}, ... , x_{t+F} | x_{t-L+1}, ... , x_{t}) \end{aligned}$$
(2)

where \(x_{t}\) is a snapshot at time t of the features and \(\hat{x}_{t}\) is the prediction (see Fig. 2).

Hence, \(x_{t}\) represents a value of the time series at time t for all nodes, defined on a directed, weighted static graph. On the other hand, a snapshot X can be interpreted as a matrix \(X \in \mathbb {R}^{N \times C}\), which captures the temporal evolution of the time series over the interval of C observations.

Fig. 2.
figure 2

Spatial and Temporal Snapshot Data Flow and Model Architecture: Processing GCNs with GCLSTM and GCRN Recurrences

The GCRN and GCLSTM models use the same design but with their own convolution blocks: in Fig. 2, the GCRN model uses GConvGRU blocks whereas the GCLSTM model uses GConvLSTM blocks. Both use deep learning techniques, but GCRN uses recurrent networks and GCLSTM uses LSTM cells, in order to capture patterns in time series data considering the influence of its neighbors through the graph.

The graph generated in this work has the spatial structure of the Brazilian mobility network (refer to Sect. 3.5), which has static edges, that is, fixed attributes (see Fig. 2).

3.4 Prophet and LSTM

LSTM (Long Short Term Memory) [12] and Prophet [24] are time series forecasting models, and each one has a different approach. In the LSTM model it resembles an RNN (Recurrent Neural Network) composed of 4 memory cells: Input, Forget, Cell, Hidden. These memory cells are responsible for retaining, forgetting or remembering selected information. LSTM is commonly used on data that exhibit complex patterns, sudden changes, trends, or seasonality.

The Prophet model, is based on an additive model that, through regression, models the linear or non-linear trend, in terms of independent variables such as trend, seasonality and holidays, components of the time series. The adjustments of these components are performed using an optimization algorithm such as MCMC [11] (Markov Chain Monte Carlo) or MAP [10] (Maximum A Posteriori) that seeks to find the values of the parameters that maximize the posterior probability of the data, considering the uncertainty of the parameters when making predictions.

3.5 Datasets: COVID-19 Cases and the Brazilian Mobility Network

Cota (2020) [4] provides a public dataset of COVID-19 records in Brazil. Among other data, it contains the numbers of infected and deaths related to municipal and federative units level in Brazil with official information from the Ministry of Health. The dataset is daily updated from official sources, providing reliable information. It also includes information on the cities’ locations, which is useful for our mobility network.

Knowing that the spread of diseases like COVID-19 occurs from infected to susceptible people, one factor that may contribute to the construction of prediction models is to know the flow of people between cities. We thus use the dataset provided by the Brazilian Institute of Geography and Statistics (IBGE) [16] to build a mobility network whose weighted connections represent the number of vehicles from one city to another in an ordinary week. The connection identifies the cities directly accessible from a source, considering (1) the number of outgoing vehicles (frequency), (2) the type of transport (road, waterways, or alternative/informal), (3) travel time, and (4) cost of the traveling. In this study, the weight of each edge was represented solely by attribute 1. This weight was determined by the aggregation of frequencies associated with each mode of transportation (denoted as attribute 2) present on a given edge. It should be noted that attributes 3 and 4 were not incorporated into the calculations in this analysis. The dataset has \(N = 5385\) vertices (cities) and \(L = 65639\) edges (routes between the cities), as shown in Fig. 3. The subfigure on the right side gives the covered territory, in green, and the one on the left side presents the corresponding network.

Fig. 3.
figure 3

Brazilian mobility network: a) Nodes representing cities and edges as the routes between them; b) The covered territory, in green. White areas inside the Brazilian map are cities without reported cases in our COVID-19 dataset. (Color figure online)

3.6 Experimental Setup

The COVID-19 dataset has been structured to train models under four distinct scenarios: i) utilizing only the data from 2020; ii) using the 2021 data exclusively; iii) relying solely on the 2022 data; and finally, iv) a comprehensive version incorporating all years. The data undergoes standardization using Z-score normalization (Eq. (3)) to enable consistent and meaningful analysis.

$$\begin{aligned} z = \frac{(x -\mu )}{\sigma } \end{aligned}$$
(3)

Here, x represents the value of the element, \(\mu \) is the population mean, and \(\sigma \) is the population standard deviation. This normalization process ensures uniformity in the data and facilitates reliable comparisons. In each of the four scenarios, the dataset is partitioned into an 80% subset for training and a 20% subset designated for testing. Each city has 1,009 individual readings of COVID-19 case numbers, with each reading representing a single day’s data.

These readings are grouped into sequences of 14 consecutive days (lags), which are utilized to predict the data for the 15th day. Consequently, each snapshot corresponds to a time series comprised of non-overlapping samples, each size 14. The subsequent day, the 15th, is the target prediction and assumes the first position in the succeeding time series (see Fig. 4). The entire 1009-day period, corresponding to February 2020 to November 2022, results in approximately 57 training snapshots and 15 test snapshots. Lauer et al. [14] provide empirical evidence supporting the recommendation of a 14-day active monitoring period by the US Centers for Disease Control and Prevention. Their study reinforces the importance of this time frame in effectively monitoring and controlling the spread of COVID-19. Regarding the training of the models is parameterized for a total of 100 epochs for the GCRN, GCLSTM, and LSTM models. In these models, an Adam optimizer [13] is used, which adjusts the model’s weights during training, minimizing the Mean Square Error Loss.

Fig. 4.
figure 4

Mobility Network Map and Spatial Time Series Data in Brazil: Input Dataset Overview

As Adam is a stochastic gradient descent optimizer, it has an element of randomness, so we perform 10 rounds of experiments for each model. The Adam optimizer was chosen because it uses two moments to update the model weights: the gradient moment and the quadratic gradient moment [13]. At the gradient moment, it adapts the learning rate for each parameter based on how often the parameter is updated. Not all cities update cases daily, so the data does not have a unified update frequency. In the quadratic gradient moment, the weighting of the learning rate is performed by the moving average of the squares of the gradients of each parameter individually, adjusting the model weights.

To present and compare the results of the tests between the neural networks, we use the Root Mean Square Error (RMSE) and the \(R^2\) score (Coefficient of Determination). The RMSE,

$$\begin{aligned} RMSE = sqrt(\frac{1}{n}\sum _{i=1}^{n}(y_i-\hat{y}_i)^2), \end{aligned}$$
(4)

is a measure of dispersion that calculates the square root of mean square errors between predictions and actual values, i.e., it indicates how well the model’s predictions fit the data. The lower the RMSE, the better the model performs against the test data.

The \(R^2\) score,

$$\begin{aligned} R^2=1-\frac{\sum ({y_i}-\hat{y_i})^2}{\sum (y_i-\bar{y})^2}, \end{aligned}$$
(5)

indicates how much the independent variables explain the variation of the dependent variable. It belongs to the interval [0, 1], where 1 indicates that all variations in the dependent variable are explained by the independent variables, and it is 0 when there is no relationship between the variables. Therefore, the closer to 1, the better the model.

4 Results

The Tables 1 and 2 have references to those years, which we will comment on. As we train GCRN, GCLSTM, and LSTM models 10 times each, the results are presented in terms of mean, max, min, and std values of RMSE and \(R^2\).

Table 1. GCRN model results in terms of RMSE and \(R^2\) score

4.1 GCRN

The results of GCRN (see Table 1) in 2020, the first year of the COVID-19 pandemic, are: \(RMSE = 1495.18\), while in 2021 and 2022, the values decreased significantly to 938.39 and 819.62, respectively. In the last two years, the mean value error became smaller, but the number of cases continued to grow.

The highest min error value was recorded in 2020 and the lowest min error value was recorded in 2021, with values of 1364.51 and 643.26, respectively, while in 2022 the max error value decreased to 1034.46 even with a greater variation than in the previous years. Note that the standard deviation in the years 2020 and 2022 are close, and in the year 2021 it doubles in relation to them. These factors are indicators that in the year 2021 there was a change in the dynamics of spreading the disease.

The highest RMSE for the three years combined suggests potential variations in the dynamics of COVID-19 spread over time, which could be attributed to various factors influencing the transmission patterns. These factors may include governmental interventions, changes in population behavior, the emergence of new variants, variations in testing strategies, healthcare capacity, public adherence to preventive measures, and other external factors that could impact the transmission dynamics of the virus.

Regarding the \(R^2\) score, the models showed low variation in the standard deviation and a minimum value of 0.96 for 2020, which still represents a high prediction capacity of the model.

Figure 5b gives a visualization of the model predictions on the training set, with most points laying on the main diagonal as the high \(R^2\) values suggest. Figure 5a presents a forecasting example, for November 20, with the model trained with all years. We highlight Sao Paulo, Manaus, and Brasilia as being state capitals in different regions of Brazil.

Fig. 5.
figure 5

GCRN trained with the dataset of all years (2020 to 2022): a) Result of the predicted and true number of cases in each city on November 29, 2022. The true values of Manaus, Sao Paulo and Brasilia are highlighted with red crosses; b) Linear regression between true and predicted values. (Color figure online)

4.2 GCLSTM

Table 2 introduces the results for the GCLSTM model. As in the GCRN: the average RMSE for 2020 is the highest compared to 2021 and 2022; there is a decreasing trend from 2020 to 2022; and the error for all years is the largest. The \(R^2\) score also decreases over the years and is better than the one obtained for 2020.

Table 2. GCLSTM model results in terms of RMSE and \(R^2\) score

The experiment is repeated, but for the GCLSTM model, the results are shown for the \(R^2\) score metric for Table 2 in Fig. 6b and also the results of prediction values compared to actual values for each city (see Fig. 6a).

Fig. 6.
figure 6

GCLSTM trained with the dataset of all years (2020 to 2022): a) Result of the predicted and true number of cases in each city on November 29, 2022. The true values of Manaus, Sao Paulo, and Brasilia are highlighted with red crosses; b) Linear regression between true and predicted values. (Color figure online)

4.3 Comparing Forecast Models Results

The lowest RMSE values are highlighted on Table 3, referring to the dataset of all years (from 2020 and 2022). Note that the GCRN and GCLSTM models present similar results concerning mean, max, min, and standard deviation. Interestingly, they achieved lower values of standard deviation, which shows their consistency.

The Prophet model stands out in the average and at the minimum, where it has the lowest values among the models. However, its standard deviation is 3 times greater than the lowest value found in the GCLSTM model and the maximum RMSE is 14 times greater than the maximum observed for GCRN. The model is incredibly good for most data points, but really bad in some cases, showing less stability than GCRN and GCLSTM.

The LSTM model did not obtain any statistics highlighted as the best and can be considered the model with the worst performance, although its minimum RMSE is the second best among all.

Table 3. RMSE values of the forecast models of COVID-19 cases in Brazil.

The results show that the mobility network indeed helps to forecast COVID-19 cases when embedded in machine learning models. LSTM alone presents the worst RMSE of all models, but when it is integrated with the GCN, via the GCLSTM, the predictions improve. Although Prophet gives the best mean RMSE, it fluctuates more than the graph-based models, showing that sometimes it outperforms them, but is not as stable.

5 Discussions and Conclusions

This studyFootnote 1 investigates the application of GCNs for COVID-19 spread in Brazilian cities. Combining graphs with temporal forecasting models offers advantages. Integrating mobility networks, particularly GCNs with LSTM through GCLSTM, improves prediction accuracy over LSTM alone.

Among the models evaluated, LSTM performed the worst, with Prophet showing the best mean RMSE but higher fluctuations and less stability compared to the graph-based models (GCRN and GCLSTM). The GCN models achieved high \(R^2\) score values above 0.95, representing the data with at least 95% certainty, the models also demonstrated similar results in terms of mean, maximum, minimum, and standard deviation of RMSE. Notably, they achieved lower standard deviation values, indicating higher consistency in their predictions.

In conclusion, the integration of mobility networks through GCNs in machine learning models has proven effective in forecasting COVID-19 cases. The GCN models (GCRN and GCLSTM) showed improved stability and consistency compared to Prophet, while LSTM alone exhibited the worst performance. The models’ performance highlights the importance of considering spatial and temporal data in forecasting tasks, providing valuable insights for epidemiological surveillance and informing strategies to combat COVID-19.

One limitation of our analysis is that it only focuses on one-day predictions. However, it is crucial to extend the prediction window to longer time frames, considering the virus’s incubation and transmission periods, as well as travel time between locations.

The strength of incorporating mobility data lies in our approach using GCNs, which enables us to capture the connectivity between cities and integrate changes in neighboring time series into the local predictions of each city. These changes in neighboring series have a time delay due to the incubation and transmission factors of the virus. By considering these factors and exploring longer prediction windows, we anticipate achieving more significant results in leveraging mobility networks for COVID-19 case prediction.

For future work, we will address the suggestion of conducting more in-depth analyses to assess the impact of including spatial data and applying GCNs for longer-term predictions. Additionally, we will investigate the convergence of the training process and explore ways to enhance the interpretability and explainability of the temporal and spatial data by the GCN models, particularly by examining the stabilization of the loss after 100 epochs, so that we can gain deeper insights into the underlying mechanisms driving the predictions of GCN models and enhance our understanding of the complex interplay between temporal and spatial factors in the spread of COVID-19.

Also, considering that GCN-based models may require more training time (epochs) compared to Prophet, further investigation into the impact of training duration on model performance could be valuable.

Lastly, exploring unexplored graph network regions is a promising research avenue. It can reveal COVID-19 spread insights and uncover hidden data patterns or relationships.