title: Pay Attention to Evolution: Time Series Forecasting with Deep Graph-Evolution Learning
authors: Spadon, Gabriel; Hong, Shenda; Brandoli, Bruno; Matwin, Stan; Rodrigues-Jr, Jose F.; Sun, Jimeng
date: 2020-08-28

Time-series forecasting is one of the most active research topics in predictive analysis. A still open gap in that literature is that statistical and ensemble learning approaches systematically present lower predictive performance than deep learning methods, as they generally disregard the sequential aspect entangled with multivariate data represented by more than one time series. Conversely, this work presents a novel neural network architecture for time-series forecasting that combines the power of graph evolution with deep recurrent learning on distinct data distributions; we named our method Recurrent Graph Evolution Neural Network (ReGENN). The idea is to infer multiple multivariate relationships between co-occurring time-series by assuming that the temporal data depends not only on inner variables and intra-temporal relationships (i.e., observations from itself) but also on outer variables and inter-temporal relationships (i.e., observations from other-selves). An extensive set of experiments was conducted comparing ReGENN with dozens of ensemble methods and classical statistical ones, showing a sound improvement of up to 64.87% over the competing algorithms. Furthermore, we present an analysis of the intermediate weights arising from ReGENN, showing that by looking at inter- and intra-temporal relationships simultaneously, time-series forecasting is greatly improved when we pay attention to how multiple multivariate data synchronously evolve.

Time series refers to the persistent recording of a phenomenon along time, a continuous and intermittent unfolding of chronological events subdivided into past, present, and future. In the last decades, time series analysis has been vital to predict dynamical phenomena on a wide range of applications, varying from climate change [1]-[3], financial market [4], [5], land use monitoring [6], [7], anomaly detection [8], [9], energy consumption and price forecasting [10], besides epidemiology and healthcare studies [11]-[15]. On such applications, an effective data-driven decision frequently requires precise forecasting based on time series [16]. A prime example is the SARS-CoV-2, COVID-19, or Coronavirus Pandemic [17], which is known for being highly contagious and causing increased pressure on healthcare systems worldwide. In this case, time-series modeling [...] Graph Convolutional Networks (GCN) [31], the latter with significant applications on traffic forecasting [32], [33]. Meanwhile, others used unsupervised auto-encoders on time-series forecasting tasks [34], [35].

Fig. 1: An example of a multiple multivariate time-series forecasting problem, where each multivariate time-series (i.e., sample) shares the same domain, timestream, and variables. When stacking the time-series together, we assemble a tridimensional tensor with the axes describing samples, timestamps, and variables. The multiple samples have the same variables recorded during the same timestamps, meaning that samples are unique, but every sample is observed in the same way. By tackling the problem altogether, we leverage inner and outer variables besides intra- and inter-temporal relationships to improve the forecasting.
Notwithstanding, all former approaches are bounded to a bidimensional space in which forecasting time-series can be summarized by a non-linear function between time and variables. From a different perspective, we hypothesize that time-series are not only dependent on their inner variables, which are observations from themselves, but also on outer variables provided by different time series that share the same timestream. For instance, the evolution of a biological species is not related solely to observations from itself, but also from other species that share the same environment, as they are all part of the same food chain. By considering the variables and the dependency aspect during the analysis, the time series gains an increased dimensionality. A previously considered bidimensional problem, in which the forecasting ability of a model comes from observing relationships of variables over time, now becomes tridimensional, where forecasting means understanding the entanglement between variables of different time-series that co-occur in time. Accordingly, a time-series defines an event that is not a consequence of a single chain of observations, but of a set of synchronous observations of many time-series.

For example, during the Coronavirus Pandemic, it is paramount to understand the disease's time-aware behavior in every country. Despite progressing in different moments and locations, the underlying mechanisms behind the pandemic are supposed to follow similar (and probably interconnected) patterns. Along these lines, looking individually at the development of the pandemic in each country, one can describe the problem in terms of multiple variables, like the number of confirmed cases, recovered people, and deaths. However, when looking at all countries at once, the problem yields an additional data dimension, and each country becomes a multivariate sample of a broader problem, such as depicted in Fig. 1. We refer to such a problem as multiple multivariate time-series forecasting.

Along with these premises, in this study, we contribute an unprecedented neural network that emerges from a graph-based time-aware auto-encoder with linear and non-linear components working in parallel to forecast multiple multivariate time-series simultaneously, named Recurrent Graph Evolution Neural Network (REGENN). We refer to evolution as the natural progression of a process in which the neural network iteratively optimizes a graph representing observations from the past until it reaches an evolved version of itself that generalizes on future data still to be observed. Accordingly, the underlying network structure of REGENN is powered by two Graph Soft Evolution (GSE) layers, a further contribution of this study. GSE stands for a graph-based learning-representation layer that enhances the encoding and decoding processes by learning a graph shared across different time-series and timestamps. The results we present are based on an extensive set of experiments, in which REGENN surpassed a set of 49 competing algorithms from the fields of deep learning, machine learning, and time-series; among which are single-target, multi-output, and multitask regression algorithms in addition to univariate and multivariate time-series forecasting algorithms. Aside from surpassing the state-of-the-art, REGENN remained effective after three rounds of 30 ablation tests through distinct hyperparameters.
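To make the tensor of Fig. 1 concrete, the sketch below stacks several multivariate series that share the same timestream and variables into a single tridimensional tensor. The sizes mirror the SARS-CoV-2 setting described later (188 countries, 120 days, 3 variables), but the code and names are illustrative only, not part of the original implementation.

```python
import torch

# Stack multiple multivariate time-series that share the same timestream and
# variables into the tridimensional tensor of Fig. 1 (samples x timestamps x variables).
s, t, v = 188, 120, 3                                 # illustrative sizes only

# One (t x v) multivariate series per sample, e.g., one per country.
series_per_sample = [torch.rand(t, v) for _ in range(s)]

# Stacking along a new first axis yields the tensor T in R^{s x t x v}.
T = torch.stack(series_per_sample, dim=0)
print(T.shape)                                        # torch.Size([188, 120, 3])
```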
All experiments were carried out over the SARS-CoV-2, Brazilian Weather, and PhysioNet datasets, detailed in the Methods. In the task of epidemiology modeling on the SARS-CoV-2 dataset, we had improvements of at least 64.87%. We outperformed the competing algorithms on the task of climate forecasting on the Brazilian Weather dataset by at least 11.96%, and on patient monitoring in intensive care units on the PhysioNet dataset by 7.33%. Furthermore, we analyzed the results by using the cosine similarity on the Evolution Weights from the GSE layers, which are the intermediate hidden adjacency matrices that arise from the graph-evolution process, showing that graphs shed new light on the understanding of non-linear black-box models.

Tab. 1: Summary of notations.
ω ∈ N+ : Sliding window size
w, z ∈ N+ : Number of training and testing (i.e., stride) timestamps
s, t, v ∈ N+ : Number of samples, timestamps, and variables
T ∈ R^{s×t×v} : Tensor of multiple multivariate time-series
Y ∈ R^{s×ω×v} : Batched input of the first GSE and the Autoregression layers
Y_α ∈ R^{s×ω×v} : Output of the first GSE and input of the encoder layers
Y_ε ∈ R^{s×ω×v} : Output of the encoder and input of the decoder layers
Ẏ_ε ∈ R^{s×z×v} : Output from the first recurrent unit and input to the second one
Ẏ ∈ R^{s×z×v} : Output of the second recurrent unit and input of the second GSE layer
Y_ψ ∈ R^{s×z×v} : Non-linear output yielded by the second GSE layer
Y_λ ∈ R^{s×z×v} : Linear output provided by the Autoregression layer
Ŷ ∈ R^{s×z×v} : Final result from the merging of the linear and non-linear outputs
G = ⟨V, E⟩ : Graph in which V is the set of nodes and E the set of edges
A ∈ R^{v×v} : Adjacency matrix of co-occurring variables
A_µ ∈ R^{v×v} : Adjacency matrix shared between GSE layers
A_φ ∈ R^{v×v} : Evolved adjacency matrix produced by the second GSE layer
U ∘ V : Batch-wise Hadamard product between matrices U and V
U · V : Batch-wise scalar product between matrices U and V
‖·‖_F : The Frobenius norm of a given vector or matrix
ϕ(·) : Dropout regularization function
σ_g(·) : Sigmoid activation function
σ_h(·) : Hyperbolic tangent activation function
cos θ(·) : Cosine matrix-similarity
RELU(·) : Rectified linear unit
SOFTMAX(·) : Normalized exponential function

Since higher dimensional time-series forecasting is a topic to be further explored, we understand REGENN has implications in other research areas, like economics, social sciences, and biology, in which different time-series share the same timestream and co-occur in time, mutually influencing one another. In order to present our contributions, this paper is further divided into four sections. We begin by proposing a layer and a neural network architecture, besides detailing the methods used along with the study. Subsequently, we display the experimental results compared to previous literature. Next, we provide an overall discussion on our proposal and the achieved results. Finally, we present the conclusions and final remarks. The supplementary material exhibits extended results and additional methods.

Hereinafter, we use bold uppercase letters to denote multidimensional matrices (e.g., X), bold lowercase letters for vectors (e.g., x), and calligraphic letters for sets (e.g., X). Matrices, vectors, and sets can be used with subscripts. For example, the element in the i-th row and j-th column of a matrix is X_ij, the i-th element of a vector is x_i, and the j-th element of a set is X_j. The transposed matrix of X ∈ R^{m×n} is X^T ∈ R^{n×m}, and the transposed vector of x ∈ R^{m×1} is x^T ∈ R^{1×m}, where m and n are arbitrary dimensions; further symbols are defined as needed, but Tab. 1 presents a summary of the notations.
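As a quick reference, the snippet below maps the main entries of Tab. 1 to tensor shapes; the numeric sizes are placeholders chosen for illustration, not the datasets' real dimensions.

```python
import torch

# Illustrative shape glossary matching Tab. 1: s samples, t timestamps,
# v variables, window size omega, and stride size z (placeholder sizes).
s, t, v, omega, z = 4, 30, 3, 7, 14

T     = torch.rand(s, t, v)       # tensor of multiple multivariate time-series
Y     = torch.rand(s, omega, v)   # batched input of the first GSE / Autoregression layers
Y_hat = torch.rand(s, z, v)       # final prediction, merging linear and non-linear outputs
A     = torch.rand(v, v)          # adjacency matrix of co-occurring variables
```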
Graph Soft Evolution (GSE) stands for a representation-learning layer that, given a training dataset, builds a graph in the form of an adjacency matrix, as in Fig. 2. The GSE layer receives no graph as input, but a set of multiple multivariate time-series. The graph is built by tracking pairs of co-occurring variables, one sample at a time, and merging the results into a single co-occurrence graph shared among samples and timestamps. We define co-occurring variables as two variables, from a multivariate time-series, with a non-zero value in the same timestamp; in such a case, we say that one variable influences another and is influenced back. The co-occurrence graph is the projection of a tridimensional tensor, T ∈ R^{s×w×v}, into a bidimensional one, A ∈ R^{v×v}, describing pair-wise time-invariant relationships among variables.

The co-occurrence graph G = ⟨V, E⟩ is symmetric and weighted. It is composed of a set V of |V| nodes equal to the number of variables, and another set E of |E| non-directed edges equal to the number of co-occurring variables. A node v ∈ V corresponds to a variable from the time-series' multivariate domain, and an edge e ∈ E is an unordered pair ⟨u, v⟩ ≡ ⟨v, u⟩ of co-occurring variables u, v ∈ V. The weight f of the edges corresponds to the summation of the values of the variables u, v ∈ V whenever they co-occur in time, such that f(u, v) = Σ_{i=0}^{s-1} Σ_{j=0}^{w-1} (T_{i,j,u} + T_{i,j,v}). Notice that the whole graph is bounded to w, the number of timestamps existing in the training portion of the input tensor, and if a pair of variables never co-occurs in such a subset of data, no edge will be assigned to the graph, which means that ⟨u, v⟩ ∉ E and f(u, v) = 0.

Given an adjacency matrix A ∈ R^{v×v}, we formulate a GSE layer through Eqs. 1.1 to 1.3, where W_α, W_η, W_µ ∈ R^{v×v} are the weights and b_α, b_η, b_µ ∈ R^v the biases to be learned.

Fig. 2: Illustration of the Graph Soft Evolution layer's representation-learning, in which the set of multiple multivariate time-series is mapped into adjacency matrices of co-occurring variables. The matrices are element-wise summed to generate a shared graph among samples, which, after a linear transformation, goes through a similarity activation-like function and is scaled by an element-wise multiplication to produce an intermediate adjacency matrix holding the similarity properties inherent to the shared graph.

In further detail, in Eq. 1.1, the layer starts by employing a linear transformation on the shared adjacency matrix, which, after multiple iterations, provides a more generic version of the matrix across the samples. Subsequently, in Eq. 1.2, it uses the cosine similarity on the output of Eq. 1.1, which works as an intermediate activation-like function that provides a similarity index for each pair of variables; see the supplementary material. The resulting matrix goes through an element-wise matrix multiplication to transform it back into an adjacency matrix while holding the similarity properties inherent to the shared graph. Following, in Eq. 1.3, it performs a batch-wise matrix-by-matrix multiplication between the adjacency matrix from Eq. 1.2 and the batched input tensor (i.e., Y) so as to combine the information from the graph, which generalizes samples and timestamps, with the time-series. The result is followed by a dropout regularizer [36] and a batch-wise matrix-by-matrix multiplication, producing the final features from joining both tensors.
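The shared co-occurrence graph can be computed directly from the training slice of the tensor. The sketch below follows the definition of f(u, v) given above, accumulating the values of two variables whenever both are non-zero at the same timestamp; the function name and the looping strategy are ours, not the authors' implementation.

```python
import torch

def co_occurrence_graph(T_train: torch.Tensor) -> torch.Tensor:
    """Builds the shared co-occurrence adjacency matrix A in R^{v x v} from the
    training slice T_train in R^{s x w x v}. Two variables co-occur when both
    are non-zero at the same timestamp; the edge weight accumulates their
    values over every co-occurrence, as in f(u, v) above."""
    s, w, v = T_train.shape
    A = torch.zeros(v, v)
    nonzero = T_train != 0                             # (s, w, v) occurrence mask
    for u in range(v):
        for q in range(v):
            if u == q:
                continue                               # no self-loops in the sketch
            mask = nonzero[..., u] & nonzero[..., q]   # timestamps where both occur
            A[u, q] = (T_train[..., u][mask] + T_train[..., q][mask]).sum()
    return A  # symmetric and weighted; zero entries mean the pair never co-occurs

# Hypothetical usage on the tensor assembled earlier, restricted to the first w steps:
# A = co_occurrence_graph(T[:, :w, :])
```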
The evolution concept comes from the cooperation between two GSE layers, one at the beginning (i.e., right after the input) and the other at the end (i.e., right before the output) of a neural network, such as in the example shown in Fig. 3. As evolution arises from sharing hidden weights between a pair of non-sequential layers, we named this process Soft Evolution. Accordingly, the first layer (i.e., source) aims to learn the weights that will scale the matrix and produce A_µ. Such a result is the input of the second GSE layer (i.e., target), and it will be used for learning the evolved version of the adjacency matrix, referred to as A_φ and produced as in Eq. 1.1. Notice that, in Fig. 3, the source layer is different from the target one, as we disregard the regularizer ϕ, the trainable weights W_α, and the bias b_α from Eq. 1.3. This is because they aim to enhance the feature-learning processes when multiple layers are stacked together. As the last layer, GSE provides the output from already learned features through a scalar product between the data propagated throughout the network, i.e., Ẏ, and the intermediate evolved adjacency matrix, i.e., A_φ.

Fig. 3: Graph Soft Evolution layers assembled for evolution-based learning. In such a case, the output of the first GSE layer (i.e., source) will feed further layers of the neural network, whose result goes through the second GSE layer (i.e., target). The GSE, as the last layer, does not use regularizers or linear transformations before the output. Contrarily, it provides the final predictions by the scalar product between the output of the representation-learning process and the data propagated throughout the network.

One can see that the source GSE layer has two constant inputs, the graph and the input tensor. Differently, the target GSE layer has two dynamic inputs, the shared graph from the source GSE layer and the input propagated throughout the network. In the scope of this work, we use an auto-encoder in between GSE layers to learn data codings from the output of the source layer, which will be decoded into a representation closest to the expected output and later re-scaled by the target layer. In this sense, while the first layer learns a graph from the training data (i.e., past data), working as a pre-encoding feature-extraction layer, the second layer re-learns (i.e., evolves) a graph at the end of the forecasting process based on future data, working as a post-decoding output-scaling layer. When joining the GSE layers with the auto-encoder, we assemble the Recurrent Graph Evolution Neural Network (REGENN), introduced in detail as follows.

REGENN is a graph-based time-aware auto-encoder with linear and non-linear components with parallel data-flows working together to provide future predictions based on observations from the past. The linear component is the autoregression implemented as a feed-forward layer, and the non-linear component is made of an encoder and a decoder module powered by a pair of GSE layers. Fig. 4 shows how these components communicate from the input to the output, and, in the following, we detail their operation.

Fig. 4 (caption fragment): [...] The non-linear component, however, has an auto-encoder and two GSE layers. The last GSE layer, although equal to the first, yields an early output as it is not stacked with another GSE layer.
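Before detailing each component, the following rough sketch illustrates how a GSE layer could combine the shared graph with the batched input along the lines of Eqs. 1.1 to 1.3. It is an interpretation of the textual description, with hypothetical names and shapes, and is not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSESketch(nn.Module):
    """Rough sketch of a Graph Soft Evolution layer inferred from the textual
    description of Eqs. 1.1-1.3; not the authors' exact formulation."""
    def __init__(self, v: int, is_last: bool = False, p: float = 0.1):
        super().__init__()
        self.graph_linear = nn.Linear(v, v)    # Eq. 1.1: linear transform of the shared graph
        self.feature_linear = nn.Linear(v, v)  # Eq. 1.3: feature transform (skipped in the target layer)
        self.dropout = nn.Dropout(p)
        self.is_last = is_last

    def forward(self, Y: torch.Tensor, A: torch.Tensor):
        A_mu = self.graph_linear(A)                                        # Eq. 1.1
        sim = F.cosine_similarity(A_mu.unsqueeze(1), A_mu.unsqueeze(0), dim=-1)
        A_hat = sim * A_mu                                                 # Eq. 1.2: similarity-scaled graph
        if self.is_last:
            return Y @ A_hat, A_mu          # target layer: plain scalar product, no dropout/weights
        out = self.feature_linear(self.dropout(Y @ A_hat))                 # Eq. 1.3
        return out, A_mu                    # A_mu is the graph shared with the target GSE layer
```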
The non-periodical changes and constant progressions of the series across time usually decrease the performance of the network. That is because the scale of the output loses significance compared to the input, which comes from the complexity and non-linear nature of neural networks in tasks of time-series forecasting [21]. Following a systematic strategy to deal with such a problem [37], [38], REGENN leverages an Autoregressive (AR) layer working as a linear feed-forward shortcut between the input and output, which, for a tridimensional input, is algebraically defined in Eq. 2 as a projection of the window axis onto the stride axis, Y_λ[i, :, k] = W^T Y[i, :, k] + b for every sample i and variable k, where W ∈ R^{ω×z} are the weights and b ∈ R^z the bias to be learned. The output of the linear component, i.e., Y_λ ∈ R^{s×z×v} as in Eq. 2, is element-wise added to the non-linear component's output, i.e., Y_ψ ∈ R^{s×z×v}, so as to produce the final predictions of the network Ŷ ∈ R^{s×z×v}, formally given as Ŷ = Y_λ + Y_ψ. Subsequently, we describe the auto-encoder functioning that produces the non-linear output of REGENN.

We use a non-conventional Transformer Encoder [25], which employs self-attention, to learn an encoding from the features forwarded by the GSE layer. The self-attention consists of multiple encoders joined through the scaled dot-product attention into a single set of encodings through multi-head attention. The number of features expected by the Transformer Encoder must be a multiple of the number of heads in the multi-head attention. Our encoder's non-conventionality comes from the fact that the first GSE layer's output goes through a single scaled dot-product attention in a single-head attention task. That is because the number of features produced by the encoder is equal to the length of the sliding window, and through single-head attention, the window can assume any length. The encoder module is defined in Eqs. 3a to 3d, where the self-attention in Eq. 3a is a particular case of the multi-head attention, in which the input query Q, key K, and value V of the scaled dot-product attention, i.e., SOFTMAX(Q · K^T ÷ √d_k) · V, are equal, and d_k is the dimension of the keys. The attention results are followed by a dropout regularization [36], a residual connection [39], and a layer normalization [40], as in Eq. 3b, which ensure generalization. The first two operations work to avoid overfitting and gradient vanishing, while the last one normalizes the output such that the samples among the input have zero mean and unit variance, i.e., γ(Δ(Y_ε + ϕ(Y_ε))) + β, where Δ is the normalization function, and γ and β are parameters to be learned. After, in Eq. 3c, the intermediate encoding goes through a double linear layer, a point-wise feed-forward layer, which, in this case, consists of two linear transformations in sequence with a ReLU activation in between, having the weights W_ε, W_ι and biases b_ε, b_ι as optimizable parameters. Finally, the transformed encoding goes through one last set of generalizing operations, as shown in Eq. 3d. The resulting encoding Y_ε ∈ R^{s×ω×v} is a tensor with the time-axis length matching the size of the sliding window ω, the same dimension as the input tensor.
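A minimal sketch of the single-head encoding stage described above, using PyTorch's stock TransformerEncoderLayer with the window axis as the feature dimension; treating the variables as the attended sequence is our assumption for illustration, not necessarily the authors' exact arrangement.

```python
import torch
import torch.nn as nn

# Single-head encoding sketch: the window axis acts as the feature dimension
# (d_model = window), so one attention head lets the window take any length.
s, window, v = 188, 7, 3
encoder_layer = nn.TransformerEncoderLayer(d_model=window, nhead=1,
                                           dim_feedforward=4 * window,
                                           batch_first=True)

Y_alpha = torch.rand(s, window, v)              # output of the first GSE layer
Y_eps = encoder_layer(Y_alpha.transpose(1, 2))  # (s, v, window): variables attend to one another
Y_eps = Y_eps.transpose(1, 2)                   # back to (s, window, v)
```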
The previous encoding will be decoded by two sequence-to-sequence layers, which, in this case, are Long Short-Term Memory (LSTM) [28] units. The decoder operates on two of the tridimensional axes of the encoding, the time axis and the variable axis, one at a time. During the time-axis decoding, the first recurrent unit translates the window-sized input into a stride-sized output, as in Eq. 4, in which v indexes the v-th variable of the t-th time-series group, and the weights W_o ∈ R^z are parameters to be learned. Along with Eq. 4, we refer to f as the forget gate's activation vector, i as the input and update gate's activation vector, o as the output gate's activation vector, C̃ as the cell input activation vector, C as the cell state vector, and h as the hidden state vector. The last hidden state vector goes through a dropout regularization ϕ before the next LSTM in the sequence.

The next recurrent unit decodes the variable axis from the partially-decoded encoding without changing the input dimension. The set of variables within the time-series does not necessarily imply a sequence; that does not interfere in the decoding process as long as the variables are kept in the same position. The second LSTM in the sequence, in which t indexes the t-th timestamp of the v-th variable group, works as in Eq. 5, where Ẏ_ε is the partially-decoded encoding, and the weights W_o ∈ R^z are parameters to be learned. The description of the notations within Eq. 4 holds for Eq. 5. The difference, besides the decoding target, is the residual connection with the partially-decoded encoding Ẏ_ε at the last hidden state vector after the dropout regularization ϕ. Finally, the output of the last recurrent unit Ẏ goes through the last GSE layer, so as to produce the non-linear output Y_ψ of REGENN.

REGENN operates on a tridimensional space shared between samples, timestamps, and variables. In such a space, it carries out a time-based optimization strategy. The training process iterates over the time-axis of the dataset, showing the network how the variables within a subset of time-series behave as time goes by, and later repeating the process through subsets of different samples. The network's weights are shared among the entire dataset and optimized towards the best generalization simultaneously across samples, timestamps, and variables. In this work, we used Adam [41], a gradient descent-based algorithm, to optimize the model. As the optimization criterion, we used the Mean Absolute Error (MAE), which is a generalization of the Support Vector Regression [42] with a soft-margin criterion, defined in Eq. 6 as the mean absolute deviation between the network's output Ŷ and the ground truth T, where Ω is the set of internal parameters of REGENN.

Because SARS-CoV-2 behaves as a streaming time-series, we adopted a transfer learning approach to train the network on that dataset only. Transfer learning shares knowledge across different domains by using pre-trained weights of another neural network. The approach we adopted, although different, resembles Online Deep Learning [43]. The main idea is to train the network on incremental slices of the time-axis, such that the pre-trained weights of a previous slice are used to initialize the weights of the network in the next slice. The purpose of this technique is not only to achieve better forecasting performance but also to show that REGENN is superior to other algorithms throughout the pandemic.
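A hedged sketch of the slice-wise transfer learning just described: the model is periodically retrained on the data available so far, warm-started from the weights of the previous slice. The helper names are hypothetical, and the 15-day schedule shown is the one reported later for the SARS-CoV-2 experiments.

```python
import copy
import torch

def train_in_slices(model, T, train_fn, slice_days=15, first_slice=45):
    """Retrains `model` on growing slices of the time axis of T (s x t x v),
    warm-starting each slice from the previous slice's weights. `train_fn`
    is a placeholder for the actual training routine; names are illustrative."""
    t = T.shape[1]
    previous_state = None
    for end in range(first_slice, t + 1, slice_days):
        if previous_state is not None:
            model.load_state_dict(previous_state)   # transfer pre-trained weights
        train_fn(model, T[:, :end, :])               # train on data up to `end` timestamps
        previous_state = copy.deepcopy(model.state_dict())
    return model
```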
The results are based on three datasets, all of which are multi-sample, multivariate, and vary over time. The first dataset describes the Coronavirus Pandemic, referred to as SARS-CoV-2, made available by Johns Hopkins University [44]. It describes 3 variables through 120 days for 188 countries, varying from the first day of the pandemic to the day it completed four months of duration. The second one is the Brazilian Weather dataset (available together with the implementation source code), collected from 253 sensors during 1,095 days and comprising 4 variables. The third dataset is from the 2012 PhysioNet Computing in Cardiology Challenge [45], from which we are using 9 variables across 48 hours recorded from 11,988 ICU patients.

Data Pre-processing. All datasets were normalized between zero and one along the variable axis. Such a pre-processing task is conventional to all types of learning algorithms but crucial to deep learning. By doing so, we speed up training and avoid gradient spikes during the training phase. However, the neural network output is transformed back to the initial scale before computing any of the evaluation metrics.

A simplistic yet effective approach to train time-series algorithms is through the Sliding Window technique [46], also referred to as Rolling Window. The window size is well known to be a highly sensitive hyperparameter [47], [48]. Consequently, we followed a non-tunable approach, in which we set the window size before the experiments, taking into consideration only the context and domain of the datasets. These values were used across all window-based experiments, including the baselines and ablation tests. It is noteworthy that most of the machine-learning algorithms are not meant to handle time-variant data, such that no sliding window was used in those cases. Conversely, we considered training timestamps as features and those reserved for testing as tasks of a multitask regression.

On the deep learning algorithms, we used a window size of 7 days for training and reserved 7 days for validation (between the training and test sets) to predict the last 14 days of the SARS-CoV-2 dataset. The 7-7-14 split idea comes from the disease incubation period, which is of 14 days. On the other hand, we used a window size of 84 days and reserved 28 days for validation to predict the last 56 days of the Brazilian Weather dataset. The 84-28-56 split is based on the seasonality of the weather data, such that we look at the previous 3 months (a weather-season window) to predict the last 2 months of the upcoming season. Finally, we used a window size of 12 hours for training and 6 hours for validation to predict the last 6 hours of the PhysioNet dataset. The 12-6-6 split comes from the fact that patients in ICUs are in a critical state, such that predictions within 24 hours are more useful than long-term predictions.

Many existing algorithms are limited because they support neither multitask nor multi-output regression, making them even more limited in tasks where the data is tridimensional. The most straightforward yet effective approach we followed to compare them to REGENN was to create a chain of ensembles. In such a case, each estimator makes predictions in the order specified by the chain, using all of the available features provided to the model plus the predictions of models that are earlier in the chain. The number of estimators in each experiment varies according to the type of the ensemble and the type of the algorithms, and the final performance is the average of each estimator's performance.
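The chained-ensemble wrapping described above resembles scikit-learn's RegressorChain. The sketch below shows how a single-target regressor could be chained so that each estimator sees the original features plus the predictions of earlier estimators, with training timestamps as features and test timestamps as targets; the shapes and the choice of LinearSVR are illustrative, not necessarily the exact pipeline used in the experiments.

```python
import numpy as np
from sklearn.multioutput import RegressorChain
from sklearn.svm import LinearSVR

# One variable of one dataset: past timestamps act as features, and the z test
# timestamps are the chained targets (sizes are illustrative only).
s, w, z = 188, 106, 14
X = np.random.rand(s, w)            # training timestamps as features
Y = np.random.rand(s, z)            # future timestamps as chained targets

chain = RegressorChain(LinearSVR(max_iter=10_000))
chain.fit(X, Y)
Y_hat = chain.predict(X)            # (s, z) predictions, one column per chained estimator
```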
For simplicity's sake, we grouped the algorithms into five categories, as follows: (i) tridimensional-compliant algorithms, requiring a single estimator; (ii) multivariate algorithms, with s estimators, one for each sample; (iii) multi-output and multitask algorithms, with v estimators, one per variable; (iv) single-target algorithms, with v×z estimators, one per variable and stride; and (v) univariate algorithms, with s×v estimators, one for each sample and variable. As time-series forecasting poses a time-aware regression problem, our goal remains to predict values that resemble the ground truth the most.

Hyperparameter Tuning. Unlike many neural networks, REGENN has only two hyperparameters able to change the size of the weights' tensors, which are the window size (i.e., input size) and the stride size (i.e., output size). As already discussed, both were set before the experiments, and neither was tuned towards any performance improvement. The trade-off of having fewer hyperparameters is spending more energy on training the network towards a better performance. We focus on the network optimizer, gradient clipping, learning rate scheduler, and dropout regularization when we refer to tunable hyperparameters. Along these lines, we followed a controlled and limited exploratory approach similar to a random grid-search, starting with PyTorch's defaults. The tuning process was carried out on the validation set, intentionally reserved for measuring the network's improvement. The tuning proceeds by updating the hyperparameters whenever better results are observed on the validation set, leading us to a set of optimized, but not optimal, hyperparameters. We used the set of optimized hyperparameters to evaluate the network on the test set. We used the default values for all the other algorithms [49]-[51] unless explicitly required for working with a particular dataset, as was the case of LSTNet [21], DSANet [23], and MLCNN [26]. The list of hyperparameters of REGENN and of the further deep-learning algorithms is in the supplementary material.

The experiments related to machine-learning and time-series algorithms were carried out on a Linux-based system with 64 CPUs and 750 GB of RAM for all the datasets. The experiments related to deep learning on the SARS-CoV-2 dataset were carried out on a Linux-based system with 56 CPUs, 256 GB of RAM, and 8 GPUs (Titan X Pascal). The Brazilian Weather and PhysioNet datasets were tested on a different system with 32 CPUs, 512 GB of RAM, and 8 GPUs (Titan X Maxwell). While CPU-based experiments are even across all CPU architectures, the same does not hold for GPUs, such that the GPU model and architecture must match to guarantee reproducibility. Aiming at complete reproducibility, we disclose not only the source code of REGENN on GitHub (https://bit.ly/2YDBrOo) but also the scripts, pre-processed data, and snapshots of all the networks trained, on a public folder at Google Drive (https://bit.ly/31x6CwN).

Subsequently, we go through the experimental results on each one of the benchmarking datasets. Because not all the algorithms performed evenly across the datasets, we display the 34 most prominent ones out of the 49 tested algorithms; for the extended results, see the supplementary material.
Additionally, we discuss the ablation experiments, which were carried out with REGENN's hyperparameters; in the supplementary material, we provide two other rounds of this same experiment using as hyperparameters PyTorch's defaults (available at https://bit.ly/2QuWRsD) and others recurrently employed in the literature. At the end of each experiment, we draw explanations from the Evolution Weights, i.e., the intermediate adjacency matrices from the GSE layers, by using the cosine similarity on pairs of co-occurring variables.

The SARS-CoV-2 dataset has been updated daily since the beginning of the Coronavirus Pandemic. We used a self-to-self transfer-learning approach to train the network in slices of time due to such a dataset's streaming nature. In short, the network was re-trained every 15 days with new incoming data, using as starting weights the pre-trained weights from the network trained in the previous 15 days.

Fig. 5: Results presented in descending order of the MAE metric (i.e., from worst to best performance); the algorithms are symbol-encoded according to their number of estimators. We use gray arrows to describe the standard deviation of the results; the negative deviation, which is equal to the positive one, was suppressed for better visualization. The results confirmed REGENN's superior performance, as it is the algorithm with the lowest error and standard deviation compared to the others, such that the improvement in the experiment was no lower than 64.87%.

As a result of the analysis of the dataset in time-slices, we were able to notice that, as time goes by and more information becomes available on the SARS-CoV-2 dataset, the problem becomes more challenging to solve by looking individually at each country, and more natural when looking at all of them together. Although countries have their particularities, which make the disease spread in different ways, the main goal is to decrease the spreading, such that similarities between the historical data of different countries provide for finer predictions. Furthermore, we also observed that not all the estimators within an ensemble perform in the same way in the face of different countries. Due to REGENN's capability of observing inter- and intra-temporal relationships between time-series, it performs better on highly uncertain cases like this one.

Subsequently, we present the ablation results, in which we utilized the same data-flow as REGENN but no GSE layer, while systematically changing the decoder architecture. We provide results using different Recurrent Units (RU), which are the Elman RNN [52], LSTM [28], and GRU [22]. We also varied the directional flag of the recurrent unit between Unidirectional (U) and Bidirectional (B). That is because a unidirectional recurrent unit tracks only forward dependencies, while a bidirectional one tracks both forward and backward dependencies. Additionally, the network architecture of each test is described by a summarized tag. For example, given the architecture (E → U RU + B RU)+AR, the model has a Transformer Encoder as the encoder, a Unidirectional Recurrent Unit as the time-axis decoder, and a Bidirectional Recurrent Unit as the variable-axis decoder. Besides, the output of the decoder is element-wise added to the Autoregression (AR) output. Furthermore, the table shows results with and without the encoder and the Autoregression component, as well as cases using a single recurrent unit only for time-axis decoding.
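A small sketch of how the ablation variants could be instantiated, swapping the recurrent-unit type and its directional flag while keeping the rest of the data-flow fixed; the factory function and sizes are hypothetical.

```python
import torch.nn as nn

# Map the ablation tags to PyTorch recurrent units: Elman RNN, LSTM, or GRU,
# optionally bidirectional. Sizes are illustrative.
RECURRENT_UNITS = {"RNN": nn.RNN, "LSTM": nn.LSTM, "GRU": nn.GRU}

def make_decoder_unit(kind: str, input_size: int, hidden_size: int,
                      bidirectional: bool = False) -> nn.Module:
    return RECURRENT_UNITS[kind](input_size=input_size, hidden_size=hidden_size,
                                 batch_first=True, bidirectional=bidirectional)

# e.g., the "(E -> U RU + B RU)+AR" tag pairs a unidirectional and a bidirectional unit:
time_axis_unit = make_decoder_unit("LSTM", input_size=7, hidden_size=14)
variable_axis_unit = make_decoder_unit("LSTM", input_size=14, hidden_size=14, bidirectional=True)
```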
According to the ablation results detailed in Tab. 2, one can observe that the improvement of REGENN is slightly smaller than previously reported. That is because its performance does not come only from the GSE layer but also from how the network handles the multiple multivariate time-series data. Consequently, the ablation experiments reveal that some models without GSE layers are already enough to surpass all the competing algorithms. However, when using REGENN, we can improve them further and achieve an additional reduction of 20.81% on the MAE, 19.77% on the RMSE, and 35.72% on the MSLE.

Fig. 6 shows the Evolution Weights originated from applying the cosine similarity to the hidden adjacency matrices of REGENN. When comparing the input and evolved graphs, the number of cases and deaths has a mild similarity. That might come from the fact that, at the beginning of the pandemic, diagnosing infected people was already a broad concern. The problem did not go away, but more infected people were discovered as more tests were made, and also because the disease spread worldwide. A similar scenario can be drawn from the number of recovered and the number of cases, as infected people with mild or no symptoms were unaware of being infected. Contrarily, we can see that the similarity between the recovered and deaths decreases over time, which comes from the fact that, as more tests are made, the mortality rate drops to a stable threshold due to the increased number of recovered people.

The Brazilian Weather dataset is a highly seasonal dataset with a higher number of samples, variables, and timestamps than the previous one. For simplicity's sake, in this experiment, REGENN was trained on the whole training set at once. The results are in Fig. 7, in which REGENN was the first-placed algorithm, followed by the Elman RNN in second. REGENN overcame the Elman RNN by 11.95% on the MAE, 11.96% on the RMSE, and 25.84% on the MSLE. We noticed that all the algorithms perform quite similarly on this dataset. The major downside for most algorithms comes from predicting small values that are close to zero, as noted by the MSLE results. In such a case, the ensembles showed a high variance when compared to REGENN. We believe this is why the Elman RNN shows performance closer to REGENN rather than to Exponential Smoothing, the third-placed algorithm, as REGENN has a single estimator, while the Exponential Smoothing is an ensemble of estimators. Another explanation for why some algorithms underperform on the MSLE might be related to their difficulty in tracking temporal dependencies, which include the weather seasonality.

The ablation results are in Tab. 3, in which we observed again that the network without the GSE layers already surpasses the baselines. When decommissioning the GSE layers of REGENN and using GRU instead of LSTM on the decoder, we observed a 3.86% improvement on the MAE, 10.02% on the RMSE, and 25.34% on the MSLE when compared to the Elman RNN results. Using REGENN instead, we achieve a further performance gain of 8.42% on the MAE and 2.16% on the RMSE over the ablation experiment.

Fig. 8 depicts the Evolution Weights for the current dataset, in which we can observe a consistent similarity between pairs of variables in the input graph, which does not repeat in the evolved graph, implying different relationships. On the evolved graph, we observe that the similarity between all pairs of variables increased as the graph evolved.
The pairs Solar Radiation and Rain, Maximum Temperature and Rain, and Solar Radiation and Minimum Temperature stood out. Those pairs are mutually related, which comes from solar radiation interfering in both maximum and minimum temperature and also in the precipitation factors, where the opposite relation holds. What can be extracted from the Evolution Weights, in this case, is the notion of importance between pairs of variables, so that the pairs that stood out are more relevant and provide better information to predict the forthcoming values of the variables in the dataset.

Fig. 7: Baselines results for the Brazilian Weather dataset ordered from worst to best MAE performance. Along with the image, the algorithms are symbol-encoded based on their type and number of estimators, and we use gray arrows to report the standard deviation of the results; the negative deviation, which is equal to the positive one, was suppressed for improved readability. In such an experiment, REGENN once more outperformed all the competing algorithms, demonstrating versatility by performing well even on a highly-seasonal dataset with an improvement no lower than 11.95%. In the face of seasonality, the Elman RNN surpassed the Exponential Smoothing, the previously second-placed algorithm.

The PhysioNet dataset presents a large number of samples and an increased number of variables, but little information on the time axis, a setting in which ensembles still struggle to make accurate predictions, as depicted in Fig. 9. Once again, REGENN keeps steady as the first-placed algorithm in performance, showing solid improvement over the Linear SVR, the second-placed algorithm. The improvement was 7.33% on the MAE and 35.13% on the MSLE, while the RMSE achieved by REGENN laid within the standard deviation of the Linear SVR, pointing to an equivalent performance between them. The Linear SVR is an ensemble with multiple estimators, while REGENN uses a single one, which makes it more accurate and more straightforward for dealing with the current dataset.

As in Tab. 4, the ablation results reveal that a neural network architecture without the GSE layers can achieve a better performance than the baseline algorithms. In this specific case, we see that by using a bidirectional LSTM instead of a unidirectional one on the decoder module of the neural network, we can achieve a performance almost as good as REGENN, but not enough to surpass it, as REGENN still shows an improvement of 1.05% on the MAE and 0.98% on the RMSE over the ablation experiment with the bidirectional LSTM.

In this specific case, REGENN learns by observing multiple ICU patients. However, one cannot say that an ICU patient's state is somehow connected to another patient's state. Contrarily, the idea holds as in the first experiment, where, although the samples are different, they have the same domain, variables, and timestream, such that the information from one sample might help enhance future forecasting for another one. That means REGENN learns both from the past of the patient and from the past of other patients. Nevertheless, we must be careful about drawing any understanding from these results, as the reason each patient is in the ICU is different, and while some explanations might be suited for a small set of patients, they tend not to generalize to a significant number of patients. When analyzing the Evolution Weights in Fig. 10 aided by a physician,
we can say that there is a relationship between the amount of urine excreted by a patient and the arterial blood pressure, and also that there is a relation between the systolic and diastolic blood pressure. However, even aided by the Evolution Weights, we cannot further describe these relations, since there are variables of the biological domain that are not being taken into consideration.

The Evolution Weights are intermediate weights of the representation-learning process (see Fig. 4), which are optimized throughout the network's training. Such weights are time-invariant and are a requirement for the feature-learning behind the GSE layer. Although time does not flow through the adjacency matrix, the network is optimized as a whole, such that every operation influences the gradients resulting from the backward propagation process. That means that the optimizer, influenced by the gradients of both time-variant and time-invariant data, will optimize all the weights towards a better forecasting ability. Such a process depends not only on the network architecture but also on the reliability of the optimization process. That increases uncertainty, which is the downside of REGENN, demanding more time to train the neural network and causing the improvement not to be strictly increasing. Consequently, training might take long sessions, even with consistently reduced learning rates on plateaus or simulated annealing techniques; this is influenced by the fact that the second GSE layer has two dynamic inputs, which arise from the graph-evolution process. However, we observed that, through the epochs, the Evolution Weights reach a stable point with no major updates, and as a result, the network demonstrates a remarkable improvement in its last iterations, when the remaining weights converge more intensely to a near-optimal configuration.

Fig. 9: Baseline results for the PhysioNet dataset arranged from the worst to the best MAE performance, in which REGENN was the first-placed algorithm, followed by the Linear SVR in second. The improvement from one to the other was no lower than 7.33%, but, in this case, REGENN yielded an RMSE compatible with the Linear SVR. Along with the image, the algorithms are symbol-encoded based on type and number of estimators, and gray arrows depict the standard deviation of the results; the negative deviation, which is equal to the positive one, was suppressed to provide a better visualization of the results.

Fig. 10: Evolution Weights extracted from REGENN after training on the PhysioNet dataset, in which we use cosine similarity to compare the relationship between pairs of variables. We use "ABP" as a shortening for Arterial Blood Pressure, "NI" as Non-Invasive, "Dias" as Diastolic, and "Sys" as Systolic.

Even though REGENN has a particular drawback, it shows excellent versatility, which comes from its superior performance in the tasks of epidemiology modeling on the SARS-CoV-2 dataset, climate forecasting on the Brazilian Weather dataset, and patient monitoring in intensive care units on the PhysioNet dataset. Consequently, we see REGENN as a tool to be used in data-driven decision-making tasks, helping prevent, for instance, natural disasters, or during the preparation for an upcoming pandemic. As a precursor in multiple multivariate time-series forecasting, there is still much to be improved.
For example, reducing the uncertainty that harms REGENN without decreasing its performance should be the first step, followed by extending the proposal to handle problems in the spatio-temporal field, of great interest to traffic forecasting and environmental monitoring. Another possibility would be to remove the recurrent layers within the decoder while tracking the temporal dependencies through multiple graphs, which would provide a whole new way of temporal modeling. Notwithstanding, in some cases where extensive generalization is not required, the analysis of singular multivariate time-series may be preferred to multiple multivariate time-series. That is because, when focusing on a single series at a time, some but not all samples might yield a lower forecasting error, as the model will be driven to a single multivariate sample. However, both approaches for tackling time-series forecasting can coexist in the state-of-the-art, and, as a consequence, the decision to work on a higher or lower dimensionality must relate to which problem is being solved and how much data is available to solve it.

This paper tackled multiple multivariate time-series forecasting tasks by proposing the Recurrent Graph Evolution Neural Network (REGENN), a graph-based time-aware auto-encoder powered by a pair of Graph Soft Evolution (GSE) layers, a further contribution of this study that stands for a graph-based learning-representation layer. The literature handles multivariate time-series forecasting with outstanding performance, but up to this point, we lacked a technique with increased generalization over multiple multivariate time-series with sound performance. Previous research might have avoided tackling such a problem because a neural network to that matter is challenging to train and usually yields poor results. That is because one aims to achieve good generalization on future observations for multivariate time-series that do not necessarily hold the same data distribution. Because of that, REGENN is a precursor in multiple multivariate time-series forecasting, and even though this is a challenging problem, REGENN surpassed all baselines and remained effective after three rounds of 30 ablation tests through distinct hyperparameters. The experiments were carried out over the SARS-CoV-2, Brazilian Weather, and PhysioNet datasets, showing improvements, respectively, of at least 64.87%, 11.96%, and 7.33%. As a consequence of the results, REGENN opens a new range of possibilities in time-series forecasting, starting by demonstrating that ensembles perform worse than a single model that understands the entanglement between different variables by looking at how variables interact as time goes by and multiple multivariate time-series evolve.

This work was partially supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001; Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), through grants 2014/25337-0, 2016/17078-0, 2017/08376-0, 2019/04461-9, and 2020/07200-9; Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), through grants 167967/2017-7 and 305580/2017-5; National Science Foundation awards IIS-1418511, CCF-1533768, and IIS-1838042; and the National Institutes of Health awards NIH R01 1R01NS107291-01 and R56HL138415. We thank Jeffrey Valdez for his aid with Sunlab's computer infrastructure, Lucas Scabora for his careful review of the paper, and Gustavo Merchan, M.D., for his analysis of the Evolution Weights on the PhysioNet dataset.
Baselines Notes. Table 1 (supplementary) lists the acronym and full name of all algorithms we tested during the baselines computation. Tables 2 to 6 present detailed information from the experiments discussed in the main manuscript. The following tables regard the tests using Transfer Learning on the SARS-CoV-2 dataset, in which a new network was trained every 15 days, starting 45 days after the pandemic began and up to 120 days of its duration.

Cosine Similarity. The cosine similarity, which has been widely applied in learning approaches, accounts for the similarity between two non-zero vectors based on their orientation in an inner product space [1]. The underlying idea is that the similarity is a function of the cosine of the angle θ between the vectors. Hence, when cos θ = 1, the two vectors in the inner product space have the same orientation; when cos θ = 0, these vectors are oriented at 90° relative to each other; and when cos θ = -1, the vectors are diametrically opposed. The cosine similarity between the vectors u and v is defined as cos θ(u, v) = (u · v) ÷ (‖u‖ ‖v‖), where u · v denotes the dot product between u and v, ‖u‖ represents the norm of the vector, ‖u‖ = √(u · u), and u_i is the i-th variable of the object represented by the vector.

In the scope of this work, the cosine similarity is used to build similarity adjacency matrices, which measure the similarity between all nodes in a variables' co-occurrence graph. The similarity between two nodes in the graph describes how likely those two variables are to co-occur at the same time for a time-series. In this case, the similarity ends up acting as an intermediate activation function, enabling the graph evolution process by maintaining the similarity of the relationships between pairs of nodes. In such a particular case, we define the cosine-matrix similarity as cos θ(A) = (A · A^T) ÷ (‖A‖ ‖A^T‖), where A · A^T denotes the dot product between the matrix A and its transpose A^T, while ‖A‖ represents the norm of that same matrix with respect to any of its ranks, as we consider A to be a squared adjacency matrix.
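A minimal sketch of a cosine matrix-similarity, comparing every pair of rows of a square adjacency matrix; normalizing row-wise is our assumption about how the Evolution Weights heatmaps could be computed, not a statement of the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def cosine_matrix_similarity(A: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between the rows of a square adjacency matrix,
    yielding a v x v matrix with entries in [-1, 1]."""
    A_norm = F.normalize(A, p=2, dim=1)   # unit-norm rows
    return A_norm @ A_norm.t()

# Example: compare the similarity structure of the input and evolved graphs.
# sim_input, sim_evolved = cosine_matrix_similarity(A_mu), cosine_matrix_similarity(A_phi)
```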
Horizon Forecasting. Horizon forecasting stands for an approach used for making non-continuous predictions by accounting for a future gap in the data. It is useful in a range of applications, considering, for instance, that recent data is not available or is too costly to collect. Thereby, it is possible to optimize a model that disregards the near future and focuses on the far-away future. However, such an approach abdicates from additional information that could be learned from continuous timestamp predictions [2]. By not considering the near past as a variable that influences the near future, we might end up with a non-stochastic view of time, meaning that the algorithm focuses on long-term dependencies rather than both long- and short-term dependencies. Along these lines, both LSTNet [3] and DSANet [4] comply with horizon forecasting, and to make our results comparable, we set the horizon to one on both of them. Thus, we started assessing the test results right after the algorithms' last validation step because, the closer to the horizon, the more accurate these models should be.

A simplistic yet effective approach to train time-series algorithms is through the Sliding Window technique [5], which is also referred to as Rolling Window (see Fig. 1). Such a technique fixes a window size, which slides over the time axis, predicting a predefined number of future steps, referred to as the stride. Some studies on time-series have been using a variant technique known as the Expanding Sliding Window [6, 7]. This variant starts with a prefixed window size, which grows as it slides, showing more information to the algorithm as time goes by. ReGENN holds to the traditional technique as it is bounded to the dimensions of the tensor weights. Those dimensions are of a preset size and cannot be effortlessly changed during training, as that comes with the increased uncertainty of continuously changing the number of internal parameters, such that a conventional neural network optimizer cannot handle it properly. Nevertheless, the window size of the Sliding Window is well known to be a highly sensitive hyperparameter [8, 9]; to avoid an increased number of parameters, we followed a non-tunable approach, in which we set the window size before the experiments taking into consideration the context of the datasets; such values were even across all window-based trials, including the baselines and ablation.

Optimization Strategy. ReGENN operates on a three-dimensional space shared between samples, time, and variables. In such a space, it carries out a time-based optimization strategy. The training process iterates over the time-axis of the dataset, showing the network how the variables within a subset of time-series behave as time goes by, and later repeating the process through subsets of different samples. The network's weights are shared among the entire dataset and optimized towards the best generalization simultaneously across samples, time, and variables. The dataset T ∈ R^{s×t×v} is sliced along the time axis into a training portion T ∈ R^{s×w×v}, holding the first w timestamps, and a testing portion T ∈ R^{s×z×v}, holding the remaining z timestamps. Once the data is sliced, we follow by using a gradient descent-based algorithm to optimize the model. In the scope of this work, we used Adam [10] as the optimizer, as it is the most common optimizer among time-series forecasting problems. As the optimization criterion, we used the Mean Absolute Error (MAE), which is a generalization of the Support Vector Regression [11] with a soft-margin criterion, formalized as the minimization of ½‖w‖²_F + C Σ_{i=1}^{n} (ξ_i + ξ*_i) subject to the residuals of each pair ⟨y_i, x_i⟩ lying within the margin ρ relaxed by the slack variables ξ_i and ξ*_i, where w is the set of optimizable parameters, ‖·‖_F is the Frobenius norm, and both C and ρ are hyperparameters. The idea, then, is to find w that best fits ⟨y_i, x_i⟩ ∀ i ∈ [1, n], where ξ_i and ξ*_i account for the two farthest opposite points in the dataset. A similar formulation of the Linear SVR implementation for horizon forecasting was presented by Lai et al. [3]. Due to the higher dimensionality of the multiple multivariate time-series used in this study, in which we consider time to be continuous, the problem becomes a minimization over Ω, the set of internal parameters of ReGENN, of the deviation between the output of the network Ŷ and the ground truth T. When disregarding C and setting ρ to zero, we can reduce the problem to the MAE loss formulation, L(Ω) = (1 ÷ (s · z · v)) Σ_{i,j,k} |Ŷ_{i,j,k} - T_{i,j,k}|. Square- and logarithm-based criteria can also be used with ReGENN. We avoid doing so, as this is a decision that should be made based on each dataset. Contrarily, we follow the SVR path towards the evaluation of absolute values, which is less sensitive to outliers and enables ReGENN to be applied to a range of applications.
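In code, the reduced criterion is simply the mean absolute error between the network output and the ground truth; the snippet below is a minimal sketch using PyTorch's L1Loss, with illustrative shapes.

```python
import torch

# MAE criterion after dropping C and setting rho to zero: the mean absolute
# difference between prediction and ground truth, both (samples, stride, variables).
criterion = torch.nn.L1Loss()          # mean absolute error

Y_hat = torch.rand(188, 14, 3)         # illustrative shapes
T_true = torch.rand(188, 14, 3)
loss = criterion(Y_hat, T_true)        # scalar MAE used for backpropagation
```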
Transfer-Learning Approach. We adopted a transfer learning approach to train the network on the SARS-CoV-2 dataset that, although different, resembles Online Deep Learning [12]. The idea is to train the network on incremental slices of the time-axis, such that the pre-trained weights of a previous slice are used to initialize the weights of the network in the next slice (see Fig. 2). The purpose of this technique is not only to achieve better performance for the network but also to show that ReGENN is useful throughout the pandemic. Hyperparameter adjustment is usually required when transferring the weights from one network to another, mainly the learning rate; for the list of hyperparameters we used, see Tab. 3. Besides, we deliberately applied a 20% Dropout on all tensor weights outside the network architecture and before starting the training. The aim behind that decision was to insert randomness in the pipeline and avoid local optima. It is worth mentioning that we did not observe any decrease in performance, but, in some cases, the optimizer's convergence was slower.

Baselines Algorithms. Open-source Python libraries provided the time-series and machine learning algorithms used along with the experiments. Time-series algorithms came from statsmodels, while the machine learning ones came majorly from Scikit-Learn. Further algorithms, such as XGBoost, LGBM, and CatBoost, have a proprietary, open-source implementation, which was preferred over the others. We used the default hyperparameters over all the experiments, performing no fine-tuning. However, because all the datasets we tested are strictly positive, we forced all negative outputs to become zero, as done by a ReLU activation function. A list with the names of the algorithms tested along with the experiments is provided in Tab. 1, which contains more algorithms than we reported in the main paper. That is because we are listing all algorithms, even the ones that were removed from the pipeline due to being incapable of working with the input data and yielding exceptions.

Supplementary tables report the hyperparameters for the ablation tests and detailed per-slice results, including Table 9 with the detailed results for the first three slices of the SARS-CoV-2 dataset. Legend: algorithms with the best performance are in bold, entries noted as "-" yielded exceptions, and entries marked "***" were suppressed due to poor performance.
Effects of climate and land-use changes on fish catches across lakes at a global scale
Rising river flows throughout the twenty-first century in two Himalayan glacierized watersheds
The impact of climate change and glacier mass loss on the hydrology in the Mont-Blanc massif
Stock market prediction using optimized deep-ConvLSTM model
Stock market analysis using candlestick regression and market trend prediction (CKRM)
Robust Landsat-based crop time series modelling
Continuous monitoring of land disturbance based on Landsat time series
Generic and scalable framework for automated time-series anomaly detection
Multivariate time series anomaly detection: A framework of Hidden Markov Models
A review on time series forecasting techniques for building energy consumption
Multi-layer representation learning for medical concepts
Temporal phenotyping of medically complex children via PARAFAC2 tensor factorization
Patient trajectory prediction in the MIMIC-III dataset, challenges and pitfalls
TASTE: Temporal and static tensor factorization for phenotyping electronic health records
An interpretable mortality prediction model for COVID-19 patients
On the responsible use of digital data to tackle the COVID-19 pandemic
Temporal dynamics in viral shedding and transmissibility of COVID-19
Temporal aggregation of univariate and multivariate time series models: A survey
Artificial neural networks: A tutorial
A simulation study of artificial neural networks for nonlinear time-series forecasting
Modeling long- and short-term temporal patterns with deep neural networks
Empirical evaluation of gated recurrent neural networks on sequence modeling
DSANet: Dual self-attention network for multivariate time series forecasting
DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction
Attention is all you need
Towards better forecasting by fusing near and distant future visions
Gradient-based learning applied to document recognition
Long Short-Term Memory
Recurrent Neural Networks for Multivariate Time Series with Missing Values
GeoMAN: Multi-level attention networks for geo-sensory time series prediction (IJCAI-18, International Joint Conferences on Artificial Intelligence Organization)
Semi-supervised classification with graph convolutional networks
T-GCN: A temporal graph convolutional network for traffic prediction
Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting
A dual-stage attention-based recurrent neural network for time series prediction
Multivariate time series forecasting via attention-based encoder-decoder framework
Dropout: A simple way to prevent neural networks from overfitting
Highway networks
Deep residual learning for image recognition
Layer normalization
Adam: A method for stochastic optimization
Support vector method for function approximation, regression estimation and signal processing
Online deep learning: Learning deep neural networks on the fly
An interactive web-based dashboard to track COVID-19 in real time
Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012
Segmenting time series: A survey and novel approach
Input window size and neural network predictors
Time Series Prediction and Neural Networks
Scikit-learn: Machine learning in Python
XGBoost: A scalable tree boosting system
CatBoost: Unbiased boosting with categorical features
Finding structure in time
Data mining introduction
Towards better forecasting by fusing near and distant future visions
Modeling long- and short-term temporal patterns with deep neural networks
DSANet: Dual self-attention network for multivariate time series forecasting
Segmenting time series: A survey and novel approach
Picture fuzzy time series: Defining, modeling and creating a new forecasting method
Machine learning time series regressions with an application to nowcasting
Input window size and neural network predictors
Time Series Prediction and Neural Networks
Adam: A method for stochastic optimization
Support vector method for function approximation, regression estimation and signal processing
Online deep learning: Learning deep neural networks on the fly