title: Multivariate Time Series Forecasting with Latent Graph Inference
authors: Satorras, Victor Garcia; Rangapuram, Syama Sundar; Januschowski, Tim
date: 2022-03-07

This paper introduces a new approach for multivariate time series forecasting that jointly infers and leverages relations among time series. Its modularity allows it to be integrated with current univariate methods. Our approach allows us to trade off accuracy and computational efficiency gradually, offering on one extreme inference of a potentially fully connected graph and on the other extreme a bipartite graph. The potentially fully connected case considers all pairwise interactions among time series, which yields the best forecasting accuracy. Conversely, the bipartite case leverages the dependency structure by inter-communicating the N time series through a small set of K auxiliary nodes that we introduce. This reduces the time and memory complexity w.r.t. previous graph inference methods from O(N^2) to O(NK) with a small trade-off in accuracy. We demonstrate the effectiveness of our model on a variety of datasets where both of its variants perform better than or very competitively with previous graph inference methods in terms of forecasting accuracy and time efficiency.

Time Series Forecasting (TSF) has been widely studied due to its practical significance in a variety of applications such as climate modelling (Mudelsee, 2019), supply chain management in retail (Böse et al., 2017; Larson, 2001), market analysis in finance (Andersen et al., 2005), traffic control (Li et al., 2017) and medicine (Kaushik et al., 2020), among others. In TSF, given a sequence of data points indexed over time, we aim to estimate its future values based on previously observed data. Data is often multivariate, meaning that multiple variables vary over time; each variable may depend not only on its own historical values but also on the past of other variables. Efficiently modelling the dependencies among these variables is still an open problem.

Multivariate Time Series (MTS) methods aim to leverage dependencies between variables in order to improve forecasting accuracy. A natural way to model non-linear dependencies in deep learning is via Graph Neural Networks (GNNs) (Bruna et al., 2013; Defferrard et al., 2016; Kipf & Welling, 2016). In fact, GNNs have been successfully applied to MTS forecasting (Chen et al., 2020a; Li et al., 2017; Yu et al., 2017), leveraging the relations among different series. But these methods require a pre-defined adjacency matrix, which may only be available in some specific datasets, for example traffic datasets, where it can be constructed from the spatial structure of a city. More recently, a family of methods that do not require a pre-defined adjacency matrix has been proposed (Franceschi et al., 2019; Shang et al., 2021; Wu et al., 2019, 2020). In this case, a latent graph representation is inferred while forecasting, allowing these methods to operate on a larger variety of datasets. Our work belongs to this category. However, inferring all pairwise relations may come at a higher computational cost, as the number of relations scales quadratically, O(N^2), w.r.t. the number of nodes/time series. This limits scalability to large datasets. Additionally, the above-mentioned latent graph inference methods perform message passing updates at every time step, which is also expensive.
To overcome these limitations, we propose a new latent graph inference algorithm for MTS forecasting that is more efficient than previous algorithms while achieving better or competitive performance. We cast the latent graph inference as a modular and easy-to-implement extension to current univariate models. The graph is dynamically inferred for each time series input, allowing a more flexible representation than a static graph for the whole dataset. Additionally, we can reduce the complexity from O(N^2) (fully connected assumption) to O(NK) (bipartite assumption), where K ≪ N, for a small trade-off in performance.

In time series forecasting we aim to estimate the future values x_{t+1:T} of a time series given its past x_{t0:t}, where t_0 ≤ t ≤ T indexes over time, and (optionally) some context information c. For the multivariate case we assume the time series is composed of N variates such that x_{t0:T} = {x_{1,t0:T}, ..., x_{N,t0:T}} ∈ R^{N×(T−t0+1)}. In this section we distinguish two main categories of time series forecasting methods: global univariate and multivariate.

Global univariate methods: In this case we only use the past of each univariate series to predict its future. However, the model weights θ_u are shared across all univariate time series. More formally:

x̂_{i,t+1:T} = f_u(x_{i,t0:t}, c_i; θ_u),   (1)

where x̂ denotes the estimated values, i ∈ {1, ..., N} indexes over variates, and f_u(·) is the estimator function with learnable parameters θ_u shared across time series. Conditioning only on the past of each univariate series may limit the performance of the forecasting algorithm compared to multivariate methods. Despite that, it simplifies the design of f_u and already provides reasonable results. A popular example of a global univariate model is DeepAR (Salinas et al., 2020).

Multivariate methods: Multivariate methods condition on all past data (all N variates) and directly predict the multivariate target. More formally:

x̂_{t+1:T} = f_m(x_{t0:t}, c).   (2)

Different variables may be correlated and/or depend on the same confounders. For example, in retail forecasting, PPE masks and antibacterial soaps jointly increased in demand during the early days of the COVID-19 pandemic. In traffic forecasting, an increase of the outgoing traffic flow in a given neighborhood may result in an increase of the incoming traffic flow in another one. Modelling these dependencies may improve forecasting accuracy, but it may come at the cost of higher complexity and hence more expensive algorithms, especially when trying to model all pairwise interactions between variates.

Graph Neural Networks (GNNs) (Bruna et al., 2013; Defferrard et al., 2016; Kipf & Welling, 2016) operate directly on graph-structured data. They have gained a lot of attention in recent years due to their success in a large variety of domains that benefit from modelling interactions between different nodes/entities. In the context of multivariate time series, GNNs can be used to model the interactions between time series. In this work we consider the type of GNN introduced by Gilmer et al. (2017). Given a graph G = (V, E) with nodes v_i ∈ V and edges e_ij ∈ E, we define a graph convolutional layer as:

m_ij = φ_e(h_i^l, h_j^l),   m_i = Σ_{j≠i} α_ij · m_ij,   h_i^{l+1} = φ_h(h_i^l, m_i),   (3)

where φ_e and φ_h are the edge and node functions, usually approximated as Multi Layer Perceptrons (MLPs), h_i^l ∈ R^{nf} is the nf-dimensional embedding of node v_i at layer l, and m_ij is the edge embedding that propagates information from node v_j to v_i. A GNN is constructed by stacking multiple of these graph convolutional layers, h^{l+1} = GCL[h^l, E].
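To make the layer in (3) concrete, the following is a minimal PyTorch sketch of such a graph convolutional layer, including the edge-inference weights α_ij = φ_α(m_ij) discussed in the next paragraph. Hidden widths, the Swish activation and the residual node update are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GCL(nn.Module):
    """Sketch of the graph convolutional layer in Eq. (3):
    m_ij = phi_e(h_i, h_j);  m_i = sum_{j != i} alpha_ij * m_ij;  h_i' = phi_h(h_i, m_i)."""
    def __init__(self, nf):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(2 * nf, nf), nn.SiLU(), nn.Linear(nf, nf))
        self.phi_alpha = nn.Sequential(nn.Linear(nf, 1), nn.Sigmoid())   # edge inference alpha_ij
        self.phi_h = nn.Linear(2 * nf, nf)                               # node update (residual added below)

    def forward(self, h, edge_index):
        # h: (num_nodes, nf); edge_index: (2, num_edges) with rows (receiver i, sender j)
        row, col = edge_index
        m_ij = self.phi_e(torch.cat([h[row], h[col]], dim=-1))           # per-edge messages
        alpha = self.phi_alpha(m_ij)                                     # soft edge weights in (0, 1)
        m_i = torch.zeros_like(h).index_add_(0, row, alpha * m_ij)       # gated aggregation per receiver
        return h + self.phi_h(torch.cat([h, m_i], dim=-1))               # residual node update
```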
Additionally, in (3) we include α_ij ∈ (0, 1), a scalar value that performs the edge inference or attention over the neighbors, similarly to Veličković et al. (2017). We choose this value to be computed as the output of a function α_ij = φ_α(m_ij), where φ_α is composed of just a linear layer followed by a sigmoid activation function.

Time series forecasting has been extensively studied in the past due to its practical significance, with a number of recent overview articles available (Benidis et al., 2020; Lim & Zohren, 2021; Petropoulos et al., 2022). Traditionally, most classical methods are univariate in nature (see e.g. Hyndman & Athanasopoulos (2017) for an overview). While some of these have multivariate extensions (e.g., ARMA and VARMA models), they are limited by the amount of related time series information they can incorporate. Dynamic factor models (Geweke, 1977; Wang et al., 2019a) are forerunners of a family of models that has recently received more attention, the so-called global models (Januschowski et al., 2019; Montero-Manso & Hyndman, 2022); see Section 2.1. These global models estimate their parameters over an entire panel of time series, but still produce a univariate forecast. Many such global models have been proposed building on neural network architectures like RNNs (Bandara et al., 2019; Liberty et al., 2020; Salinas et al., 2020), CNNs (Chen et al., 2020b; Wen et al., 2017), Transformers (Eisenach et al., 2020; Li et al., 2019), and also combining classical probabilistic models with deep learning (Kurle et al., 2020; Rangapuram et al., 2018). However, unlike our method, these global models do not explicitly model the relationship between the time series in the panel. Most recently, global multivariate forecasting models have received attention, in particular models that attempt to capture the relationship of the time series via a multivariate likelihood (Rasul et al., 2020, 2021; Salinas et al., 2019). A complementary approach consists of capturing the multivariate nature of many modern forecasting problems primarily by using a multivariate time series as input. For this, a natural way to model and exploit the relationship between time series is via GNNs (Bruna et al., 2013; Defferrard et al., 2016; Kipf & Welling, 2016). Even in those scenarios where the adjacency of the graph is not explicitly provided, attention or a latent graph can be inferred from the node embeddings such that GNNs can still leverage the structure of the data. Some examples of latent graph inference or attention are (Wang et al., 2019b) in point clouds, (Franceschi et al., 2019) in semi-supervised graph classification, (Ying et al., 2018) in hierarchical graph representation learning, (Kipf et al., 2018) in modelling dynamical systems, (Kazi et al., 2020) in zero-shot learning and 3D point cloud segmentation, (Garcia & Bruna, 2017; Kossen et al., 2021) in image classification, (Cranmer et al., 2020) in inferring symbolic representations, and (Fuchs et al., 2020; Satorras et al., 2021) in molecular property prediction. In MTS forecasting, Lai et al. (2018) and Shih et al. (2019) provide some of the first deep learning approaches designed to leverage pairwise dependencies among time series. More recent methods (Li et al., 2017; Seo et al., 2018; Yu et al., 2017; Zhao et al., 2019) are built at the intersection of GNNs and time series forecasting, but they require a pre-defined adjacency matrix.
Lately, new methods that infer a latent graph from the node embeddings have been introduced in MTS forecasting (Cao et al., 2021; Shang et al., 2021; Wu et al., 2020) and in MTS anomaly detection (Deng & Hooi, 2021; Zhang et al., 2020). These methods can be applied to any dataset, even when there is no explicitly defined adjacency matrix. Their main limitation is the computational cost of the edge inference, which scales quadratically, O(N^2), w.r.t. the number of nodes/time series N, or even O(N^3) in (Cao et al., 2021; Wu et al., 2020). Our approach is related to (Shang et al., 2021; Wu et al., 2020), but in contrast (i) our latent graph is dynamically inferred for each input instead of being static for the whole dataset; (ii) it only requires the message exchange at time step t, which renders the graph operation cheaper and modular; and (iii) it optionally reduces the number of edges from O(N^2) to O(NK) when using a bipartite assumption (K ≪ N).

Linear vs. non-linear message passing: A shared potential limitation among the mentioned methods that use GNNs for MTS forecasting (Cao et al., 2021; Li et al., 2017; Seo et al., 2018; Shang et al., 2021; Wu et al., 2019, 2020; Yu et al., 2017; Zhao et al., 2019) is the use of a linear function in the edge operation (Defferrard et al., 2016; Kipf & Welling, 2016). In contrast, we allow for non-linearities in our model. Table 1 provides an overview comparing these two alternatives by writing the linear graph convolutional layer in the same format as the non-linear one (Gilmer et al., 2017) from Section 2.2, where θ_e, θ_h ∈ R^{nf×nf} are learnable parameters. The linear graph convolutional layer is commonly written in matrix form, as we show in Appendix A.3. Note that while the non-linear alternative approximates the edge operation with an MLP φ_e, in the linear case the message operation between two nodes (i, j) only depends on node h_j and is agnostic to the identity of the receiver node i. This can be problematic when inferring relations between two nodes, since the message operation only depends on one of them. Therefore, when no adjacency matrix is provided, an attention mechanism that depends on both i and j can address some of these limitations: m_ij = α(i, j) θ_e h_j. However, note that in the non-linear case, φ_e is a universal approximator that can be interpreted as a generalization of the linear case with attention, φ_e(h_i, h_j) ≈ α(i, j) θ_e h_j. In practice, even if the non-linear case can model m_ij more flexibly, it has been shown in previous literature that attention or edge inference (e.g. Vaswani et al., 2017) can be a strong inductive bias, which in our case is able to infer the graph structure, as we will show in the experiments section.

In Section 2.1 we discussed two families of forecasting methods: global univariate models and multivariate models. We cast our multivariate algorithm as a modular extension of the univariate case from (1). We can break down the univariate model into two main steps, f_u = f_dec ∘ f_enc, such that x_{i,t0:t} --f_enc--> z_i --f_dec--> x̂_{i,t+1:T}, where f_enc encodes the input signal x_{i,t0:t} (and optionally some context information c_i) into an embedding z_i, and f_dec estimates the future signal from this embedding. In our method, we include a multivariate aggregation module AGG between f_enc and f_dec that propagates information among nodes in the latent space z = {z_1, ..., z_N}.
This aggregation module takes as input the embeddings z_i = f_enc(x_{i,t0:t}, c_i) and outputs modified embeddings ẑ = AGG(z) in which information has been propagated among nodes. These new embeddings are then passed as input to the decoder f_dec. The resulting algorithm is:

Univariate Encoder:       z_i = f_enc(x_{i,t0:t}, c_i)   (4)
Multivariate Extension:   ẑ = AGG(z)                      (5)
Univariate Decoder:       x̂_{i,t+1:T} = f_dec(ẑ_i)        (6)

Notice that the overall model is multivariate, but f_enc and f_dec remain univariate, allowing a modular extension of current univariate methods to the multivariate setting. Additionally, in contrast to the most recent methods (Shang et al., 2021; Wu et al., 2020), our model does not propagate information among nodes at every time step [t_0, ..., t] but only in the AGG module. This makes the algorithm cheaper, since the message propagation step is usually an expensive operation. Additionally, we experimented with including a unique identifier of each time series in the encoded signal z_i as context information c_i = id, which resulted in a significant improvement in accuracy, as we will show in the experiments section. In the following subsections we propose two different GNN configurations for the AGG module, depending on the assumed structure of the inferred graph. One is a potentially fully connected graph and the other is a bipartite graph that communicates the N time series through a small set of K auxiliary nodes that we introduce; each has its benefits and weaknesses regarding performance and scalability. An illustration of the whole algorithm is presented in Figure 1.

Of the two configurations, the fully connected graph is the most straightforward to implement given a standard GNN. It results in good performance, but its complexity scales as O(N^2) with respect to the number of nodes N. Despite this, it still resulted in faster computation times than previous O(N^2) methods. We start by defining a fully connected graph G = (V, E) where all nodes exchange messages with each other, such that e_ij = 1 for all e_ij ∈ E. Each time series embedding z_i obtained from the Univariate Encoder (4) is associated with a node v_i ∈ V of the graph. Then, we can directly use the GNN defined in Section 2.2 as the aggregation module AGG, where each embedding z_i is provided as the input h_i^0 to the GNN (3); after the GNN runs for L layers, the output node embedding h_i^L is provided as the input ẑ_i to the Univariate Decoder (6). Moreover, despite the fully connected assumption, the GNN infers attention weights α_ij ∈ (0, 1) (3) for each edge and input sample that "gate" the exchanged messages, m_i = Σ_{j≠i} α_ij m_ij. This can be interpreted as dynamically inferring the graph. The reason is that in GNNs we can write the message aggregation as m_i = Σ_{j∈N(i)} m_ij = Σ_{j≠i} e_ij m_ij, where e_ij is 1 if the edge exists and 0 otherwise. The two expressions Σ_{j≠i} e_ij m_ij and Σ_{j≠i} α_ij m_ij become equivalent when the soft estimation α_ij = φ_α(m_ij) approximates/infers the underlying graph structure e_ij from the data. In the experiments section, we name this model a Fully Connected Graph Neural Network (FC-GNN).
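Below is a hedged sketch of how the encoder, the FC-GNN aggregation module and the decoder of Eqs. (4)-(6) could be wired together, reusing the GCL sketch from Section 2.2 above. The class name FCGNNForecaster, the module sizes and the learned per-series id embedding are assumptions for illustration (the paper only states that a unique identifier is concatenated), not the reference implementation.

```python
import torch
import torch.nn as nn

class FCGNNForecaster(nn.Module):
    """Sketch of Eqs. (4)-(6): univariate encoder -> AGG over a fully connected
    graph -> univariate decoder. Sizes are illustrative."""
    def __init__(self, num_series, context_len, horizon, nf=64, num_layers=2):
        super().__init__()
        self.ids = nn.Parameter(torch.randn(num_series, nf))          # unique id per series (c_i = id)
        self.f_enc = nn.Sequential(nn.Linear(context_len, nf), nn.SiLU(), nn.Linear(nf, nf))
        self.agg = nn.ModuleList([GCL(2 * nf) for _ in range(num_layers)])   # GCL from the sketch above
        self.f_dec = nn.Sequential(nn.Linear(2 * nf, nf), nn.SiLU(), nn.Linear(nf, horizon))

    def forward(self, x):
        # x: (N, context_len) -- the past window of all N series
        N = x.shape[0]
        z = torch.cat([self.f_enc(x), self.ids], dim=-1)              # Eq. (4), id appended as context
        idx = torch.arange(N)                                         # fully connected edges (i != j)
        row, col = torch.meshgrid(idx, idx, indexing="ij")
        mask = row != col
        edge_index = torch.stack([row[mask], col[mask]])
        for layer in self.agg:                                        # Eq. (5)
            z = layer(z, edge_index)
        return self.f_dec(z)                                          # Eq. (6), shape (N, horizon)
```

For METR-LA-like data one would instantiate it as, e.g., FCGNNForecaster(num_series=207, context_len=12, horizon=12) and train it by minimizing the MAE between forecasts and ground truth.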
In the previous section we presented a GNN method that exchanges information among time series under a fully connected graph assumption, which scales computationally as O(N^2). In this section we introduce a bipartite graph assumption (BP-GNN), which reduces the complexity to O(NK), K being a parameter of choice with K ≪ N. To accomplish this, we define a bipartite graph G = (Y, U, E), where Y is a set of N nodes corresponding to the N time series and U is a set of K auxiliary nodes. Nodes Y have associated the time series embeddings z = {z_1, ..., z_N} and auxiliary nodes U have associated the embeddings u = {u_1, ..., u_K}, which are free learnable parameters initialized from a Gaussian distribution. Edges E interconnect all nodes between the two subsets {Y, U}, but there are no connections among nodes within the same subset. This results in 2NK edges. The message passing scheme works in the following way. We input into the GNN the union of the two node subsets V = Y ∪ U. Specifically, the input embedding h^0 defined in Equation 3 is the concatenation of the time series embeddings z with the auxiliary node embeddings u (i.e. h^0 = z || u). Then, messages follow an asynchronous schedule: first, information is propagated from the time series nodes to the auxiliary nodes (Y → U); next, the other way around (U → Y). This conceptually defines the Bipartite Graph Neural Network. In Table 2 we introduce the equations that formally define it as an extension of the standard GNN layer of Equation 3, with Step 1 covering the Y → U direction and Step 2 the U → Y direction. Notice that it can be simply formulated as a two-step process where the indexes i, j belong to one of the subsets U or Y depending on the direction of the messages (Y → U or U → Y). Additionally, we used different learnable parameters between Step 1 and Step 2 in the modules φ_e, φ_α and φ_h, since this resulted in better performance than sharing parameters. Next, we define the adjacency matrices corresponding to the two message passing steps (assuming all inference parameters α_ij = 1): A_1 refers to Y → U and A_2 refers to U → Y. The product of these two matrices, Ã = A_2 A_1 (Appendix A.1), describes the sum of all paths that communicate the time series nodes Y among each other through the auxiliary nodes U. Notice that for K > 0 all nodes in Y can potentially communicate with each other, and BP-GNN is more efficient than FC-GNN as long as 0 ≤ K < N/2.

Recall that our method is composed of three main modules: the encoder f_enc, the decoder f_dec and the aggregation module AGG. We choose to use relatively simple networks as encoder and decoder. The decoder f_dec is defined as a Multi Layer Perceptron (MLP) with a single hidden layer and a residual connection in all experiments. The encoder f_enc is also defined as an MLP for METR-LA, PEMS-BAY and our synthetic datasets, and as a Convolutional Neural Network (CNN) for the other datasets, since these require a larger encoding length and the translation equivariance of CNNs proved to be more beneficial. The encoder f_enc first encodes the input signal x_{i,t0:t} into an embedding vector by using the mentioned MLP or CNN, and then concatenates a unique identifier (c_i = id) to the obtained embedding vector, resulting in z_i. c_i could (optionally) include additional context information if it were provided in the dataset. The combination of these simple networks with our proposed aggregation module AGG fully defines our model. The aggregation module AGG was defined under two different assumptions, the fully connected graph assumption (FC-GNN) and the bipartite graph assumption (BP-GNN). In both cases the GNN is fully parametrized by the networks φ_e, φ_h and φ_α: φ_e is a two-layer MLP, φ_h is a one-layer MLP with a skip connection from the input to the output, and φ_α is just a linear layer followed by a Sigmoid activation function. All architecture choices are explained in more detail in Appendix A.2. We optimized the Mean Absolute Error L = l(x̂_{t+1:T}, x_{t+1:T}) as the training loss.
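The two-step bipartite aggregation described above (Y → U, then U → Y, with separate parameters per step) can be sketched as follows, reusing the GCL layer from the earlier sketch. Names, initialization details and the way nodes without incoming messages are left untouched are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BipartiteAGG(nn.Module):
    """Sketch of one BP-GNN aggregation layer over N series nodes and K auxiliary nodes."""
    def __init__(self, nf, num_aux):
        super().__init__()
        self.u = nn.Parameter(torch.randn(num_aux, nf))   # auxiliary embeddings, Gaussian init
        self.step1 = GCL(nf)                              # Y -> U, its own parameters
        self.step2 = GCL(nf)                              # U -> Y, its own parameters

    def forward(self, z):
        # z: (N, nf) time series embeddings
        N, K = z.shape[0], self.u.shape[0]
        h = torch.cat([z, self.u], dim=0)                 # h^0 = z || u, nodes 0..N-1 are Y
        y_idx, u_idx = torch.arange(N), torch.arange(N, N + K)
        # Step 1: auxiliary nodes receive from all time series nodes (NK edges)
        recv, send = torch.cartesian_prod(u_idx, y_idx).T
        h_new = self.step1(h, torch.stack([recv, send]))
        h = torch.cat([h[:N], h_new[N:]], dim=0)          # only U is updated in Step 1
        # Step 2: time series nodes receive from all auxiliary nodes (NK edges)
        recv, send = torch.cartesian_prod(y_idx, u_idx).T
        h_new = self.step2(h, torch.stack([recv, send]))
        return h_new[:N]                                  # updated embeddings for the N series
```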
We first evaluate our method on the METR-LA and PEMS-BAY datasets (Li et al., 2017), which record traffic speed statistics on the highways of Los Angeles County and the Bay Area respectively. We also consider the publicly available Solar-Energy, Traffic, Electricity and Exchange-Rate datasets. Specifications for each dataset are presented in Table 3 and further detailed in Appendices B and C.1. We compare to the following three main types of baselines:

• Univariate: In this case a single model is trained for all time series, but they are treated independently without message exchange among them. These baselines include a simple linear Auto Regressive model (AR) and a variation of our FC-GNN method that we call NE-GNN, where all edges have been removed such that the aggregation function AGG becomes equivalent to a multilayer perceptron defined by φ_h in (3).
• Multivariate with a known graph: These methods require a previously defined graph; therefore they are restricted to those datasets where an adjacency matrix can be pre-defined (e.g. METR-LA, PEMS-BAY). From this group we compare to DCRNN (Li et al., 2017), STGCN (Yu et al., 2017) and MRA-BGCN (Chen et al., 2020a).
• Multivariate with graph inference or attention: These methods exchange information among different time series by simultaneously inferring relations among them or via attention mechanisms. From this group we compare to LDS (Franceschi et al., 2019), LST-Skip (Lai et al., 2018), TPA-LSTM (Shih et al., 2019), MTGNN (Wu et al., 2020) and GTS (Shang et al., 2021). Graph WaveNet (Wu et al., 2019) also belongs to this group, but unlike the others it jointly uses a pre-defined adjacency matrix. Comparisons to NRI (Kipf et al., 2018) can be found in previous literature (Shang et al., 2021; Zügner et al., 2021).

We additionally include a variation of our FC-GNN without unique node ids, denoted by (w/o id) beside the model name. GTS numbers have been obtained by averaging its official implementation over 3 runs.

In this section we evaluate our method on the METR-LA and PEMS-BAY datasets. For this experiment we used the training setup from GTS (Shang et al., 2021), which uses the dataset partitions and evaluation metrics originally proposed in (Li et al., 2017). All our models have been trained by minimizing the Mean Absolute Error (MAE) between the predicted and ground truth samples. The reported metrics are MAE, Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) from (Li et al., 2017). All metrics have been averaged over 5 runs. All our models (FC-GNN, BP-GNN and NE-GNN) contain 2 graph convolutional layers, 64 features in the hidden layers and Swish activation functions (Ramachandran et al., 2017). Timing results are reported in Table 5, where we include all methods from Table 4 that have a publicly available implementation on the METR-LA and PEMS-BAY datasets. BP-GNN is the most efficient algorithm on both METR-LA and PEMS-BAY by a large margin. FC-GNN is also more efficient than previous methods on both datasets, but it is still limited by its O(N^2) scalability. In Section 5.3 we will see that BP-GNN becomes even more efficient for larger graphs compared to the other methods, thanks to its better scalability.
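For reference, simplified versions of the three reported metrics are sketched below. The reference implementations of (Li et al., 2017) typically mask missing values before averaging; this simplified sketch omits that step.

```python
import torch

def mae(pred, target):
    return (pred - target).abs().mean()

def rmse(pred, target):
    return torch.sqrt(((pred - target) ** 2).mean())

def mape(pred, target, eps=1e-8):
    # reported as a percentage
    return ((pred - target).abs() / (target.abs() + eps)).mean() * 100
```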
In this section, we evaluate our method on the publicly available Solar-Energy, Traffic, Electricity and Exchange-Rate datasets. In contrast to METR-LA and PEMS-BAY, these datasets do not contain spatial information from which a graph can be pre-defined; therefore, methods that rely on a known graph are not directly applicable. To ensure comparability, we use the same training settings as (Lai et al., 2018; Shih et al., 2019; Wu et al., 2020), in which the network is trained to predict one time step into the future (single-step forecasting) at a given horizon (3, 6, 12 or 24) by minimizing the Mean Absolute Error (MAE). All datasets have been split into 60%/20%/20% for training/validation/test respectively. As proposed in (Lai et al., 2018), we use the Root Relative Squared Error (RSE) and Empirical Correlation Coefficient (CORR) as evaluation metrics (both defined in Appendix C.2). All our models (FC-GNN, BP-GNN and NE-GNN) contain 2 graph convolutional layers and 128 features in the hidden layers. The timing results have been obtained with batch size 16. Further implementation details are provided in Appendix C.1. Results with standard deviations are provided in Appendix C.3.

The results in Table 6 are consistent with the previous experiment. FC-GNN outperforms all previous works in most metrics. There is a significant improvement over NE-GNN, which demonstrates that sharing information among nodes is beneficial (except on the Exchange-Rate dataset, which only contains 8 nodes). BP-GNN performs better than previous methods in most metrics and close to FC-GNN. Regarding the timing results (Table 7), BP-GNN is the most efficient graph inference method by a large margin in most datasets. The larger the number of nodes in the dataset, the larger the computational improvement of BP-GNN w.r.t. other methods, due to its linear scalability when K = 4. For example, on Solar-Energy (137 nodes), BP-GNN is 1.43 times faster than FC-GNN and 1.92 times faster than MTGNN. Accordingly, on a larger dataset such as Electricity (321 nodes), BP-GNN is 6.38 times faster than FC-GNN and 9.18 times faster than MTGNN.

In this section we study the adjacency matrices inferred by FC-GNN. For this purpose, we generated synthetic datasets with different numbers of nodes N and T = 10,000 timesteps each. (a) The "Cycle Graph" dataset samples each series value x_{i,t} from the past (t − 5) of another series (i − 1 mod N) of the panel; the resulting adjacency matrix is a directed cycle graph. More formally, the dataset is generated from the Gaussian distribution x_{i,t} ∼ N(β·x_{(i−1 mod N), t−5}; σ²). (b) The "Correlated Sinusoids" dataset generates each series as a weighted sum of sinusoids whose amplitudes and frequencies are shared among subsets of series, defining dependencies among them (the exact generation process of both datasets is given in Appendix D.1). We build the inferred adjacency matrices from the values A_ij = α_ij and, since our graph inference mechanism is dynamic, average them over 10 timesteps t. Further implementation details are described in Appendix D.2. Results are reported in Figure 2 for synthetic datasets whose numbers of nodes range from N = 10 to N = 250. FC-GNN perfectly infers the ground truth adjacencies in all reported datasets and also achieves the lowest MAE test loss. For the adjacency matrices inferred by BP-GNN, it is harder to develop a visual intuition given the auxiliary nodes, so we omit them from our exposition. However, note that its overall accuracy is on par with FC-GNN (especially for datasets with a dense adjacency) and much improved when compared to NE-GNN. So, BP-GNN leverages cross-time-series information effectively. Additionally, we provide visualizations of the inferred graphs in METR-LA in Appendix D.3, where we demonstrate the dynamic inference behavior of the graph by plotting the inferred adjacency at different time steps t.
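A small sketch of the "Cycle Graph" generator described above, following the stated formula x_{i,t} ∼ N(β·x_{(i−1 mod N), t−5}, σ²). The initialization of the first lag steps and the seed handling are assumptions; the 6K/2K/2K split follows Appendix D.1.

```python
import numpy as np

def cycle_graph_dataset(N=10, T=10_000, beta=0.9, sigma=0.5, lag=5, seed=0):
    """Each series i at time t depends on series (i-1) mod N at time t-lag."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, size=(N, T))            # noise-initialized panel (assumption)
    for t in range(lag, T):
        for i in range(N):
            parent = x[(i - 1) % N, t - lag]
            x[i, t] = rng.normal(beta * parent, sigma)
    return x                                            # ground-truth graph: directed cycle

panel = cycle_graph_dataset()
train, val, test = panel[:, :6000], panel[:, 6000:8000], panel[:, 8000:]
```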
In all previous sections we set a relatively small number of auxiliary nodes (K = 4) for BP-GNN. In this section we analyze BP-GNN performance as we sweep over different K values (Figure 3). We noticed that for sparser graph datasets such as "Cycle Graph", increasing K is more beneficial than for less sparse graph datasets. In practice, in the previous experiments, we chose K to be relatively small, since in real-world datasets such as METR-LA small K values already provide competitive accuracy with a good trade-off in complexity (see also Figure 4).

We presented a novel approach for multivariate time series forecasting which lends itself to easy integration into existing univariate approaches by way of a graph neural network (GNN) component. This GNN infers a latent graph in the space of the embeddings of the univariate time series and can be inserted into many neural-network based forecasting models. We show that this additional graph/dependency structure improves forecasting accuracy both in general and in particular when compared to the state of the art. We alleviate typical scalability concerns for GNNs by allowing the introduction of auxiliary nodes in the latent graph, which we construct as a bipartite graph that bounds the computational complexity. We show that this leads to computationally favorable properties as expected, while the trade-off with forecasting accuracy is manageable.

Here, we define the adjacency matrices corresponding to the two message passing steps of the bipartite graph for N time series nodes and K auxiliary nodes (assuming all inference parameters α_ij = 1). We also present the product of both matrices, Ã = A_2 A_1. The first N rows of Ã can be interpreted as the sum of all paths that communicate the N time series nodes among each other after applying the two message passing updates A_1 and A_2.

In this section we will use the shortcut MLP_res for a one-layer MLP with a residual connection, which we define as:

Input → { Linear(nf, nf/2) → Swish() → Linear(nf/2, nf) → Addition(Input) } → Output,

where "nf" represents the number of features.

Decoder f_dec: Given the above MLP_res definition, the decoder consists of one residual MLP followed by a linear layer:

ẑ_i → { MLP_res → Linear } → x̂_{i,t+1:T}

Encoder f_enc: The encoder for METR-LA and PEMS-BAY consists of a linear layer and two consecutive MLP_res blocks; notice that the overall structure can be considered an MLP with residual connections among some of its layers:

x_{i,t0:t} → { Linear → MLP_res → MLP_res } → z_i

For the other datasets (Solar-Energy, Traffic, Electricity and Exchange-Rate) we used a Convolutional Neural Network built from CNN_res blocks as the encoder, where CNN_res follows the same architecture as MLP_res but replaces the linear layers by Conv1d layers with stride=1 and kernel size=1. In other words, both architectures are equivalent, since a CNN with kernel size and stride 1 is equivalent to an MLP that broadcasts over the batch size and sequence length. Formally, we write CNN_res as:

Input → { conv1d(nf, nf/2, 1, 1) → Swish() → conv1d(nf/2, nf, 1, 1) → Addition(Input) } → Output

Edge update φ_e: Consists of a two-layer MLP. We divide the number of features by two in the intermediate layer for efficiency.

We can define a linear graph convolutional layer H^{l+1} = GCL(H^l, A) as:

H^{l+1} = σ(A H^l θ_e + I H^l θ_h),

where A ∈ {0, 1}^{N×N} is the adjacency matrix of the graph, H^l ∈ R^{N×nf} is a matrix of node features with N nodes and "nf" features per node, I ∈ {0, 1}^{N×N} is an identity matrix, and θ_e, θ_h ∈ R^{nf×nf} are the learnable parameters.
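A minimal sketch of the MLP_res block and the decoder composition from Appendix A.2 above; layer sizes follow the description, while the helper names are hypothetical.

```python
import torch.nn as nn

class MLPRes(nn.Module):
    """Residual MLP block: Linear(nf, nf/2) -> Swish -> Linear(nf/2, nf) -> add input."""
    def __init__(self, nf):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(nf, nf // 2), nn.SiLU(), nn.Linear(nf // 2, nf))

    def forward(self, x):
        return x + self.net(x)

def make_decoder(nf, horizon):
    # one residual MLP followed by a linear layer mapping to the forecast horizon
    return nn.Sequential(MLPRes(nf), nn.Linear(nf, horizon))
```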
Table 8: METR-LA and PEMS-BAY specifications.

In this experiment (METR-LA and PEMS-BAY datasets) we used the exact same training configuration as (Shang et al., 2021). Given the whole training time series panel of dimensionality (N × Length), where Length is the number of samples of the training partition, we construct the input to the network by uniformly slicing windows x_{t0:t} ∈ R^{N×12} of length 12. The ground truth labels are obtained in the same way, by slicing the next 12 time steps x_{t:t+12} ∈ R^{N×12}. The same slicing process is done for validation and test with their respective partitions. The sliced windows are uniformly distributed over the whole time series panel. All our models (FC-GNN, BP-GNN, NE-GNN) have been trained with the Adam optimizer, batch size 16 and 2 layers in the GNN module. The number of features "nf" in the hidden layers is 64. The learning rates and numbers of epochs are provided in Table 9. Timing results have been obtained with batch size 16 on a Tesla V100-SXM GPU. All hyperparameters have been tuned on the validation partition using the following search spaces. The learning rate search space was lr ∈ {1·10^−4, 2·10^−4, 5·10^−4, 1·10^−3, 2·10^−3, 5·10^−3}. The number of features per layer "nf" was chosen among {32, 64, 128}. The number of layers was chosen among {1, 2, 3, 4}. The learning rate decay and the decay factor were not modified and were left the same as in the original code from (Shang et al., 2021). The number of epochs was chosen large enough to allow early stopping on the validation partition once the validation loss stops decreasing. Regarding the baselines reported in this experiment, LDS (Franceschi et al., 2019) and STGCN are not evaluated on METR-LA and PEMS-BAY in their original papers; therefore, for STGCN we used the results provided in the Graph WaveNet paper (Wu et al., 2019), which were computed by the Graph WaveNet authors themselves, and for LDS we used the results provided in the GTS work (Shang et al., 2021). For the DCRNN timing results we used the unofficial PyTorch implementation https://github.com/chnsh/DCRNN_PyTorch.

In this subsection we report the standard deviations for the METR-LA and PEMS-BAY results. For our models, at the three forecast horizons:

Model           MAE          RMSE         MAPE            MAE          RMSE         MAPE            MAE          RMSE         MAPE
FC-GNN          2.60 ± .02   5.19 ± .06   6.78% ± .12     2.95 ± .02   6.15 ± .08   8.14% ± .16     3.35 ± .03   7.14 ± .09   9.73% ± .27
BP-GNN (K=4)    2.64 ± .01   5.37 ± .03   7.07% ± .08     3.02 ± .02   6.42 ± .05   8.46% ± .08     3.40 ± .02   7.32 ± .05   9.91% ± .19

In this experiment we used the exact same training process as (Wu et al., 2020). Similarly to METR-LA and PEMS-BAY, the sample construction consists of windows uniformly sliced from the time series panel and fed to the network. Specifically, in this experiment, given the training time series panel of dimensionality (N × Length), where Length is the number of samples of the training partition, we slice windows x_{t0:t} ∈ R^{N×168} of length 168, which are the input to our network. In this experiment we only forecast one time step into the future, such that the length of the forecasts and ground truth labels is 1. All our models (FC-GNN, BP-GNN, NE-GNN) have been trained with the Adam optimizer. The number of features "nf" in the hidden layers is 128 and the number of layers in the GNN is 2. The encoder network in this experiment is the Convolutional Neural Network previously described in Appendix A. All hyperparameters have been tuned on the validation partition using the following search spaces. The learning rate search space was lr ∈ {5·10^−5, 1·10^−4, 2·10^−4, 5·10^−4, 1·10^−3, 2·10^−3}. The number of features nf was chosen among {32, 64, 128} and the number of layers among {1, 2, 3, 4, 8}, trading off accuracy and efficiency. The number of epochs was chosen large enough that the validation loss would stop decreasing. The batch size was set to 4, except for Traffic, where it was reduced to 2 in order to fit the more expensive explored configurations in memory.
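The window-slicing sample construction described in this appendix can be sketched as follows; the function name and the stride argument are illustrative.

```python
import torch

def slice_windows(panel, context_len, horizon, stride=1):
    """Slide a window over an (N, Length) panel and return (inputs, targets) of
    shapes (num_windows, N, context_len) and (num_windows, N, horizon)."""
    N, length = panel.shape
    inputs, targets = [], []
    for t0 in range(0, length - context_len - horizon + 1, stride):
        t = t0 + context_len
        inputs.append(panel[:, t0:t])              # x_{t0:t}
        targets.append(panel[:, t:t + horizon])    # ground truth x_{t:t+horizon}
    return torch.stack(inputs), torch.stack(targets)

# e.g. METR-LA setup: context of 12 steps, predict the next 12 steps
# x, y = slice_windows(train_panel, context_len=12, horizon=12)
```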
The evaluation metrics for these experiments are exactly the same as in Lai et al. (2018), Shih et al. (2019) and Wu et al. (2020). The following definitions are extracted from Lai et al. (2018). The Root Relative Squared Error (RSE) is defined as:

RSE = sqrt( Σ_{(i,t) ∈ Ω_test} (x_{i,t} − x̂_{i,t})² ) / sqrt( Σ_{(i,t) ∈ Ω_test} (x_{i,t} − mean(x))² ),

and the Empirical Correlation Coefficient (CORR) is defined as:

CORR = (1/N) Σ_i [ Σ_t (x_{i,t} − mean(x_i)) (x̂_{i,t} − mean(x̂_i)) / sqrt( Σ_t (x_{i,t} − mean(x_i))² · Σ_t (x̂_{i,t} − mean(x̂_i))² ) ],

where Ω_test is the set of test indices and mean(·) denotes the empirical mean over the test period.

In this subsection we report the standard deviations for the Solar-Energy, Traffic, Electricity and Exchange-Rate results.

In the inferred graph analysis experiment we presented two synthetic datasets, "Cycle Graph" and "Correlated Sinusoids". Next, we provide a more detailed explanation of how these datasets have been generated:

• Cycle Graph: This dataset consists of a panel of N = 10 time series of length T = 10,000, where each time series i at time step t has been sampled from the past (t − 5) of another time series (i − 1 mod N) from the same panel. The resulting multivariate time series adjacency matrix is a directed cycle graph, where each variable in the panel depends on the previously indexed one. Formally, the generation process is written as x_{i,t} ∼ N(β·x_{(i−1 mod N), t−5}; σ²), where β = 0.9 and σ = 0.5.
• Correlated Sinusoids: Motivated by the Discrete Sine Transform, this dataset consists of time series of length T = 10,000, each generated as a weighted sum of sinusoids with different frequencies and amplitudes. More formally, we define a time series x_i as x_{i,t} = Σ_{m=1}^{M} B_{i,m} sin(w_{i,m}·t) + ε_{i,t}, where B̃_{i,m} is sampled from a uniform distribution B̃_{i,m} ∼ U(0, 1) and further normalized as B_{i,m} = B̃_{i,m} / Σ_k B̃_{i,k}; w_{i,m} is also sampled from a uniform distribution w_{i,m} ∼ U(0, 0.2), and ε_{i,t} is sampled from a Gaussian distribution ε_{i,t} ∼ N(0, 0.2²). Finally, B_{i,m} and w_{i,m} are shared among different variables (i.e. time series) in the panel, defining dependencies among them. We choose M = 3.

The time series in both datasets have been split into train/val/test as 6K/2K/2K. In Figures 5 and 6 we plot the first 200 timesteps of the training sets of the "Cycle Graph" and "Correlated Sinusoids" datasets respectively. For "Correlated Sinusoids" we plot the specific case of N = 10 nodes and two clusters with five nodes each, C = {5, 5}.

In the inferred graph analysis, all models have been trained for 100 epochs with a learning rate of 2·10^−3 and weight decay 10^−14. The context length (input length of the sequence) is 6 and the prediction length (output length of the sequence) is 1. The Mean Absolute Error between prediction and ground truth has been minimized for training. We also added a small regularization term to the loss, R = (10^−8 / #edges) Σ_{i,j} |A_{i,j}|, that pushes unnecessary edges closer to 0 for sharper visualizations. In both BP-GNN and FC-GNN we used the same MLP encoder as the one used in the METR-LA and PEMS-BAY experiments, defined in Appendix A.2. The graph convolutional module consists of only 1 layer. Since the edge inference is dynamic, we obtained the reported adjacency matrices by averaging over 10 different values of t.
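A sketch of the "Correlated Sinusoids" generator under the description above. The exact functional form of the sinusoid sum (here Σ_m B_{i,m} sin(w_{i,m}·t)) and the per-cluster sharing scheme are reconstructions from the text, not the authors' exact code.

```python
import numpy as np

def correlated_sinusoids(clusters=(5, 5), T=10_000, M=3, seed=0):
    """Each series is a weighted sum of M sinusoids plus Gaussian noise; amplitudes B
    and frequencies w are shared within a cluster of series, creating dependencies."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    series = []
    for size in clusters:
        B = rng.uniform(0.0, 1.0, size=M)
        B = B / B.sum()                          # normalized amplitudes
        w = rng.uniform(0.0, 0.2, size=M)        # shared frequencies
        base = (B[:, None] * np.sin(w[:, None] * t[None, :])).sum(axis=0)
        for _ in range(size):
            series.append(base + rng.normal(0.0, 0.2, size=T))   # eps ~ N(0, 0.2^2)
    return np.stack(series)                      # panel of shape (N, T)

panel = correlated_sinusoids()                   # N = 10 series, two clusters of five
```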
In this section we analyze the adjacency matrices inferred by our FC-GNN method on METR-LA, which contains 207 nodes. As in the previous synthetic experiment, we build the FC-GNN adjacency matrices from the inferred values A_ij = α_ij. We used the exact same training settings as in the main experiment of Section 5.2, but this time we only used one graph layer in the FC-GNN module, from which we obtained the α_ij values. The provided adjacency has been averaged over 10 different t values. In METR-LA, sensors are spatially located around a city. Therefore, just for comparison, we also report a matrix built from the distance between each pair of sensors. We call it A_sim, where each entry A_sim[i, j] is proportional to the negative distance between the two sensors: A_sim[i, j] = bias − dist(i, j). We may expect to see some correlation between this matrix and the inferred one, but this does not have to be the case. Close sensors can be correlated, but far-away sensors can also be correlated (e.g. we can expect distant residential areas to show a simultaneous increase in traffic right before working hours). Both matrices, A_sim and the inferred adjacency A, are presented in the following figure. Additionally, our model infers matrices dynamically for each t. This means that the inferred adjacency matrix can change depending on the input x_{t0:t} for different t values. Next, we plot three different matrices for different t values and see that, although the overall adjacencies share similarities across different t, some components differ as we change t.
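A small sketch of how the A_sim comparison matrix could be built from sensor coordinates; the choice of the bias offset is an assumption (the text only states that entries are proportional to the negative pairwise distance).

```python
import numpy as np

def similarity_matrix(coords, bias=None):
    """A_sim[i, j] = bias - dist(i, j) for an (N, 2) array of sensor positions."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    if bias is None:
        bias = dist.max()          # assumption: offset chosen so entries are non-negative
    return bias - dist
```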
References:
Sales demand forecast in e-commerce using a long short-term memory neural network methodology
Probabilistic demand forecasting at scale
Spectral networks and locally connected networks on graphs
Spectral temporal graph neural network for multivariate time-series forecasting
Multi-range attentive bicomponent graph convolutional network for traffic forecasting
Probabilistic forecasting with temporal convolutional neural network
Discovering symbolic models from deep learning with inductive biases
Normalizing Kalman filters for multivariate time series analysis
Convolutional neural networks on graphs with fast localized spectral filtering
Graph neural network-based anomaly detection in multivariate time series
MQTransformer: Multi-horizon forecasts with context dependent and feedback-aware attention
Learning discrete structures for graph neural networks
SE(3)-Transformers: 3D roto-translation equivariant attention networks
Few-shot learning with graph neural networks
The dynamic factor analysis of economic time series. Latent variables in socio-economic models
Neural message passing for quantum chemistry
Forecasting: Principles and practice
Criteria for classifying forecasting methods
AI in healthcare: time-series forecasting using statistical, neural, and ensemble architectures
Differentiable graph module (DGM) for graph convolutional networks
Neural relational inference for interacting systems
Semi-supervised classification with graph convolutional networks
Self-attention between datapoints: Going beyond individual input-output pairs in deep learning
Deep Rao-Blackwellised particle filters for time series forecasting
Modeling long- and short-term temporal patterns with deep neural networks
Designing and managing the supply chain: concepts, strategies, and case studies
Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting
Diffusion convolutional recurrent neural network: Data-driven traffic forecasting
Elastic machine learning algorithms in Amazon SageMaker
Time-series forecasting with deep learning: a survey
Temporal fusion transformers for interpretable multi-horizon time series forecasting
Principles and algorithms for forecasting groups of time series: Locality and globality
Trend analysis of climate time series: A review of methods
Forecasting: theory and practice
Searching for activation functions
Deep state space models for time series forecasting
Multivariate probabilistic time series forecasting via conditioned normalizing flows
Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting
High-dimensional multivariate forecasting with low-rank Gaussian copula processes
Probabilistic forecasting with autoregressive recurrent networks
E(n) equivariant graph neural networks
Structured sequence modeling with graph convolutional recurrent networks
Discrete graph structure learning for forecasting multiple time series
Temporal pattern attention for multivariate time series forecasting
Attention is all you need
Graph attention networks
Deep factors for forecasting
Dynamic graph CNN for learning on point clouds
A multi-horizon quantile recurrent forecaster
Graph WaveNet for deep spatial-temporal graph modeling
Connecting the dots: Multivariate time series forecasting with graph neural networks
Hierarchical graph representation learning with differentiable pooling
Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting
Correlation-aware change-point detection via graph neural networks
A temporal graph convolutional network for traffic prediction
A study of joint graph inference and forecasting