key: cord-0273748-7wmve4fe
authors: Gao, Junyi; Xiao, Cao; Glass, Lucas M.; Sun, Jimeng
title: PopNet: Real-Time Population-Level Disease Prediction with Data Latency
date: 2022-02-07
journal: nan
DOI: 10.1145/3485447.3512127
sha: 9f65cb48de2848bfadca56abcf4337afe501f679
doc_id: 273748
cord_uid: 7wmve4fe

Population-level disease prediction estimates the number of potential patients of particular diseases in some location at a future time based on (frequently updated) historical disease statistics. Existing approaches often assume the existing disease statistics are reliable and will not change. However, in practice, data collection is often time-consuming and has time delays, with both historical and current disease statistics being updated continuously. In this work, we propose a real-time population-level disease prediction model which captures data latency (PopNet) and incorporates the updated data for improved predictions. To achieve this goal, PopNet models real-time data and updated data using two separate systems, each capturing spatial and temporal effects using hybrid graph attention networks and recurrent neural networks. PopNet then fuses the two systems using both spatial and temporal latency-aware attentions in an end-to-end manner. We evaluate PopNet on real-world disease datasets and show that PopNet consistently outperforms all baseline disease prediction and general spatial-temporal prediction models, achieving up to 47% lower root mean squared error and 24% lower mean absolute error compared with the best baselines.

Population-level disease prediction is of great significance to society, since early forecasting of new disease counts at each location can help governments or healthcare providers better optimize medical resources [28] or inform where to build clinical trial sites [9, 29]. Compared with individual-level disease prediction, which predicts disease risk for each patient based on their health records [2, 4, 10-12, 20], population-level disease prediction is usually based on frequently updated online historical disease statistics collected from certain locations or population groups [8, 9].

Many machine learning and deep learning models have been developed to leverage patient data for individual disease prediction [2-5, 10, 12, 20]. However, they cannot be applied to population-level disease prediction because they require access to individual patient data. Meanwhile, existing population-level prediction models are mostly developed for infectious diseases. For example, epidemiology models such as the Susceptible-Infectious-Recovered (SIR) model were proposed for population-level infectious disease prediction [17, 27, 37]. Recently, several works further proposed augmenting such epidemiology models with deep neural networks to capture spatial and temporal patterns [8, 9].

Existing population-level prediction models often assume that their model inputs (e.g., historical disease statistics) are reliable and accurate, which is often not true. In practice, data collection is time-consuming and subject to delays, so disease statistics require continuous updates to become more accurate [7, 30, 34]. Such data latency needs to be considered in population-level predictions. However, tackling this issue is not straightforward. There are two main challenges:

• Incorporating the updated data into the real-time model. From the temporal perspective, a real-time model needs to be updated whenever an update is made to the historical data. From the spatial perspective, different locations may be updated at different frequencies. We need to handle these idiosyncratic updates in our model.

• Extracting data updating patterns. Data latency can arise for various reasons, for example geographic and demographic proximity between different locations, which makes data updating patterns complex and difficult for the model to use when making predictions. The noise and spatial-temporal correlation of different data streams add to the difficulty of extracting data updating patterns.

To address these challenges, we propose a population-level disease prediction model (PopNet) which captures data latency and incorporates the updated data for improved predictions. PopNet is enabled by the following technical contributions.

• Dual data modeling systems to incorporate updated data into the real-time model. PopNet models real-time data and updated data using two separate systems, each capturing spatial and temporal effects using hybrid graph attention networks (GAT) and recurrent neural networks (RNN). PopNet then adaptively fuses the two systems using both spatial and temporal latency-aware cross-graph attentions in an end-to-end manner. To the best of our knowledge, this is the first work to incorporate updated data in spatio-temporal models.

• Extracting data updating patterns to enrich the spatial and temporal latency-aware attention. We identify three major data updating patterns. (1) Spatial correlation: geographically close locations may have similar data updating patterns, and locations with similar populations may also have similar characteristics [9, 23]. (2) Seasonality: the data updating patterns may be temporally periodic. (3) Disease correlation: disease comorbidities may lead to similar updating patterns. We enrich the spatial and temporal latency-aware attentions with these patterns, allowing the model to incorporate them adaptively.

• Efficient model update. PopNet can be trained efficiently on newly added data via a better initialization for the hidden states of the RNN. As a result, PopNet can use previous historical patterns without reprocessing old data, which improves efficiency when the training sequences are long and simplifies deployment in real-world healthcare systems.

We evaluate PopNet on real-world online medical claims datasets with real-time and update records and a simulated synthetic dataset. Compared to the best baseline model, PopNet achieves up to 47% lower root mean squared error (RMSE) and 24% lower mean absolute error (MAE) on two real-world disease prediction tasks.

Over the years, spatial-temporal prediction models have been developed for application tasks such as traffic prediction [13, 15, 38, 39], disease prediction [8, 9, 16], regional demand prediction [38] and general time-series prediction [1]. The recent success of deep learning models, especially GNNs and RNNs, promises better modeling of complex spatial and temporal features. Many studies combine graph structures with disease statistics to model regional and temporal disease propagation and achieve more accurate predictions. For example, Deng et al. [8] proposed a location attention mechanism and a graph message passing framework to predict influenza-like illness for different locations. Gao et al. [9] incorporated clinical claims data in a graph attention network to predict the COVID-19 pandemic and used disease transmission dynamics to regularize RNN predictions. These models achieve good performance on their well-collected datasets. Compared with general spatio-temporal prediction works, our work focuses on incorporating updated data into the spatio-temporal model, since in practice the input data is not always reliable due to latency or errors and may be updated in the future. We believe this scenario is common in web data and real-world settings.

Considering broader spatial-temporal prediction models in other fields such as traffic prediction, most works also use graph neural networks to extract spatial features and use RNNs or attention mechanisms to extract temporal features [19, 25, 32, 35, 36]. Those works also do not consider data latency in their model design. For example, the traffic prediction model GMAN [39] leverages the node2vec approach to preserve graph structure in node embeddings and then samples the neighboring nodes to obtain the embedding. Guo et al. [13] proposed ASTGCN to extract multi-scale temporal features by training three network branches that receive hour-level, day-level, and week-level data. In our work, we enrich the model with spatial and temporal background information, so that the model adaptively extracts spatial relationships among both close nodes and distant but similar nodes, and at multiple time scales.

Definition 1 (Disease statistics data). The disease statistics data are collected from medical claims or online reports of local health departments in different locations. They can be represented as a 3D tensor $\mathbf{X} \in \mathbb{R}^{N \times T \times F}$, where $N$ denotes the number of locations, $T$ is the total number of timesteps, and $F$ is the number of features (i.e., diseases). Matrices $\mathbf{X}^{t} \in \mathbb{R}^{N \times F}$ and $\mathbf{X}_{i} \in \mathbb{R}^{T \times F}$ denote slices of the tensor $\mathbf{X}$ along the time dimension and the location dimension, respectively. Vector $\mathbf{x}_{i}^{t} \in \mathbb{R}^{F}$ denotes the slice of the matrix $\mathbf{X}^{t}$ at the $i$-th location.

Definition 2 (Updated disease data). The real-time statistics may be unreliable due to time delays in the data collection process, so every tensor element in $\mathbf{X}$ may be updated at a future timestep. For example, for a specific location $i$ at timestep $t$, after we obtain the initial disease statistics for this timestep, we may continue to receive updates to the statistics of timestep $t$ at future timesteps $t+1, t+2, \ldots$. All the updated values constitute the updated disease data $\mathbf{U} \in \mathbb{R}^{N \times T \times F}$, a 3D tensor with the same shape as $\mathbf{X}$ ($N$ locations, $T$ timesteps, $F$ features). As with the original disease data, we use $\mathbf{U}^{t} \in \mathbb{R}^{N \times F}$, $\mathbf{U}_{i} \in \mathbb{R}^{T \times F}$ and $\mathbf{u}_{i}^{t} \in \mathbb{R}^{F}$ to denote slices of the updated data tensor. Here $\mathbf{u}_{i}^{t_1}$ refers to data updated for location $i$ at a future time $t_1$. Suppose it replaces the original disease data $\mathbf{x}_{i}^{t_0}$, where $t_1 > t_0$; we define the update latency as $\Delta t = t_1 - t_0$. All update latencies are aggregated into a 2D matrix $\Delta\mathbf{t} \in \mathbb{R}^{N \times T}$. A value of 0 in $\mathbf{U}$ means no updates for those tensor elements.
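To make the bookkeeping in Definition 2 concrete, the following minimal sketch builds an updated-data table and a latency table from a list of update events for a single disease feature. The event format `(location, t0, t1, value)`, the function name `apply_updates`, and the choice to keep only the most recent update per element are illustrative assumptions, not details given in this paper.

```python
def apply_updates(num_locations, num_timesteps, events):
    """Build updated values U and latencies delta_t for one disease feature.

    events: iterable of (location, t0, t1, value), meaning that at time t1
    an update `value` arrived for the statistic originally reported at t0.
    Assumption: a later update for the same element overwrites an earlier one.
    """
    # 0 marks "no update received", matching the convention in Definition 2.
    U = [[0.0] * num_timesteps for _ in range(num_locations)]
    delta_t = [[0] * num_timesteps for _ in range(num_locations)]
    # Process events in order of arrival time t1.
    for loc, t0, t1, value in sorted(events, key=lambda e: e[2]):
        if t1 <= t0:
            raise ValueError("an update must arrive after the original report")
        U[loc][t0] = value
        delta_t[loc][t0] = t1 - t0  # update latency
    return U, delta_t


# Toy example: location 0 receives two updates for timestep 1.
events = [(0, 1, 2, 12.0), (0, 1, 4, 15.0), (1, 0, 3, 7.0)]
U, delta_t = apply_updates(num_locations=2, num_timesteps=5, events=events)
```

In this toy run the second update for location 0 at timestep 1 replaces the first one, so the stored latency for that element is the latency of the most recent update.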

Definition 3 (Location graph). A location graph can be modeled as an undirected graph $G = (V, E, \mathbf{A})$, where $V$ is the set of $|V| = N$ location nodes, $E$ is the set of edges, and $\mathbf{A}$ denotes the adjacency matrix of the graph. The edges are computed based on the geographical and demographic proximity between locations, which is introduced in detail in the following sections.

Problem 1 (Spatial-temporal disease prediction). Given historical original disease statistics $\mathbf{X}$ and updated disease data $\mathbf{U}$ up to timestep $T$, the population-level spatial-temporal disease prediction task is a regression task: predict the future ground-truth number of cases of a certain disease, $\mathbf{Y} \in \mathbb{R}^{N}$, for all $N$ locations at timestep $T+1$. We also support multi-step prediction for the next $L$ steps, from timestep $T+1$ to $T+L$.

As shown in Fig. 1, PopNet models real-time data X and updated data U using two separate systems, and then adaptively fuses the two systems using both spatial and temporal latency-aware cross-graph attention. Below we introduce PopNet in more detail.

We model the real-time data X and updated data U using separate systems so that each has enough model capacity to extract patterns from its own data source. We employ graph attention networks (GAT) [33] to leverage spatial relations between locations. This way, the prediction for a target location can be improved by using spatial disease patterns discovered in nearby or similar locations. We use two graph attention networks to process X and U, respectively.

Here $\mathbf{X}$ and $\mathbf{U}$ share the same undirected graph design $G(V, E, \mathbf{A})$. In graph $G$, the nodes indicate locations, while the edge connecting nodes $i$ and $j$, denoted as $e_{ij}$, is the similarity between nodes $i$ and $j$ such that $e_{ij} = p_i^{\alpha} p_j^{\beta} \exp(-d_{ij}/r)$. Here $d_{ij}$ is the distance between nodes $i$ and $j$, $p_i$ is the population size of node $i$, and $\alpha$, $\beta$, $r$ are hyper-parameters. We use a threshold value $\epsilon$ to calculate the graph adjacency matrix as in Eq. (1),

For simplicity, we focus our discussion in this section on a specific timestep $t$ and thus omit the superscript $t$. Accordingly, the attention score between nodes $i$ and $j$ is computed as in Eq. (2),

where the projection weight matrices (with $|z|$ rows) and the attention weight vectors (of dimension $|z| + |z|$) of the two GAT networks are learned attention parameters, the activation function is LeakyReLU, and $(\cdot|\cdot)$ denotes the concatenation operation. We then use the softmax function to normalize the obtained attention score as in Eq. (3),

where $\mathcal{N}(i)$ denotes the set of one-hop neighbors of node $i$. Likewise, we use the multi-head attention mechanism [33] to enrich the model capacity by calculating $K$ independent attention scores, where $K$ is the number of attention heads. We obtain the aggregated node embedding as given by Eq. (4),

where the weight matrices for the $k$-th attention head in the two GATs each have output dimension $|g|$. Therefore, for each node $i$, we obtain two node embeddings, one for the real-time data X and one for the updated data U.
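The graph construction and the neighbor attention described above can be sketched in a few lines. This is a minimal single-head illustration, not the PopNet implementation: it assumes a gravity-style similarity of the form $p_i^{\alpha} p_j^{\beta} \exp(-d_{ij}/r)$, assumes an edge exists when the similarity in either direction reaches the threshold (so that the graph stays undirected), and uses hypothetical function names, toy populations and distances, and an arbitrary threshold of 1.0.

```python
import math


def edge_weight(pop_i, pop_j, dist, alpha, beta, r):
    # Assumed gravity-style similarity: larger populations and shorter
    # distances give a larger weight.
    return (pop_i ** alpha) * (pop_j ** beta) * math.exp(-dist / r)


def adjacency(pops, dists, alpha, beta, r, threshold):
    """Thresholded adjacency matrix; symmetrised to keep the graph undirected."""
    n = len(pops)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w_ij = edge_weight(pops[i], pops[j], dists[i][j], alpha, beta, r)
            w_ji = edge_weight(pops[j], pops[i], dists[j][i], alpha, beta, r)
            if max(w_ij, w_ji) >= threshold:
                A[i][j] = A[j][i] = 1
    return A


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_weights(i, z, A, a):
    """Softmax-normalised attention of node i over its one-hop neighbours.

    z: projected node features (one list per node).
    a: attention vector whose length is twice the length of each z vector.
    """
    neighbours = [j for j in range(len(z)) if A[i][j]]
    if not neighbours:
        return {}
    scores = []
    for j in neighbours:
        concat = z[i] + z[j]  # list concatenation, i.e. (z_i | z_j)
        scores.append(leaky_relu(sum(w * v for w, v in zip(a, concat))))
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {j: e / total for j, e in zip(neighbours, exps)}


# Toy example with three locations.
pops = [1000.0, 800.0, 500.0]
dists = [[0.0, 10.0, 60.0], [10.0, 0.0, 200.0], [60.0, 200.0, 0.0]]
A = adjacency(pops, dists, alpha=0.35, beta=0.37, r=30.0, threshold=1.0)
z = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
a = [0.1, 0.2, 0.3, 0.4]
weights = attention_weights(0, z, A, a)
```

In this toy setting locations 1 and 2 are far apart and are not connected, so location 0 attends over both of its neighbors while locations 1 and 2 each attend only to location 0.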

After obtaining all the node embeddings for the updated data and the real-time data, we would like to use the updated historical data to make better predictions. However, this is not straightforward. The latency in data updating can differ between the two embeddings of the same node. Hence, directly concatenating or summing the two embeddings may confuse the prediction network and lead to inferior predictions. In addition, latency varies across locations because different locations can be updated at different frequencies. To incorporate these complex latency patterns, we design the spatial latency-aware attention (S-LAtt) to fuse spatial embeddings. The idea of S-LAtt is to use the node embedding as the query to aggregate spatial patterns from nearby or similar nodes (i.e., locations), assuming they have similar data updating patterns. To better quantify such similarity, we learn a spatial information embedding (SIE) $\mathbf{v}_i$ for each node $i$, where the spatial information includes population, the numbers of hospitals and ICU beds, longitude, and latitude. For node $i$, the SIE is obtained via Eq. (5),

where $\mathbf{S}_i$ denotes the spatial information of node $i$. Since this spatial information is in general static, the same node in both GATs shares the same $\mathbf{S}_i$. Based on the node embedding and the SIE, we compute the cross-graph attention score as in Eq. (6),

where $j \in \mathcal{N}(i)$, and $\mathcal{N}(i)$ is the set of neighboring nodes of node $i$ in the location graph $G$.

In addition to spatial similarity, we also notice that the longer the latency is, the smaller the marginal influence the new data will have on our final prediction. We therefore use time latency to regularize this attention score. Specifically, we take the temporal latency $\Delta t$ between nodes $i$ and $j$ and design a heuristic function of this latency as in Eq. (7),

Then the attention weight is regularized as in Eq. (8),

and we can get the aggregated updated embedding as in Eq. (9).

Finally, we concatenate the node embedding $\mathbf{g}_i$, the aggregated updated embedding $\hat{\mathbf{g}}_i$ and the original input data $\mathbf{x}_i$ as in Eq. (10).
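The fusion step can be illustrated with a small sketch. It is not the paper's exact formulation: the decay `1 / (1 + delta_t)` stands in for the heuristic latency function, a plain dot product stands in for the learned cross-graph attention score (the SIE is omitted), and renormalising after the decay is likewise an assumption. The function name `latency_aware_fuse` is hypothetical. The sketch only shows the intended behaviour, namely that neighbors whose updates arrived with longer latency contribute less to the aggregated updated embedding.

```python
import math


def latency_aware_fuse(g_i, x_i, neighbour_updates):
    """Fuse one node's real-time embedding with its neighbours' updated ones.

    g_i: real-time embedding of the target node.
    x_i: original input features of the target node.
    neighbour_updates: list of (g_u_j, delta_t_j) pairs, i.e. the updated
        embedding of neighbour j and the latency of that update.
    """
    # Dot-product score between the query g_i and each updated embedding
    # (a stand-in for the learned attention score).
    scores = [sum(a * b for a, b in zip(g_i, g_u)) for g_u, _ in neighbour_updates]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    # Assumed heuristic: the longer the latency, the smaller the weight.
    weights = [w / (1.0 + dt) for w, (_, dt) in zip(weights, neighbour_updates)]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the neighbours' updated embeddings.
    g_hat = [
        sum(w * g_u[k] for w, (g_u, _) in zip(weights, neighbour_updates))
        for k in range(len(g_i))
    ]
    # Concatenate real-time embedding, aggregated updated embedding and input.
    return g_i + g_hat + x_i, weights


# Two neighbours with identical embeddings but different update latencies.
fused, weights = latency_aware_fuse(
    g_i=[1.0, 0.0],
    x_i=[3.0],
    neighbour_updates=[([1.0, 0.0], 1), ([1.0, 0.0], 5)],
)
```

With identical embeddings, the two neighbors differ only in latency, so the one updated after 1 step receives three times the weight of the one updated after 5 steps under the assumed decay.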

Figure 1: PopNet uses spatial latency-aware attention (S-LAtt) to fuse two graphs and generate a node embedding for each location. The spatial latency-aware attention is enriched by a spatial information embedding (SIE) learned from location-wise geographical and medical resource features. The node embeddings in the two graph networks are fed into two GRUs, respectively, to extract temporal relations. PopNet also uses temporal latency-aware attention (T-LAtt) to fuse temporal embeddings. Similarly, T-LAtt is enriched by temporal information embeddings (TIE), which adaptively embed the most informative multi-scale disease patterns to improve predictions. PopNet also aligns the hidden states of the GRUs with the learned TIEs to achieve efficient updates. Finally, PopNet outputs predictions using the fused temporal embedding.

In addition to spatial patterns, we also employ gated recurrent unit (GRU) networks [6] to extract temporal patterns from the multivariate time series of each node. To simplify, we focus our discussion on one location and omit the node index subscript. The real-time and updated embeddings are fed into GRUs as in Eq. (11),

We use the hidden states of the two GRUs as the temporal embeddings. Similarly, we design a temporal latency-aware attention mechanism (T-LAtt) to fuse the two embeddings and deal with the latency between them. As previously discussed, using temporal data updating patterns may benefit the predictions. This involves extracting complex temporal patterns, such as increasing or declining trends, at different time scales. We also consider the updating patterns of comorbidities of the target diseases. To extract and leverage these patterns, we enrich the T-LAtt with temporal information embeddings (TIE). First, to let the TIE capture temporal patterns at multiple time scales, we use dilated convolutional networks [24] with different dilation rates. Concretely, at each location, the input disease data sequence $\mathbf{X} = [\mathbf{x}^{1}, \mathbf{x}^{2}, \ldots, \mathbf{x}^{T}]$ is fed into the CNN as in Eq. (12),

where $*$ denotes the convolution operation and $m(l, d)$ is the 1D convolution filter with size $l$ and dilation rate $d$. The larger $d$ is, the larger the filter's receptive field, so the convolution filter extracts temporal patterns from a broader time scale. In our experiments, we use a combination of different dilation rates to extract patterns at different scales, from small to large. The feature maps are concatenated to get the final feature map vector $\mathbf{c} \in \mathbb{R}^{N_c}$, where $N_c$ denotes the number of convolution filters. Each value in $\mathbf{c}$ represents an extracted temporal feature. Following previous CNN-based models [12, 14, 21], we also select the most informative patterns in $\mathbf{c}$ based on attention weights. Here, we first apply mean pooling over the time dimension of $\mathbf{X}$ to obtain $\bar{\mathbf{x}} = \mathrm{MeanPool}(\mathbf{X})$, $\bar{\mathbf{x}} \in \mathbb{R}^{F}$, where $F$ is the number of diseases; $\bar{\mathbf{x}}$ can be regarded as a summary of the diseases. This vector is used to calculate the attention score for the temporal patterns as in Eq. (13),

where $\sigma$ denotes the sigmoid activation. We use a multi-layer perceptron for the mapping $\mathbb{R}^{F} \to \mathbb{R}^{N_c}$ and the sigmoid activation to generate importance scores between 0 and 1. The obtained score vector $\mathbf{a}$ is used to re-calibrate the feature map vector as in Eq. (14),

The obtained $\hat{\mathbf{c}}$ is the final temporal information embedding (TIE). We can obtain the TIE for the updated series in the same way. Similar to the spatial latency-aware attention, we use the TIE to enrich the attention and use the time latency between the current temporal embedding $\mathbf{h}$ and the sequence of historical updated temporal embeddings to regularize the attention score as in Eq. (15),

And the aggregated updated temporal embedding is given by Eq. (16),

Finally, we concatenate the aggregated updated temporal embedding $\hat{\mathbf{h}}$, the original temporal embedding and the TIE to calculate the final temporal embedding as in Eq. (17),

Model update challenge: In clinical practice, an online population-level disease prediction model needs routine updates when new data become available. Updating a model with new data usually requires either retraining, which becomes more time-consuming as the data sequence grows, or directly fine-tuning on the new data, which discards historical patterns. Some approaches also initialize a model on new data using the last hidden state of an RNN trained on old data. However, this solution assumes there is no large time gap between the two datasets. Otherwise, capturing the continuous behavior of the system becomes more difficult for the model and leads to even worse performance [22].

Our Solution. PopNet introduces an alignment module to address this issue by providing a better initialization for the hidden states of the RNN on the new data without assuming data continuity. This is achieved by learning a mapping function between the TIEs and the hidden states of the RNNs at each timestep. When applied to new data, directly using the last hidden states is not optimal because the new disease patterns may be different. Convolutional features, however, can provide a better initialization for the RNN since they are not strictly sequence-dependent. Concretely, we first use a mapping function $f_{\theta}$, parameterized by $\theta$, to map the learned TIE to another latent space as in Eq. (18),

Then we calculate the probability distributions of the mapped TIE embedding and the current hidden state of the RNN using the softmax function as in Eq. (19),

Then we define the alignment loss between the two distributions using the Kullback-Leibler (KL) divergence as in Eq. (20),

Note that we use the KL divergence because its asymmetry naturally fits our design: we expect the loss term to help the model learn a close estimate of the hidden state from the TIE. Besides, since the dimensionality of the two embeddings is large, using KL divergence instead of L1 or L2 distance also helps prevent the learned mapping function from simply mapping the embeddings to random normal distributions. When applied to new data, the model first calculates the TIE using the entire sequence and then uses the mapping function to provide the initialization for the hidden states of the RNN. The detailed algorithm is shown in Alg. 1.
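The alignment loss can be sketched as a KL divergence between two softmax-normalised vectors. The direction of the divergence used below, with the hidden-state distribution as the target and the mapped TIE as the approximation, is an assumption based on the stated goal of estimating the hidden state from the TIE; the function names are illustrative and the mapping function itself is omitted.

```python
import math


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(hidden_state, mapped_tie):
    # Assumed direction: the hidden-state distribution is the target and the
    # mapped TIE is the approximation, i.e. KL(P(h) || P(f(c))).
    return kl_divergence(softmax(hidden_state), softmax(mapped_tie))


# A perfectly aligned pair gives zero loss; a mismatched pair does not.
loss_same = alignment_loss([0.2, 1.0, -0.5], [0.2, 1.0, -0.5])
loss_diff = alignment_loss([2.0, 0.0, 0.0], [0.0, 0.0, 0.0])
loss_swapped = alignment_loss([0.0, 0.0, 0.0], [2.0, 0.0, 0.0])
```

Swapping the two arguments changes the value of the loss, which illustrates the asymmetry mentioned in the text.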

Finally, we use a two-layer perceptron to generate the prediction $\hat{y}^{T+1}$ from the final temporal embedding. We also let the updated-data branch make its own prediction from its hidden state. Note that this second prediction is for the step after the update point, so it may be earlier than the current timestep $T$. However, we use this as an auxiliary task to better optimize the GRU and GAT. At testing time, only $\hat{y}^{T+1}$ is the model output. We use mean squared error as the loss function as in Eq. (21),

We finally optimize the entire model using Eq. (22).

We evaluate PopNet by comparing against several spatial-temporal prediction and disease prediction baselines using real-world datasets.

Data. We extract disease statistics from patients' claims data in a real-world patient database from IQVIA. The patients' claims data are collected from 2952 counties in the US starting from 2018. We aggregate the ICD-10 codes in claims data into 21 categories, which include 17 diseases and 4 other codes (see detailed category descriptions in the Appendix). We use week-level statistics. Since patients' claims data cannot be completely collected at one time, the disease statistics for a given week are updated over several subsequent weeks. We use the statistics collected in the first week as the real-time data and the data collected in later weeks as the updated data. We conduct experiments to predict two diseases:

(1) Respiratory Disease Dataset: Respiratory diseases include ICD-10 codes J00-J99; they are common and most of them are contagious. The number of cases is larger than for most other diseases, so the claims data collection procedure is also longer: the disease statistics for one week can take up to 13 following weeks to be fully collected. We filter out locations that have very few cases (fewer than 100), leaving 1,693 counties for respiratory disease prediction.

(2) Tumors Dataset: Tumors include ICD-10 codes C00-D49. Compared to respiratory diseases, tumors have fewer cases, and the data update period is also shorter: most statistics for one week are fully collected within the following 7 weeks. We also filter out locations with very few cases (fewer than 10), leaving 1,829 counties for tumor prediction.

In addition, we conduct experiments on 15 other diseases and a synthetic dataset generated based on real-world disease update distributions. The detailed statistics and the results can be found in the Appendix. The code and the synthetic dataset are publicly available (see footnote 1).

Baselines. We evaluate PopNet against the following spatio-temporal prediction baselines: SARIMAX, GRU [6], ASTGCN [13], GMAN [39], EvolveGCN, ColaGNN [8] and STAN [9]. Detailed descriptions of the baselines can be found in the Appendix.

We also compare PopNet with the following reduced versions as an ablation study.

(1) PopNet-LAtt: We remove both the S-LAtt and T-LAtt mechanisms from PopNet. PopNet-LAtt is essentially two branches that receive real-time and updated data independently, and the outputs of the two networks are concatenated to make final predictions.

(2) PopNet-SLAtt: We remove only the spatial latency-aware attention from PopNet.

(3) PopNet-TLAtt: We remove only the temporal latency-aware attention from PopNet.

(4) PopNet-L: We remove the alignment module and the alignment loss term from PopNet. Since this term only relates to iterative training, this variant is evaluated only in the Q3 section. PopNet-L simply uses normal initialization for the RNN hidden states.

Metrics. Following similar work [9, 33], we use the following regression metrics to evaluate all the models. The root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) measure the difference between predicted values and true values:

All metrics are calculated after projecting the values back into their real range.
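For reference, the standard definitions of these three metrics can be written as the short sketch below. The function names are illustrative, and skipping zero-valued targets in MAPE is an assumption of this sketch, not a handling stated in the paper.

```python
import math


def rmse(y_true, y_pred):
    # Root of the mean squared difference.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    # Mean of the absolute differences.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    # MAPE is undefined when a true value is zero; such entries are skipped
    # here, which is an assumption of this sketch.
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


# Toy case counts for three locations, already on the original scale.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```

Because RMSE squares each error before averaging, it penalises large errors at individual locations more heavily than MAE, while MAPE expresses the error relative to the size of each location's true count.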

Evaluation Strategy. We split the data into training, validation, and testing sets. 

The prediction results on respiratory disease and tumors are listed in Table 1 . We also conduct two-tailed student's T-test of MAE between PopNet and other baseline models to test the significance of performance improvement. The p-values are also in Table 1 baseline models for all disease categories, which indicates the potential broader utility of PopNet. It is worth noting that for some disease codes, PopNet achieves much better performance than baseline models, for example, musculoskeletal disease or pregnancy prediction. These data often receive more frequent updates due to the large number of cases or the particularity. Therefore, PopNet achieves better performance since it can better extract and utilize update patterns.

This section further explores the performance of PopNet on different locations. We report the number of locations on which each model has the best performance in Table 2. Here 'Others' sums up the results of SARIMAX, GMAN, GRU, ASTGCN and EvolveGCN. PopNet achieves the best performance on over 90% of locations for both tasks. For respiratory diseases, PopNet improves MAPE by 12.4% on average over the best baseline. For tumor prediction, PopNet achieves the best performance on 1645 locations, and the average MAPE improvement is 10.7%. Even for the locations where baselines perform best, the MAPE gap is small.

To evaluate whether the iterative training of PopNet improves the efficiency of model updating, we simulate the following deployment setting: train on the original dataset (weeks 1-50) and deploy into practice (weeks 50-80); then refresh the model using data newly collected during the deployment phase (weeks 60-80); finally, redeploy the model to test the performance (weeks 80-100). The entire splitting process is shown in Fig. 3. We report the performance on the final testing phase in Table 3. PopNet outperforms baselines, achieving 39% lower RMSE, 34% lower MAE, and 25.6% lower MAPE on respiratory disease prediction, and 70% lower RMSE and 49% lower MAE on tumor prediction, compared with the best baseline. Compared with the reduced model PopNet-L, we can see that the alignment loss indeed helps improve predictive performance by providing the RNN with a better initialization.

In Fig. 4, we compare the iterative training results with those of models trained on the entire sequence (results in Table 1). The results reported in Table 3 (yellow bars) and in Table 1 (green bars) are obtained using the same test set. The figure shows that models trained on the entire dataset achieve lower MAE because they can access the entire historical data and therefore extract and use more historical patterns. However, compared to all baselines and the reduced model, the performance gap of PopNet is much smaller. Relative to training on entire sequences, PopNet under the iterative training setting has 23% higher MAE on respiratory disease prediction and only 3% higher MAE on tumor prediction. In comparison, the best baseline model has 41% higher MAE on respiratory disease prediction and 53% higher MAE on tumor prediction under iterative training. This indicates that by aligning the TIE and the hidden states of the RNN, the model can indeed use historical patterns without accessing the original sequences.

Since the entire sequence is four times longer than the iterative training data, training on the entire dataset can be costly in time and memory. For example, for tumor prediction, training a regular spatial-temporal model on the entire dataset generally requires more than 12 GB of memory and about 5 seconds per epoch on average, whereas training the same model on the iterative dataset requires less than 4 GB of memory and 2 seconds per epoch. PopNet can achieve almost equivalent performance using just the iterative training data, which is useful for real-world applications and efficient for long-sequence data.

Long-term prediction is also important for disease prediction in practice. In this work, we also explore the capability of PopNet for long-term disease prediction. We change the output size so that PopNet and the baseline models predict the next 5 weeks. We report the performance and the p-value of MAE in Table 4.

The results show that the MAE rises as the prediction window increases since it becomes more difficult to predict longer future trends. However, PopNet still has the lowest MAE increase ratio.

For respiratory diseases, the MAE of PopNet increases by 14% as the prediction window length increases from 1 to 5, while that of the baseline model STAN increases by 24% and that of ColaGNN by 31%. For tumors, the MAE of STAN and ASTGCN increases by 11% and 24%, respectively, while that of PopNet increases by only 6%. The results show that PopNet consistently outperforms all other baseline models under different prediction window lengths, and that a longer prediction window has less effect on the predictive performance of PopNet. Compared to baseline models, PopNet achieves a significantly smaller error gap as the prediction window increases on both diseases. This indicates that PopNet is also suitable for long-term prediction tasks.

In this work, we propose PopNet for real-time population-level disease prediction that accounts for data latency. PopNet uses two separate systems to model real-time and updated disease statistics data, and then adaptively fuses the two systems using both spatial and temporal latency-aware cross-graph attention. We augment the latency-aware attention with spatial and temporal information embeddings to adaptively extract and use geographical and temporal progression features. We also conducted extensive experiments across multiple real-world claims datasets. PopNet outperforms leading spatial-temporal models on all metrics and shows promising utility and efficacy in population-level disease prediction.

In future work, we will use a more flexible way to generate a better location graph instead of using hard-defined edge weights, which is the major limitation of this work.

This work was supported by IQVIA, NSF award SCH-2014438, PPoSS 2028839, IIS-1838042, NIH award R01 1R01NS107291-01 and OSF Healthcare.

In this section, we report the basic statistics and disease categories in the real-world claims dataset. The patients' claims data are collected from 2952 counties in the US starting from 2018. We aggregate the codes into 17 disease categories (A00-Q99) and 4 other categories (R00-Z99) according to the ICD-10 coding. The detailed disease category is reported in Table 5 . Due to space limitations, we only report the detailed data statistics of two diseases (i.e., tumors and respiratory diseases) reported in the main text. The statistics are shown in Table 6 . The spatial features include populations, number of hospitals, number of ICU beds, longitude, latitude and annual income.

We construct a synthetic dataset from a real-world disease dataset. We first randomly aggregate the data in the real-world dataset from different locations to generate data sequences for the real-time data and prediction targets. For each location, we randomly aggregate data from 1-5 neighboring locations. Then for each timestep, we use up-sampling and down-sampling to aggregate data from 1-3 consecutive timesteps while keeping the length of the data unchanged. Then we add random Gaussian noise ($\mu = 0$, $\sigma = 1$) to the aggregated data. For the updated data, we assume all locations are updated at regular intervals for the sake of simplicity. We use the 

We also conducted experiments on the synthetic dataset and report the results in Table 9. PopNet outperforms all baselines at the 0.001 significance level. Compared with the best baseline, ColaGNN, PopNet has 19.5% lower RMSE, 15.5% lower MAE, and 4% lower MAPE. The SARIMAX model does not perform well on the synthetic dataset, since autoregressive models have difficulty fitting the random noise in the data.
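For reference, the three error metrics and the "percent lower" comparison can be computed as in the sketch below. It uses the standard definitions of RMSE, MAE, and MAPE; the function names are hypothetical, and `mape` assumes every target value is non-zero.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be non-zero)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_error, model_error):
    """Percent by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error
```

As an illustration with made-up numbers, a baseline error of 2.0 and a model error of 1.5 give `relative_reduction(2.0, 1.5) == 25.0`, i.e. a 25% lower error.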

All methods are implemented in PyTorch [26] and trained on an Ubuntu 16.04 machine with 64GB memory and a Tesla V100 GPU. We use the Adam optimizer [18] with a learning rate of 0.001 and train for 200 epochs. For the hyper-parameter settings of each baseline model, our principle is as follows: for each hyper-parameter, we use the recommended setting if one is available in the original paper. Otherwise, we determine its value by grid search on the validation set.
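The grid search on the validation set can be sketched as an exhaustive loop over all value combinations. This is a generic illustration rather than the authors' tuning code: `grid_search` and `validation_error` are hypothetical names, and `validation_error` stands in for training a model with the given configuration and returning its error on the validation set.

```python
from itertools import product


def grid_search(grid, validation_error):
    """Return the configuration with the lowest validation error.

    grid:             dict mapping a hyper-parameter name to candidate values
    validation_error: callable taking a config dict and returning a float
                      (hypothetical; in practice it trains and evaluates a model)
    """
    names = sorted(grid)
    best_cfg, best_err = None, float("inf")
    # Try every combination of candidate values.
    for values in product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        err = validation_error(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err


# Toy usage: a stand-in objective instead of real model training.
toy_grid = {"hidden_units": [64, 128, 256], "dropout": [0.3, 0.5]}
toy_error = lambda cfg: abs(cfg["hidden_units"] - 128) + abs(cfg["dropout"] - 0.5)
best_cfg, best_err = grid_search(toy_grid, toy_error)
```

The number of evaluations grows multiplicatively with the number of hyper-parameters, which is one reason to fall back on recommended settings where the original papers provide them.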

• SARIMAX stands for seasonal autoregressive integrated moving average with exogenous regressors, a popular time series prediction model. SARIMAX considers seasonal influence together with exogenous variables, making it more suitable for our disease prediction task. We use grid search to determine the hyper-parameters of the model at each location.
• GRU conducts temporal prediction without considering spatial relationships; locations are regarded as independent samples when training the model. The number of hidden units of the GRU cell is set to 128.
• GMAN is a recently published spatial-temporal prediction model for traffic prediction. It uses an encoder-decoder structure with spatial-temporal attention to predict future traffic status. The number of attention blocks is set to 3, the dimensionality of each attention head is set to 64, and the number of attention heads is set to 4.
• ASTGCN is a recently published spatial-temporal prediction model for traffic prediction. It applies additional convolutional layers and attention mechanisms on top of GCN. The number of convolutional kernels is set to 64, and the kernel size is set to 3.
• EvolveGCN is a general spatial-temporal prediction model. It adapts the graph convolutional network (GCN) model along the temporal dimension without resorting to node embeddings and uses an RNN to evolve the GCN parameters. The number of hidden units of the GRU cell is set to 128, and the dimensionality of the GNN is set to 64.
• STAN is a hybrid deep learning and epidemiology spatial-temporal model for epidemic and pandemic prediction. STAN also constructs a location graph based on geographic similarity and uses a graph attention network and an RNN to predict future cases. Since we do not constrain the prediction target to be an infectious disease, we remove the disease transmission dynamics regularization in STAN. The dimensionality of the GAT is set to 64 for respiratory disease and tumor prediction, and to 128 for the synthetic dataset. The number of hidden units of the GRU cell is set to 128.
• ColaGNN is a spatial-temporal pandemic prediction model, which uses a location graph to extract spatial relationships for predicting pandemics. The number of hidden units of the GRU cell is set to 128. The number of convolutional kernels is set to 64, and the kernel size is set to 3.
• PopNet. The , , is set to 0.35, 0.37, 30. We set the number of kernels of the convolutional layers to 16 and the kernel size to 3. We use a set of dilation rates [1, 3, 5]. The dimensionality of the GAT layer and attention head is set to 32. We use 2 attention heads for respiratory disease and tumor prediction, and 1 for the synthetic dataset. The number of hidden units of the GRU is set to 256. The dimensionality of the MLP is set to 128. We also use a dropout layer [31] before the output layer to prevent overfitting. The dropout rate is set to 0.5.
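To give a sense of what a kernel size of 3 with dilation rates [1, 3, 5] covers along the time axis, the sketch below computes the temporal span of each dilated convolution. This section does not state whether the dilated convolutions run as parallel branches or are stacked, so both cases are shown; the function names are hypothetical and stride 1 is assumed.

```python
def branch_span(kernel_size, dilation):
    """Timesteps covered by one dilated 1-D convolution (stride 1)."""
    return (kernel_size - 1) * dilation + 1


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field if the dilated convolutions were stacked in sequence."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


dilation_rates = [1, 3, 5]
spans = [branch_span(3, d) for d in dilation_rates]  # [3, 7, 11]
stacked = stacked_receptive_field(3, dilation_rates)  # 19
```

If the three convolutions are applied in parallel, each branch sees 3, 7, and 11 timesteps respectively, which captures short- and longer-range patterns side by side. If they were stacked, the combined receptive field would be 19 timesteps.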

Spectral temporal graph neural network for multivariate time-series forecasting

Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference

GRAM: graph-based attention model for healthcare representation learning

Using recurrent neural network models for early detection of heart failure onset

Mime: Multilevel medical embedding of electronic health records for predictive healthcare

Empirical evaluation of gated recurrent neural networks on sequence modeling

Impact of reporting delay and reporting error on cancer incidence rates and trends

Cola-GNN: Cross-location Attention based Graph Neural Networks for Long-term ILI Prediction

Stan: Spatio-temporal attention network for pandemic prediction using real world evidence

Camp: Co-attention memory networks for diagnosis prediction in healthcare

Dr. Agent: Clinical predictive model via mimicked second opinions

StageNet: Stage-Aware Neural Networks for Health Risk Prediction

Attention based spatial-temporal graph convolutional networks for traffic flow forecasting

Squeeze-and-excitation networks

LSGCN: Long Short-Term Traffic Prediction with Graph Convolutional Networks

Examining covid-19 forecasting using spatio-temporal graph neural networks

A contribution to the mathematical theory of epidemics

Adam: A method for stochastic optimization

Predicting dynamic embedding trajectory in temporal interaction networks

Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks

Adacare: Explainable clinical health status representation learning via scale-adaptive feature extraction and recalibration

State initialization for recurrent neural network modeling of time-series data

Global biogeography of human infectious diseases

Wavenet: A generative model for raw audio

Evolvegcn: Evolving graph convolutional networks for dynamic graphs

Pytorch: An imperative style, high-performance deep learning library

Initial Simulation of SARS-CoV2 Spread and Intervention Effects in the Continental US

CPAS: the UK's national machine learning-based hospital capacity planning system for COVID-19

Mathematical modeling of infectious disease dynamics

An improved approach to accounting for reporting delay in case surveillance systems

Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research

Dyrep: Learning representations over dynamic graphs

Graph attention networks

Understanding reporting delay in general insurance

Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks

Inductive representation learning on temporal graphs

Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions

Modeling spatial-temporal dynamics for traffic prediction

Gman: A graph multi-attention network for traffic prediction