key: cord-0696544-06gw3sfp authors: Er, Siawpeng; Yang, Shihao; Zhao, Tuo title: COUnty aggRegation mixup AuGmEntation (COURAGE) COVID-19 prediction date: 2021-07-12 journal: Sci Rep DOI: 10.1038/s41598-021-93545-6 sha: 64aa85b6ad67373fdf657da536c3948f4efbfeed doc_id: 696544 cord_uid: 06gw3sfp

The global spread of COVID-19, the disease caused by the novel coronavirus SARS-CoV-2, has posed a significant threat to mankind. As the COVID-19 situation continues to evolve, predicting localized disease severity is crucial for advance resource allocation. This paper proposes a method named COURAGE (COUnty aggRegation mixup AuGmEntation) to generate a short-term prediction of 2-week-ahead COVID-19 related deaths for each county in the United States, leveraging modern deep learning techniques. Specifically, our method adopts a self-attention model from Natural Language Processing, known as the transformer model, to capture both short-term and long-term dependencies within the time series while remaining computationally efficient. Our model relies solely on publicly available information on COVID-19 related confirmed cases, deaths, community mobility trends and demographics, and can produce state-level predictions as an aggregation of the corresponding county-level predictions. Our numerical experiments demonstrate that our model achieves state-of-the-art performance among the publicly available benchmark models.

Epidemic prediction is a time series prediction problem. Given a sequence of time points with corresponding data, a model needs to predict the target incidence at a future time. There are four main classes of predictive models for epidemic prediction: compartmental models, simulation models, statistical models, and deep learning models. Moreover, within the same class of models, the final prediction may be an ensemble of several other models. In fact, on the CDC website, the final CDC prediction is obtained by ensembling all the submitted models 18.
• The compartmental model is one of the most widely used model types for epidemic diseases. It characterizes the disease spread dynamics using systems of ordinary differential equations. One of the most successful compartmental models is the SIR model, which is used to predict disease progression within the population of one area. In the SIR 25,26 model, each member of the population is assigned to the Susceptible (S), Infectious (I), or Recovered (R) compartment. One variant of the SIR model is the SEIR model [27] [28] [29] [30], which adds an Exposed (E) compartment. Other variants include the SIRD 31 model, which adds a Deceased (D) compartment. Transitions from one compartment to another (i.e., the disease spreading dynamics) are modeled as differential equations, often in the form of a transition matrix. Compartmental models are hard to apply widely because the parameters of each differential equation are difficult to determine 32. One highly successful compartmental model for COVID-19 predictions is from Karlen's group 33. Karlen's model uses discrete-time difference equations instead of ordinary differential equations to model the transition matrix.
• Simulation modeling uses computer simulation to model different components in the studied environment and observe their interactions. Cellular automata and agent-based simulation are two simulation modeling techniques used to model complex systems 34. In COVID-19 prediction, several groups 13, 14 use agent-based simulation because of its flexibility in simulating the dynamic behaviour of systems with a large number of entities. While highly flexible, agent-based simulation requires extensive computational resources. In addition, multiple simulation runs are needed for statistically sound observations, resulting in longer inference time. This limitation is most noticeable when the simulated system involves a large number of individual entities.
• Conventional statistical models use regression methods to fit the data directly. Such models include ARIMA, Gaussian process regression, and linear regression, which are more flexible than compartmental models. However, the use of statistical models is often limited by the need for sophisticated hand-crafted features, which usually requires knowledge from domain experts. For example, the CLEP model 11 [...]
[...] capability, such models need less sophisticated hand-crafted preprocessing of the input data. In time series prediction problems, common deep learning models include Long short-term memory (LSTM) 24, Gated Recurrent Unit (GRU) 35, and the transformer 36, 37. All of the above models can capture intrinsic information from sequential data for accurate prediction. One limitation of such models is the need for large training data. Concurrently with our work, other deep learning models, including those from Refs. 16, 17, utilize the attention mechanism from the transformer architecture.

Different models have their merits for different date ranges. At the beginning of the COVID-19 outbreak, because of the limited data available, predictions came mainly from compartmental or statistical models. With more data available, deep learning models are showing their advantages in model flexibility and prediction accuracy. It is also more challenging for a model to make accurate predictions as the prediction granularity becomes finer. Intuitively, errors are more likely to accumulate when a model is tasked with predicting many targets rather than only a few. However, finer prediction granularity, such as prediction at the county level, is highly desirable because local predictions help policymakers tailor policies to individual counties.

Our contribution.
In this paper, our goal is to predict the weekly total number of deaths at both the county level and the state level for the next 2 weeks, given the current week's data. Each single-day record includes the number of confirmed cases, the number of deaths, community mobility, and population. Our novelty is to connect established Natural Language Processing techniques with COVID-19 time series prediction. In particular, we build a self-attention model, also known as the transformer model in Natural Language Processing, that is able to capture both the short-term and long-term dependencies within the time series input data. Our model is built upon two main ideas. First, by aggregating accurate predictions from the county level to the state level, our model shows strong performance on the state-level prediction task in all prediction periods. Second, we further improve our model with data augmentation. We implement mixup 23 as our data augmentation method at the input layer. To the best of our knowledge, this is the first application of mixup data augmentation to COVID-19 data. Data augmentation further improves our model performance, and is particularly helpful when new trends emerge. Using these two core ideas, our proposed method, named COURAGE (COUnty aggRegation mixup AuGmEntation), is able to generate a short-term prediction of 2-week-ahead COVID-19 related deaths for each county and state in the United States. When compared with other benchmark models, COURAGE shows strong performance across different periods, demonstrating its strength and usability for predicting the number of COVID-19 related deaths.

To evaluate our proposed method, we compare county-level and state-level predictions on the county-level and state-level testing sets, using mean absolute error (MAE) as our comparison metric.
For county-level predictions, we compare the predictions from our COURAGE model with those of its two member models, the County model and the Mixup model. We also compare our model with the baseline Naive model, which simply uses the previous week's reported total number of deaths as the prediction. For state-level predictions, besides the previously mentioned models, we have an additional baseline State model, which is a COURAGE prediction generated solely from the state-level input data, without any county-level information or mixup data augmentation. We compare the predictions according to the training period used, corresponding to 0.5, 0.6, 0.7, and 0.8 of the total dataset. We also present the performance of each model across multiple non-overlapping periods. Finally, we compare our models with the available models contributed to the CDC forecast website.

Comparison among models. We summarize the comparison among our models in Table 1. We emphasize that our COURAGE model is an ensemble model that consists of two separate models: the County model and the Mixup model. Those two can also serve as standalone models. We use the Naive model (a.k.a. persistence forecast model, which uses the previous week's reported number of deaths as the forecast) as a baseline model in the county-level prediction task. We use both the Naive model and the State model (prediction using only state-level inputs) as baseline models for the state-level prediction task. The County model and the Mixup model produce more accurate predictions than the Naive model in the county-level prediction task. By combining these two models, we improve our COURAGE model's prediction accuracy. We also get accurate predictions from the County model and the Mixup model in the state-level prediction task. They outperform both the Naive model and the State model.
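The Naive baseline and the MAE metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the sample numbers are hypothetical, and reusing the last observed week for both Week 1 and Week 2 is an assumption about how the persistence forecast extends to the second week.

```python
from typing import List, Sequence


def naive_forecast(weekly_deaths: Sequence[float], horizon: int = 2) -> List[float]:
    """Persistence forecast: repeat the last observed weekly total.

    Assumption: the same value is used for every week in the horizon.
    """
    if not weekly_deaths:
        raise ValueError("need at least one observed week")
    return [float(weekly_deaths[-1])] * horizon


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical weekly death totals for one county (illustrative numbers only).
history = [12, 15, 14]
actual_next_two_weeks = [16, 20]

prediction = naive_forecast(history)
error = mean_absolute_error(actual_next_two_weeks, prediction)
```

Any learned model is only useful here if its MAE falls below the value this baseline produces on the same testing set.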
When combining both models, COURAGE has the best accuracy across the different prediction periods, except for the Week 2 predictions when using the training dataset dated from March 7, 2020, to December 1, 2020. We observe that the County model and the Mixup model have their strengths in different periods. Our COURAGE model often obtains the best of both models and produces the most accurate predictions.

Predictions for different periods. We show the number-of-deaths predictions for different periods in Table 2. Since we use different amounts of data as our training set, our training data cover different periods. We take the non-overlapping period from each testing set as a separate out-of-sample prediction period. In all the periods, the County and Mixup models produce better predictions than the Naive model. This result shows the feasibility of both models. The strength of the Mixup model is visible in the last two prediction periods. In the last prediction period, we use our trained model (using the training set with data from 2020-03-07 to 2020-12-01) to predict the latest data, with a prediction period from 2021-01-18 to 2021-03-14. Our model is able to predict accurately for the new data, as illustrated by the plot of predictions for New York in Fig. 1 and additional plots for other major states (Illinois, California, Texas, Arizona) in Supplementary Information. In Fig. 1, light grey lines mark the same prediction date ranges as in Table 2. In the first two periods, the number of deaths is relatively low and stable. There is an increase in the number of deaths in the third period. Finally, we can see a large increase or decrease in the number of deaths in the last two prediction periods. In scenarios with relatively stable trends, the County model provides better predictions than the Naive [...] Table 3.
Then we rank each model using the average MAE obtained for both Week 1 and Week 2 predictions. Because the list of models is long, we only show results from the top 10 models in both tables. From Table 3, we can observe that our County model produces competitive accuracy for the number-of-deaths prediction. Our County model is a top 10 model in terms of prediction accuracy. When we augment our dataset, our Mixup model further improves the prediction accuracy. We wish to examine in which period mixup data augmentation contributes the most to model accuracy. Mixup data augmentation improves our model when a new trend emerges and helps our model achieve better predictions in the last period, as illustrated in Table S1 of Supplementary Information. By combining strengths from the County and Mixup models, our COURAGE model provides a good balance across different prediction periods.

From our results, we see that our COURAGE model gains its strengths from both member models, the County model and the Mixup model. The two core ideas associated with these member models are the aggregation of high-quality county predictions (County model) and data augmentation (Mixup model).

Figure 1. [...] Table 2. The last dashed vertical line marks the prediction period of recent data using our last trained model. "Target" is the true reported number of deaths in New York. More plots for other major states are presented in Supplementary Information.

The aggregation of high-quality county predictions results in high-quality state predictions. Table 1 shows high-quality county-level predictions from the County and Mixup methods. This result shows that our model and training are effective in extracting and utilizing the information from the input data. When we aggregate these accurate county-level predictions, our models can produce state-level predictions that outperform the baseline models.
As an illustrative comparison, the State model (baseline model) uses only state-level data as input. Such input is limited and coarser-grained, making it harder for the State model to produce accurate predictions. As a result, the State model performs worse than both our model and the Naive model. One way to refine state-level prediction is through the use of county-level data. By training our models on the larger county-level dataset, we obtain accurate county-level predictions. When we sum these high-quality county-level predictions, we obtain accurate state-level predictions. The state-level results in Table 1 support this intuition: our model performs strongly.

The data augmentation from the Mixup method also plays a significant role in improving the accuracy of our predictions. Because of the fast-evolving dynamics of COVID-19, most models that submit their predictions to the CDC have high accuracy only for a certain period. Existing trends are much easier to fit than emerging trends, and changes to an existing trend are hard to predict because little data is available at the onset of a change. The fundamental reason for prediction deterioration is the lack of visibility of the new trend given the limited data available. If we could increase the amount of available data by incorporating new data, predictive models could be improved when a new trend emerges. However, data generation is hard for any new trend. One way to improve the model's prediction on emerging trends is to improve its generalization with augmented input data. Our Mixup model uses a well-proven data augmentation technique to create new input data. Such training helps our model achieve accurate predictions when unseen data emerge.

By combining the strengths of each member model (the County model and the Mixup model), we obtain the COURAGE model. Our COURAGE model obtains its prediction by averaging the County prediction and the Mixup prediction.
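The two steps described above, summing county predictions into a state prediction and averaging the two member models, can be sketched as follows. This is an illustrative sketch under stated assumptions: the dictionary layout keyed by `(state, county)`, the function names, and all numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Tuple

CountyKey = Tuple[str, str]  # (state, county); a hypothetical key layout


def ensemble_average(pred_a: Dict[CountyKey, float],
                     pred_b: Dict[CountyKey, float]) -> Dict[CountyKey, float]:
    """Average two member-model predictions county by county."""
    if pred_a.keys() != pred_b.keys():
        raise ValueError("both models must predict the same counties")
    return {key: 0.5 * (pred_a[key] + pred_b[key]) for key in pred_a}


def aggregate_to_state(county_pred: Dict[CountyKey, float]) -> Dict[str, float]:
    """Sum county-level predictions into one total per state."""
    totals: Dict[str, float] = defaultdict(float)
    for (state, _county), value in county_pred.items():
        totals[state] += value
    return dict(totals)


# Illustrative weekly death predictions from the two member models.
county_model = {("NY", "Kings"): 30.0, ("NY", "Queens"): 20.0, ("IL", "Cook"): 40.0}
mixup_model = {("NY", "Kings"): 34.0, ("NY", "Queens"): 18.0, ("IL", "Cook"): 44.0}

courage_county = ensemble_average(county_model, mixup_model)
courage_state = aggregate_to_state(courage_county)
```

Because both operations are linear, averaging first and summing second gives the same state totals as summing each member model first and averaging afterwards.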
This simple ensemble method allows our COURAGE model to reduce the variance across its member models and achieve a balanced model.

While our COURAGE model shows strong results, one limitation of our current model is that the self-attention matrix from the encoder is not easily translated into an explainable pattern. Our model's predictions use information from the confirmed cases, deaths, population, and mobility data. Their interaction is encoded by the encoder, with self-attention as a key mechanism. Hence, it would be beneficial to the research community if we could identify important relationships among these inputs through the attention matrix. In addition, our model is a relatively concise model that leverages temporal data. In future work, we plan to include spatial information such as interactions among counties or major cities. We expect that the inclusion of such geographical information would further improve our model's predictions. Currently, our model makes predictions 2 weeks ahead. We plan to extend the horizon to up to 4 weeks ahead, and also to generate a probabilistic forecast that explicitly accounts for forecast confidence.

In this section, we present the data sources and data processing used in this paper. We also present details of our transformer-based model, the mixup data augmentation technique, and the training procedure.

Data sources. We use three comprehensive datasets in this study, including confirmed cases, deaths, population, and community mobility from two sources. We only focus on states in the mainland of the United States and do not consider Hawaii, Alaska, and other unincorporated territories in this paper. We use data from 47 states and 3206 counties.

Confirmed cases and deaths of COVID-19. We use data from the JHU CSSE Covid-19 dataset 6. This publicly available dataset is curated from different sources. The data used were collected from January 22, 2020, to February 7, 2021.
We use confirmed cases and deaths from the dataset for every county in the 47 targeted states.

Community mobility. Mobility data have been shown to help the risk analysis of COVID-19 and to enhance the predictions of COVID-19 related deaths and cases 21, [52] [53] [54] [55] [56]. In the current article, we use Google mobility data 8 to incorporate mobility information in our model. These reports record a community's daily movement by county for different categories of places: retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. There is a baseline value for every day of the week, which represents the usual community mobility value. The baseline value is the median of the data from January 3, 2020, to February 6, 2020. Google mobility data report the movement patterns of the US community. Each day's data are expressed as the percentage change from the baseline value. A negative value for a particular day indicates a decrease in time spent at that category of place relative to the baseline day, while a positive value indicates an increase in mobility relative to the baseline day.

Demographic information. We use population information from the JHU CSSE Covid-19 dataset 6, which includes population data for each corresponding county.

Data preparation. We retrieve the features required for our model training, including the number of confirmed cases, the number of deaths, population, and mobility information, from our datasets. We consolidate all input data into state-level and county-level datasets. We also include smoothed (average over the past 7 days) confirmed cases and deaths as input features. We separate the total data into training and testing datasets for each corresponding level. To test our method for different periods, we try different amounts (50%, 60%, 70%, and 80%) of the total data as our training dataset, leaving the remaining data as our testing dataset.
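The smoothing and train/test split described above can be sketched in plain Python. This is a minimal sketch, not the authors' pipeline: the function names and numbers are hypothetical, and two details are assumptions, namely that the first six days are averaged over however many days are available, and that the split is chronological with the earliest fraction used for training.

```python
from typing import List, Sequence, Tuple


def trailing_mean(values: Sequence[float], window: int = 7) -> List[float]:
    """Average each day with up to `window - 1` preceding days.

    Assumption: early days use a shorter window instead of being dropped.
    """
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def chronological_split(rows: Sequence[float],
                        train_fraction: float) -> Tuple[List[float], List[float]]:
    """Use the earliest `train_fraction` of rows for training, the rest for testing."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return list(rows[:cut]), list(rows[cut:])


# Illustrative daily death counts for one county (not real data).
daily_deaths = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

smoothed = trailing_mean(daily_deaths)
train, test = chronological_split(smoothed, train_fraction=0.8)
```

Splitting by time rather than at random keeps every testing day strictly later than every training day, which matches the out-of-sample evaluation periods reported in the paper.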
Since we use multiple features as inputs, we standardize the inputs to accommodate differences in scale. We also use additional data from February 8, 2021, to March [...] using 80% of the data in order to examine whether our trained model can continue to perform well for a new period without additional training.

Transformer-based model. The prediction of the COVID-19 related number of deaths given a sequence of inputs is a time series modeling problem. In a typical time series prediction setting, given a sequence of previous days' numbers of deaths, the aim is to predict the number of deaths for the next day. Our prediction problem differs from this typical setting. In the current article, our input consists of the current week's number of deaths, number of confirmed cases, smoothed (average over 7 days) number of confirmed cases, smoothed number of deaths, community mobility data, and population of the area. More importantly, instead of predicting the daily number of deaths, our model predicts the weekly total number of deaths for the next 2 weeks (Week 1 and Week 2), given the current week (Week 0) input data. We join the input information for each day of the current week to form a data vector of dimension $K$. The input for our COVID-19 prediction problem can therefore be viewed as a sequence of $K$-dimensional vectors. Suppose we are given a sequence $S = \{k_j\}_{j=1}^{L}$ of $L$ days of data, where each single-day data vector $k_j \in \mathbb{R}^{K}$ occurs at time $j$. The key ingredient of our transformer-based model is the self-attention module. Unlike RNNs, the attention mechanism does not have a recurrent structure. In order to incorporate temporal information into the inputs, we apply the original positional encoding method 36 to our data vectors.
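For readers unfamiliar with the original positional encoding of Ref. 36, the following NumPy sketch shows the usual sinusoidal form. It is an illustration rather than the authors' implementation: the function name is hypothetical, the sequence length of 7 assumes one vector per day of the current week, and the array is laid out with one row per day (the transpose of the $M \times L$ layout used in the text).

```python
import numpy as np


def sinusoidal_positional_encoding(length: int, model_dim: int) -> np.ndarray:
    """Return an array of shape (length, model_dim) of temporal vectors.

    Even columns hold sin(pos / 10000^(2i / model_dim)) and odd columns
    hold the matching cosine, as in the original transformer paper.
    """
    if model_dim % 2 != 0:
        raise ValueError("model_dim must be even")
    positions = np.arange(length)[:, None]            # shape (length, 1)
    even_idx = np.arange(0, model_dim, 2)[None, :]    # shape (1, model_dim / 2)
    angles = positions / np.power(10000.0, even_idx / model_dim)
    encoding = np.zeros((length, model_dim))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding


# Assumed sizes: 7 daily vectors, 32 model dimensions (the value given in the training details).
Z = sinusoidal_positional_encoding(length=7, model_dim=32)
```

Each row of `Z` is added to the projected data vector of the same day, so two days with identical features still receive distinct inputs.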
Alternatively, we could use other positional encoding methods, such as relative position representations 57, to provide the temporal information for each single-day data vector in our input sequence. Before passing the sequence to the attention module, we first transform the single-day data vectors using a matrix $U \in \mathbb{R}^{M \times K}$. After the transformation, for any single-day data vector $k_j$ and its corresponding time stamp $j$, the temporal vector $z_j$ and the transformed data vector $U k_j$ both reside in $\mathbb{R}^{M}$. Given a sequence of $L$ days of data $S = \{k_j\}_{j=1}^{L}$, we get

$$X = (UE + Z)^{\top},$$

where $E = [k_1, k_2, \ldots, k_L] \in \mathbb{R}^{K \times L}$ is the sequence of single-day data vectors and $Z = [z_1, z_2, \ldots, z_L] \in \mathbb{R}^{M \times L}$ is the concatenation of the temporal vectors, so that $X \in \mathbb{R}^{L \times M}$. We pass $X$ through the self-attention module. Specifically, we compute the attention output $S$ by

$$S = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{M_K}}\right)V, \qquad Q = XW^{Q},\quad K = XW^{K},\quad V = XW^{V}.$$

Here, $Q$, $K$, $V$ are the query, key and value matrices obtained by different linear transformations of $X$, and $W^{Q}, W^{K} \in \mathbb{R}^{M \times M_K}$, $W^{V} \in \mathbb{R}^{M \times M_V}$ are the weights of the respective linear transformations. In practice, multi-head self-attention increases model flexibility and is beneficial for data fitting. In multi-head self-attention, different sets of weights $\{W^{Q}_h, W^{K}_h, W^{V}_h\}_{h=1}^{H}$ are used to compute different attention outputs $S_1, S_2, \ldots, S_H$, and the final attention output is

$$S = [S_1, S_2, \ldots, S_H]\, W^{O},$$

where $W^{O} \in \mathbb{R}^{HM_V \times M}$ is an aggregation matrix. We highlight that the self-attention mechanism allows the selection of any single-day data vector, whatever its distance in time from the current one. The $j$-th column of the attention score $\mathrm{Softmax}(QK^{\top}/\sqrt{M_K})$ indicates the extent of dependency of the $j$-th single-day data vector ($k_j$) on its history. Hence, the attention mechanism captures both short-term and long-term dependencies in the sequence data. RNN-based models, on the other hand, encode the data's history sequentially via hidden representations of events, where the state at $j$ depends on that at $j-1$, which in turn depends on $j-2$, etc.
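The scaled dot-product attention and its multi-head aggregation described above can be sketched in NumPy as follows. This is a toy illustration with random weights, not the trained model: the function names are hypothetical, and the sizes (7 days, 32 model dimensions, 8 heads, and per-head dimensions of 32 / 8 = 4) are assumptions chosen to match the training details where the paper states them.

```python
import numpy as np


def softmax(scores: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head attention: Softmax(Q K^T / sqrt(M_K)) V with Q = X W_q, etc."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(W_k.shape[1]))
    return weights @ V, weights


def multi_head_attention(X, heads, W_o):
    """Concatenate the per-head outputs column-wise and mix them with W_o."""
    outputs = [self_attention(X, *head)[0] for head in heads]
    return np.concatenate(outputs, axis=1) @ W_o


rng = np.random.default_rng(0)
L, M, H = 7, 32, 8          # days, model dimension, number of heads (assumed sizes)
M_k = M_v = M // H          # per-head key and value dimensions (an assumption)

X = rng.normal(size=(L, M))  # stands in for (U E + Z)^T
heads = [tuple(rng.normal(size=(M, d)) for d in (M_k, M_k, M_v)) for _ in range(H)]
W_o = rng.normal(size=(H * M_v, M))

S = multi_head_attention(X, heads, W_o)
_, attention_weights = self_attention(X, *heads[0])
```

Every row of `attention_weights` is a probability distribution over all 7 days, which is how a single layer can relate any day of the week to any other regardless of their distance in time.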
If the RNN fails to learn sufficient information for the single-day data at $j$, the subsequent hidden representations of single-day data at any $t \geq j$ will be adversely impacted. The attention output $S$ is then fed through a position-wise feed-forward neural network, generating a hidden representation $h(j)$ of the input data sequence:

$$H = \mathrm{ReLU}(S W_1 + b_1)\, W_2 + b_2, \qquad h(j) = H(j, :),$$

where $W_1$, $W_2$, $b_1$, and $b_2$ are the corresponding weights and biases of the feed-forward neural network. The resulting matrix $H \in \mathbb{R}^{L \times M}$ contains hidden representations of all the information in the input sequence, where each row corresponds to one single-day data vector. We use this final representation as the input to our linear decoder layer and obtain our predictions of the weekly total number of deaths for the next 2 weeks.

In a typical time series prediction setting, the model forecasts only the next day's number of deaths given the current week's data. In that setting, we need to mask the attention mechanism to prevent "peeking into the future". The masking allows the $j$-th data point to attend only to the $t$-th data point where $t \leq j$. In the current article, our model predicts the weekly total number of deaths for the next 2 weeks (Week 1 and Week 2), given the current week (Week 0) input data. This setting frees us from the masking requirement, since the current week's data contain no information about the future total number of deaths. A transformer-based model allows us to stack multiple self-attention modules together, with inputs passed through each of these modules sequentially. In this way, our model is able to capture high-level dependencies. We remark that stacked RNNs/LSTMs are susceptible to exploding and vanishing gradients, which makes the stacked model more difficult to train. Figure 2 illustrates the architecture of the transformer-based model used in this project.

Mixup data augmentation.
Data augmentation is a commonly used technique to improve the generalization of deep learning models. One recent augmentation method is mixup 23,58,59 data augmentation. Let $\mathcal{X}$ be the input space of the training data and $\mathcal{Y}$ the corresponding output space, so that each training example is a pair $(x_i, y_i)$. Mixup data augmentation constructs new data by interpolating between existing data:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two examples drawn at random and $\lambda \in [0, 1]$. In our model, we use mixup data augmentation at the input layer.

Training objective. We train the transformer model using the Huber loss function. Specifically, the training objective is defined as

$$\mathcal{L}_{\delta} = \frac{1}{n} \sum_{i=1}^{n} \ell_{\delta}\big(y_i, f(x_i)\big), \qquad \ell_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^{2}, & |y - \hat{y}| \leq \delta, \\ \delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right), & \text{otherwise}, \end{cases}$$

where $f$ denotes the model, $\mathcal{X}$ and $\mathcal{Y}$ are the input space and the target space, and each sample is a pair $(x_i, y_i)$. We have $n$ samples, and $\delta$ is a tuning hyperparameter.

Training details. We use the transformer-based model (refer to the "Transformer-based model" section) for predicting both the county-level and state-level number of deaths. Specifically, we use a transformer encoder with 32 model dimensions, 1 encoder layer with 8 attention heads, and 64 feed-forward dimensions. The output of the encoder layer connects to a single linear decoder layer that predicts the weekly total number of deaths for each of the next 2 weeks (Week 1 and Week 2) from the current week's input data. We use the Adam 60 optimizer for its superior empirical performance in training neural networks and set the initial learning rate to 0.001. We halve the learning rate every 100 epochs, for a total of 500 epochs of training. Figure 3 illustrates our training process. We use both the state-level and county-level training sets to train our models. We use the smoothed number of deaths as our prediction target during training. Upon completion, all the models are used to predict the weekly total number of deaths for the next 2 weeks (Week 1 and Week 2) using the county-level dataset.
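The mixup interpolation and the Huber objective described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' training code: the function names and numbers are hypothetical, the mixing weight is passed in explicitly because the text only states that it lies in [0, 1], and the loss is averaged over samples.

```python
from typing import List, Sequence, Tuple


def mixup_pair(x_i: Sequence[float], y_i: Sequence[float],
               x_j: Sequence[float], y_j: Sequence[float],
               lam: float) -> Tuple[List[float], List[float]]:
    """Interpolate two (input, target) examples with mixing weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix


def huber_loss(y_true: Sequence[float], y_pred: Sequence[float],
               delta: float = 1.0) -> float:
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for target, pred in zip(y_true, y_pred):
        residual = abs(target - pred)
        if residual <= delta:
            total += 0.5 * residual * residual
        else:
            total += delta * (residual - 0.5 * delta)
    return total / len(y_true)


# Two illustrative training examples: flattened inputs and one weekly target each.
x_mix, y_mix = mixup_pair([1.0, 3.0], [10.0], [3.0, 7.0], [20.0], lam=0.25)

# One small residual (0.5) and one large residual (3.0) with delta = 1.
loss = huber_loss([0.0, 0.0], [0.5, 3.0], delta=1.0)
```

In the original mixup work the weight is usually drawn from a Beta distribution, which in the Python standard library would be `random.betavariate(alpha, alpha)`; the choice of `alpha` is not given in the text above and would be an additional hyperparameter.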
We sum all the county predictions of the corresponding state as the predictions of that state. We denote the model trained in this way as the County model. For the Mixup model, an additional mixup data augmentation is applied to the input layer during the training phase. Our COURAGE model is an ensemble model of two member models, the County [...]

In summary, this article presents the new model COURAGE for COVID-19 predictions at the county level and state level for the United States. We use county-level data to train COURAGE and obtain state-level predictions by aggregating high-quality county-level predictions. We improve our model using mixup augmentation and ensemble the predictions from both the County and Mixup models as our final output. To the best of our knowledge, our model is the first that uses mixup data augmentation to improve the accuracy of predicting the number of COVID-19 related deaths. Our experiments show that this new application of mixup data augmentation helps improve the model's prediction accuracy when new trends occur. COURAGE is a flexible model in that each member model (the County model and the Mixup model) can be used as a standalone model to produce accurate predictions in different periods. When both member models are ensembled together, COURAGE achieves accurate predictions across all periods. COVID-19 is a serious crisis affecting our daily life and economy. Accurate prediction of disease dynamics is a challenging task, especially when new trends emerge. We hope that our new training method can improve predictions of the number of COVID-19 deaths and provide insight for resource allocation and disease control planning.

CDC data tracking COVID-19 Economic Crisis Long-Term Effects of COVID-19 Latest Map and Case Count The New York.
Coronavirus (Covid-19) Data in the United States
An interactive web-based dashboard to track COVID-19 in real time
COVID Tracking Project
Google COVID-19 Community Mobility Reports
Interpretable sequence learning for COVID-19 forecasting
Curating a COVID-19 data repository and forecasting county-level death counts in the United States
Tracking COVID-19 using online search. npj Digit
Covasim: An agent-based model of COVID-19 dynamics and interventions
Using an agent-based model to assess K-12 school reopenings under different COVID-19 spread scenarios-United States, school year 2020/21
DeepCOVID: An operational deep learning-driven framework for explainable real-time COVID-19 forecasting
STAN: Spatio-temporal attention network for pandemic prediction using real-world evidence
Inter-series attention model for COVID-19 forecasting
Ensemble forecasts of coronavirus disease 2019 (COVID-19) in the U
New York Severely Undercounted Virus Deaths in Nursing Homes
Identifying US countries with high cumulative COVID-19 burden and their characteristics
High-resolution Spatio-temporal Model for County-level COVID-19 Activity in the
Real-time, interactive website for US-county-level COVID-19 event risk assessment
mixup: beyond empirical risk minimization
Long short-term memory
Exact analytical solutions of the Susceptible-Infected-Recovered (SIR) epidemic model and of the SIR model with equal death and birth rates
A time-dependent SIR model for COVID-19 with undetectable infected persons
The mathematics of infectious diseases
Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions
Forecast analysis of the epidemics trend of COVID-19 in the USA by a generalized fractional-order SEIR model
Management strategies and prediction of COVID-19 by a fractional order generalized SEIR model
Chinese and Italian COVID-19 outbreaks can be correctly described by a modified SIRD model
The Limits to Learning a Diffusion Model
Characterizing the spread of CoViD-19
Introduction to the Modeling and Analysis of Complex Systems (Open SUNY Textbooks
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Attention is all you need
Transformer Hawkes Process
COVID-19 Modeling
Spatiotemporal Dynamics, Nowcasting and Forecasting of COVID-19 in the United States
MOBS lab Analysis of the COVID-19 Epidemic
Fast and Accurate Forecasting of COVID-19 Deaths Using the SIkJα Model
Parameter estimation from ICC curves
DeepGLEAM: a hybrid mechanistic and deep learning model for COVID-19 forecasting
Risk mapping for COVID-19 outbreaks in Australia using mobility data
Efficiency of communities and financial markets during the 2020 pandemic
Mobility-based prediction of SARS-CoV-2 spreading
Trade-offs between mobility restrictions and transmission of SARS-CoV-2
Time dynamics of COVID-19
Self-attention with relative position representations
Augmenting Data with Mixup for Sentence Classification: An Empirical Study
Manifold mixup: Better representations by interpolating hidden states
Adam: a method for stochastic optimization

Figure 3. Overview of prediction flow. The county-level and state-level predictions for weekly total number of deaths are for the next week (Week 1) and the second week (Week 2) from the current week.

[...] performed the research; S.E. analyzed data; and [...]

The authors declare no competing interests.

Supplementary Information. The online version contains supplementary material available at https://doi.org/10.1038/s41598-021-93545-6.