key: cord-208698-gm0b8u52
authors: Fazeli, Shayan; Moatamed, Babak; Sarrafzadeh, Majid
title: Statistical Analytics and Regional Representation Learning for COVID-19 Pandemic Understanding
date: 2020-08-08
journal: nan
DOI: nan
sha: 
doc_id: 208698
cord_uid: gm0b8u52

The rapid spread of the novel coronavirus (COVID-19) has severely impacted almost all countries around the world. It has not only placed a tremendous burden on health-care providers, but has also severely affected the economy and social life. Reliable data and the results of in-depth statistical analyses provide researchers and policymakers with invaluable information to understand this pandemic and its growth pattern more clearly. This paper combines and processes an extensive collection of publicly available datasets to provide a unified information source for representing geographical regions with regard to their pandemic-related behavior. The features are grouped into various categories to account for their impact based on the higher-level concepts associated with them. This work uses several correlation analysis techniques to observe value and order relationships between features, feature groups, and COVID-19 occurrences. Dimensionality reduction techniques and projection methodologies are used to elaborate on the individual and group importance of these representative features. A specific RNN-based inference pipeline called DoubleWindowLSTM-CP is proposed in this work for predictive event modeling. It utilizes sequential patterns and enables concise record representation while using only a minimal amount of historical data. The quantitative results of our statistical analytics indicate critical patterns reflecting many of the expected collective behaviors and their associated outcomes. Predictive modeling with a DoubleWindowLSTM-CP instance exhibits efficient performance in quantitative and qualitative assessments while reducing the need for extended and reliable historical information on the pandemic.

In the early days of the year 2020, the world faced another widespread pandemic, this time of the COVID-19 strain, otherwise known as the novel coronavirus. The family of coronaviruses to which this RNA virus belongs can cause respiratory tract infections of various severities. These infections range from cases of the common cold to more lethal forms. Many of the confirmed cases and deaths reported due to COVID-19 showed evidence of severe forms of the aforementioned infections [1], [2], [3]. The origin of this new virus is still not clearly understood; however, it is believed to be mainly connected to the interactions between humans and particular animal species [3].

The rapid spread of this virus has cost many lives and overwhelmed health-care providers. It has also led to worldwide difficulties and has had considerable negative economic impacts. It is expected to have adverse effects on mental health as well due to prolonged shutdowns and quarantines, and guidelines have been published to help minimize this negative impact [4].

In this work, we have gathered, processed, and combined several well-known publicly available datasets on the COVID-19 outbreak in the United States. The idea is to provide a reliable source of information derived from a wide range of sources on important features describing a region and its population from various perspectives. These features primarily have to do with demographics, socio-economic, and public health aspects of the US regions. They are chosen in this manner because it is plausible to assume that they can be potential indicators of commonalities between the affected areas. Even though finding causality is not the objective of this work, our analyses attempt to shed light on these possible commonalities that allow public health researchers to obtain a better perspective on the nature of this pandemic and the potential factors contributing to a slower outbreak. This is vitally important as the critical role of proper policies enforced at the proper time is evident now more than ever.

There has been widespread attention to the design and utilization of Artificial Intelligence-based tools to obtain a better understanding of this pandemic. Accordingly, we present a neural architecture with recurrent neural networks at its core to allow the machine to learn to predict pandemic events in the near future, given a short window of historical information on static and dynamic regional features. The main assumption that this work attempts to empirically validate is that concise pandemic-related region-based representations can be learned and leveraged to obtain accurate outbreak event prediction with only minimal use of the historical information related to the outbreak. Aside from the theoretical importance, an essential application of this framework is when the reported historical pandemic information, e.g., number of cases, is not reliable. An example of this is when a region discovers a problem in its reporting scheme that makes the historical information on the pandemic inaccurate due to overestimation or underestimation. Such unreliability will severely affect the models which use this historical information as the core of their analysis.

In summary, the contributions of this work are as follows:

• Gathering and providing a thorough collection of datasets for the fine-grained representation of US counties as subregions. This collection includes data from various US bureaus, health organizations, the Centers for Disease Control and Prevention, and COVID-19 epidemic information.
• Evaluation of the informativeness of individual features in distinguishing between regions.
• Correlation analyses investigating monotonic and non-monotonic relationships between several key features and the pandemic outcomes.
• Proposing a neural architecture for accurate short-term predictive modeling of the COVID-19 pandemic with minimal use of historical data by leveraging the automatically learned region representations.

Given the importance of open research in dealing with the COVID-19 pandemic, we have also designed OLIVIA [5]. OLIVIA is our online interactive platform with various utilities for COVID-19 event monitoring and analytics, which allows both expert researchers and users with little or no scientific background to study outbreak events and regional characteristics. The code for this work and the collection of datasets are also available.

Since the beginning of the COVID-19 pandemic, there have been efforts in utilizing computerized advancements in controlling and understanding this disease. An example is the applications developed to monitor the patients' locations and routes of movement. A notable work in this area is MIT's SafePaths application [6] that contains interview and profiling capability for places and paths. It is worthwhile to mention that these platforms have also caused worries regarding maintaining patients' privacy [7] .

To provide researchers and government agencies with frequently updated monitoring information regarding the coronavirus, 1point3acres team has provided an API that allows access to the daily updated numbers of coronavirus cases [8] , [9] . Several datasets such as [10] are also released to the public.

A large corpus of scientific articles on coronaviruses is released as well as a result of a collaboration between AllenAI Institute, Microsoft Research, Chan-Zuckerberg Initiative, NIH, and the White House [11] .

There have been projects, such as a work at Johns Hopkins University, that are focused on providing US county-level summaries of COVID-19 pandemic information and important attributes [12], [13].

The information in social networks has also been used in predicting the number of COVID-19 cases in mainland China [14]. The work in [15] is also focused on an AI-based approach for predicting mortality risk in COVID-19 patients.

There have been numerous approaches to model the pandemic using AI that have the historical outbreak information at the core of their analyses, such as the modified versions of the SEIR model and ARIMA-based analysis [13], [16], [17], [18], [19], [20]. This work is distinguished from the mentioned projects and the majority of statistical works in this area in that it targets the role of region-based features in the spatio-temporal analysis of the pandemic with minimal use of historical data on the outbreak events. The area unit of this work is the US county, which enables a more fine-grained prediction scheme compared to the other works, which have mostly targeted state-level analytics. To the best of our knowledge, the works in [16] and [21] are the only attempts at county-level modeling of the disease dynamics. In [16], the authors have proposed a non-parametric model for epidemic data that incorporates area-level characteristics in the SIR model. The work in [21] uses a combination of iterated filtering and the Ensemble Adjustment Kalman filter for tuning their model, and their approach is based on a county-level SEIR model. The empirical results show that our approach outperforms these models on the evaluation benchmarks while providing a framework for utilizing deep learning in analyzing and modeling short-term pandemic events. We have made our code and data publicly available and regularly maintained to help expedite the research in this area.

This study focuses on analyzing the regions of the United States with statistical and AI-based approaches to obtain results and representations associated with their pandemic-related behavior. A primary and essential step in doing so is to prepare a dataset covering a wide range of information topics, from socio-economic to regional mobility reports. More details regarding the primary data sources from which we have obtained information for this work's dataset are elaborated upon hereunder.

1) COVID-19 Daily Information per County: Our first step towards the mentioned objective is to gather the daily COVID-19 outbreak data. This data should include the number of cases that are confirmed to be caused by the novel coronavirus and its associated death toll. We are using the publicly accessible dataset API in [8], [9] to fetch the relevant data records. The table of data obtained using this API contains the numerical information along with dates corresponding to each record, and each document includes the number of confirmed cases and the number of deaths that occurred due to COVID-19 on that date. It also includes the number of recoveries from COVID-19 in the same format. This dataset's significance is that it provides us with a detailed and high-resolution temporal trajectory of the COVID-19 outbreak in different urban regions across the United States. Using the dates, one can constitute a set of time-series for every county and monitor the outbreak along with the other metadata to make relevant inferences.
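As an illustration of how dated records can be turned into per-county time-series, the following minimal sketch groups records by county, orders them chronologically, and accumulates daily counts. The record layout (county name, ISO date, daily cases, daily deaths) and the sample values are hypothetical and are not the schema of the API in [8], [9].

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical records: (county, ISO date, new cases that day, new deaths that day).
records = [
    ("Los Angeles, CA", "2020-03-02", 0, 0),
    ("Los Angeles, CA", "2020-03-01", 1, 0),
    ("Los Angeles, CA", "2020-03-03", 3, 1),
    ("King, WA", "2020-03-01", 10, 2),
    ("King, WA", "2020-03-02", 4, 1),
]

def build_time_series(rows):
    """Group records by county and sort each group chronologically."""
    series = defaultdict(list)
    for county, date, cases, deaths in rows:
        series[county].append((date, cases, deaths))
    for county in series:
        series[county].sort()  # ISO-formatted dates sort correctly as strings
    return dict(series)

def cumulative(values):
    """Turn a series of daily counts into a running total."""
    return list(accumulate(values))

county_series = build_time_series(records)
la_daily_cases = [cases for _, cases, _ in county_series["Los Angeles, CA"]]
la_cumulative_cases = cumulative(la_daily_cases)
```

Keeping one ordered list per county makes it straightforward to align the outbreak trajectory with the static and dynamic regional features described in the remainder of this section.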

2) US Census Demographic Data: The US Census Demographic Data gathered by the US Census Bureau [22] plays a critical role in our analysis by providing us with necessary information on each region's population. Additionally, this information includes specific features such as the types of work people in that region mainly take part in, their income levels, and other invaluable demographical and social information.

3) US County-level Mortality: The fluctuations in the mortality rate of a region are also a potentially critical feature in pandemic analytics. The US county-level mortality dataset was incorporated into our collection to add the high-resolution mortality rate time-series throughout the years [23], [24]. The age-standardized mortality rates provide us with information on variables, the values of which can be considered as the effects of specific causes. It is crucial since some of these causes might have contributed to the faster spread of COVID-19 in different regions [25].

4) US County-Level Diversity Index: Another dataset that offers a race-based breakdown of the county populations is available at [26] with the diversity index values corresponding to the notion of ecological entropy. For a particular region, if K races comprise its population, the value of diversity index can be computed using the following formula:

In the above formula, N is the total population and n_i is the number of people from race i. This formula represents the probability p, which means that if we randomly pick two persons from this cohort, they are of different races with probability p. In addition to that, we have the percentages of different races in the regional population as well.

5) US Droughts by County: Another source of valuable information regarding the land area and water resources per county is the data gathered by the US drought monitor [27], [28]. This data is incorporated into our collection as well.
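The diversity index described above can be sketched numerically. The code below assumes that the two persons are drawn without replacement, which gives p = 1 - Σ_i n_i(n_i - 1) / (N(N - 1)); this specific form is an assumption consistent with the description of p, and the function name and the sample counts are hypothetical.

```python
def diversity_index(counts):
    """Probability that two distinct, randomly chosen persons belong to different races.

    `counts` holds n_i, the number of people of each race in the region.
    Assumes sampling without replacement (an assumption, see the lead-in).
    """
    total = sum(counts)  # N, the total population
    if total < 2:
        raise ValueError("at least two persons are needed to form a pair")
    same_race_pairs = sum(n * (n - 1) for n in counts)
    return 1.0 - same_race_pairs / (total * (total - 1))

# Hypothetical county with two equally sized groups.
p_balanced = diversity_index([50, 50])
# Hypothetical county with a single group: no pair can differ.
p_uniform = diversity_index([100])
```

A value near 0 indicates a homogeneous population, while values closer to 1 indicate that randomly chosen pairs almost always differ.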

6) Election: Based on the 2016 US Presidential Election, a breakdown of county populations' tendencies to vote for the main political parties is available [29]. These records are added to our collection as the Democratic-Republican breakdown of regional voters can reflect socio-economic and demographical features that form the underlying reasons for the regional voting tendencies.

7) ICU Beds: Since COVID-19 imposes significant problems in terms of the extensive use of ICU beds and medical resources such as mechanical ventilators, having access to the number of ICU beds in each county is helpful. This information offers a glance at the medical care capacity of each region and its potential to provide care for the patients in ICUs [30]. It could be argued that having knowledge of the ICU-related capacity of regional healthcare providers can, to some extent, represent the amount of their COVID-19 related resources, such as ventilators and other needed resources.

8) Household Income: The aggregate dataset on central statistical values on the US household income per county (including average, median, and standard deviation) is used to provide information on the financial well-being of the affected regions' occupants [31].

9) COVID-19 Hospitalizations and Influenza Activity Level: Aside from the socio-economical and demographical features of a region, the number of active and potential COVID-19 cases is a critical factor. This information can be leveraged to provide a possible threat level for the region. These records are made available by CDC for specific areas and are incorporated into our collection as well [32], [33].

10) Google Mobility Reports: The COVID-19 virus is highly contagious. Therefore, the self-quarantine and social distancing measures are principal effective methodologies in bolstering the prevention efforts. Our collection includes Google's mobility reports obtained from [12]. These records elaborate on the mobility levels across US regions, which are broken down into the following categories of mobility:
1) Retail and Recreation
2) Grocery and Pharmacy
3) Parks
4) Transit Stations
5) Workplaces
6) Residential
In addition, we have computed a compliance measure that has to do with the overall compliance with the shelter at home criteria:

In the above formula, m_i is the mobility report for the i-th mobility category. This value is computed through time to provide an overall measure of mobility through time. The compliance measures of +1 and −1 mean +100% and −100% changes from the baseline mobility behavior, respectively.

11) Food Businesses: Restaurants and food businesses are affected severely by the economic impacts of this outbreak. At the same time, they have not ceased to provide services that are essential and required by many. To reach a proper perspective of the food business in each region, we have prepared another dataset based on records in [34] to provide statistics on regional restaurant revenue and employment. Analysis of the restaurants' status is important in the sense that they are mostly public places that host large gatherings, and in the time of a pandemic, their role is critical.

12) Physical Activity and Life Expectancy: Various features have been selected from the dataset in [35] to reflect on the obesity and physical activity representation for different US regions. These features include the last prevalence survey and the changes in patterns. Also, Life Expectancy related features are valuable information for representing each region. They are included as well in our analyses.

13) Diabetes: Different features to represent a region according to the diabetes-related characteristics were selected from the data in [35] . These include age-standardized features and clusters that have to do with diabetes-related diagnoses.

14) Drinking Habits: Information on regional drinking habits from 2005-2012 has also been used in this work [35] . This information includes the proportions of different categories of drinkers clustered by sex and age. The categories are as follows:

• Any: a minimum of one drink of any alcoholic beverage per 30 days
• Heavy: a minimum average of one drink per day for women and two drinks for men per 30 days
• Binge: a minimum of four drinks for women and five drinks for men on a single occasion at least once per 30 days

15) Analytics: In what follows, the analytical techniques that we have designed and used in this work are explained. To draw meaning from the data that we have at hand, we have designed and utilized a variety of techniques. These methodologies range from traditional statistical methodologies to the design and testing of deep learning inference pipelines for event prediction. We select a set of representative features to use in our analytics from the gathered collection of datasets. More details on the nature of these features are shown in Table I.

16) Feature Informativeness for Sub-region Representation: An important question that is raised in analyzing a dataset with well-defined categories of features is how important these features are in describing the entities associated with them. From the particular perspective of enabling the differentiation between two regions, it can be said that a measure of importance is the contribution of each one of these selected features to the overall variation in data points. The boundary case is that if a feature always has the same value, it is not informative, as its distribution has zero entropy. To begin with, we associate a mathematical vector with each data point, which contains the values of all its dynamic and static features associated with a specific date and location. Since we are mainly targeting US counties in this study, each record would be associated with a US county at a specific date. We then use Linear Principal Component Analysis [36] to reduce the dimensionality of these data points and to evaluate the importance of the selected features in terms of their contribution to the overall variation.
[Table II: The equations for the three main correlation analysis techniques used in this work, namely Pearson, Spearman, and Kendall.]

Results show that in order to retain over 98% of the original variance, a minimum of 55 principal components should be considered. Each one of these components is found as a linear combination of the original set of features, and that along with the percentage of variance along the axis of that component can be used as a measure of performance. To be more specific, considering n features and m data points that result in p PCA components to retain 98% of the variation, we will have:

And u_i is the total variance along the axis of the i-th PCA component. This can be thought of as a measure of importance for the PCA components, and the absolute values of the entries of v_i can be considered as the importance of each original feature's contribution to its making. Therefore, we will have the following measure of informativeness defined for our features:

The features can be sorted according to these values, and the categories can also be considered in their relevant importance. Note that this is just one definition of informativeness; for example, certain features might not vary a lot, but when they do, they are potentially associated with severe changes in the COVID-19 events. Therefore, the importance score that has been captured here merely has to do with how better we are able to distinguish between locations based on a feature.
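The PCA-based informativeness described above can be sketched as follows. This is an illustrative reading rather than the exact formulation of this work: it assumes that the features are standardized before PCA, that u_i is the fraction of variance explained by component i, and that the score of feature j is Σ_i u_i |v_ij| over the p retained components. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def pca_informativeness(X, variance_to_retain=0.98):
    """Return (p, scores): number of retained components and per-feature scores.

    X has shape (m data points, n features). Assumes at least two features
    and at least one non-constant feature.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0                      # leave constant features at zero
    Z = centered / std                       # standardization is an assumption

    cov = np.cov(Z, rowvar=False)            # (n, n) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort components by variance
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]              # column i is v_i

    u = eigvals / eigvals.sum()              # explained-variance ratios u_i
    p = int(np.searchsorted(np.cumsum(u), variance_to_retain)) + 1
    p = min(p, len(u))

    scores = np.abs(eigvecs[:, :p]) @ u[:p]  # sum_i u_i * |v_ij| per feature j
    return p, scores

# Synthetic example: two varying features and one constant feature.
rng = np.random.default_rng(0)
data = np.column_stack([rng.normal(size=200), rng.normal(size=200), np.full(200, 3.0)])
n_components, feature_scores = pca_informativeness(data)
ranking = np.argsort(feature_scores)[::-1]   # most informative feature first
```

In this sketch the constant feature receives a score of approximately zero, which matches the boundary case discussed above.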

In order to better understand the co-occurrences of the features in our input dataset and their corresponding COVID-19 related events, we have performed an in-depth correlation analysis on them. We have considered four principal measures of correlation, namely: Pearson, Kendall, Histogram Intersection, and Spearman, as described in Table II. We have used the Pearson correlation coefficient along with the p-values to shed light on the presence or absence of a significant relationship between the values of each specific feature and each category of pandemic outcome. We have also computed nonparametric Spearman rank correlation coefficients between any two of our random variables. This value would be computed as the Pearson measure of the raw values converted to their ranks. The formulation is shown in Table II, in which d_i is the difference in paired ranks. This coefficient measures the strength of the association between the values of these random variables in terms of their ranks. Since many of the relationships in our dataset can be intuitively thought of as monotonic, these values are particularly important. To better understand the concordance and discordance, Kendall correlation is computed as well. In the formulation shown in Table II, m_1 and m_2 are the numbers of concordant and discordant pairs of values, respectively. Normalized Histogram Intersection is another methodology directly targeting the distributions of these variables. The degree of their overlap represents how closely the distribution of x follows the distribution of y. It has also been utilized in finding the results of this section. Mutual information has also been used to provide additional information on such relationships.
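The correlation measures discussed above can be illustrated with a small, dependency-free sketch. The exact forms used in Table II are not reproduced here, so the choices below are assumptions: Spearman is computed as Pearson on average ranks, Kendall uses the tau-a form (m_1 - m_2) / (n(n - 1)/2), and the histogram intersection uses a shared set of equal-width bins. All function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a from concordant (m1) and discordant (m2) pair counts."""
    n, m1, m2 = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                m1 += 1
            elif sign < 0:
                m2 += 1
    return (m1 - m2) / (n * (n - 1) / 2)

def histogram_intersection(x, y, bins=10):
    """Overlap of the normalized histograms of x and y on shared bins (0 to 1)."""
    lo, hi = min(min(x), min(y)), max(max(x), max(y))
    width = (hi - lo) / bins or 1.0
    def hist(values):
        h = [0] * bins
        for v in values:
            h[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in h]
    return sum(min(a, b) for a, b in zip(hist(x), hist(y)))

# A monotonic but non-linear relationship: rank-based measures reach 1, Pearson does not.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
results = (pearson(x, y), spearman(x, y), kendall_tau_a(x, y))
```

The example highlights why rank-based measures are useful for the monotonic relationships mentioned above: they remain at their maximum even when the relationship is not linear.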

In continuation of our statistical analyses on COVID-19 event distributions, we have designed a neural inference pipeline to help with the effective utilization of both learned deep representations and the embedded sequential information in the dataset.

In this work, we introduce a neural architecture, which is trained and used for COVID-19 event prediction across the US regions. The Double Window Long Short Term Memory COVID-19 Predictor (DWLSTM-CP) comprises multiple components for domain mapping and deep processing. First, using its dynamic projection, which is a fully connected layer, the dynamic feature vectors that reflect temporal dynamics are mapped to a new space and represented with a more concise mathematical vector.

This step is essential due to the fact that an optimal deep inference pipeline is the one that retains only the information required by each level and minimizes redundancies [37]. The projections are designed to help the network achieve this objective. These are then fed to the LSTM core for processing. Each one of these outputs is concatenated with the projected version of static features, F_static_projection(x_static), and fed to the output regression unit. The outputs are compared with the ground truth time-series, and a weighted Mean Squared Error loss along with norm-based regularization is used to guide the training process while encouraging more focus on the points with large values. The overall pipeline is shown in Figure 1.
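The data flow described above (dynamic projection, LSTM core, concatenation with projected static features, output regression) can be sketched as an untrained NumPy forward pass. This is a simplified illustration and not the DWLSTM-CP implementation: the class name, layer sizes, tanh activations on the projections, and random initialization are all assumptions, and training and regularization are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RegionLSTMSketch:
    """Hypothetical forward pass: projections -> LSTM -> concat -> regression."""

    def __init__(self, n_dynamic, n_static, d_proj=8, s_proj=4, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.1, size=shape)
        self.hidden = hidden
        # Fully connected projections for dynamic and static features.
        self.W_dyn, self.b_dyn = init(n_dynamic, d_proj), np.zeros(d_proj)
        self.W_sta, self.b_sta = init(n_static, s_proj), np.zeros(s_proj)
        # LSTM parameters; the four gates are stacked along the last axis.
        self.W_x, self.W_h = init(d_proj, 4 * hidden), init(hidden, 4 * hidden)
        self.b = np.zeros(4 * hidden)
        # Output regression unit applied to [LSTM output, projected static features].
        self.W_out, self.b_out = init(hidden + s_proj, 1), np.zeros(1)

    def forward(self, x_dynamic, x_static):
        """x_dynamic: (T, n_dynamic); x_static: (n_static,). Returns T predictions."""
        h, c = np.zeros(self.hidden), np.zeros(self.hidden)
        static_repr = np.tanh(x_static @ self.W_sta + self.b_sta)
        outputs = []
        for x_t in x_dynamic:
            z_t = np.tanh(x_t @ self.W_dyn + self.b_dyn)      # dynamic projection
            gates = z_t @ self.W_x + h @ self.W_h + self.b
            i, f, g, o = np.split(gates, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)      # cell state update
            h = sigmoid(o) * np.tanh(c)                       # LSTM output at step t
            y_t = np.concatenate([h, static_repr]) @ self.W_out + self.b_out
            outputs.append(y_t[0])
        return np.array(outputs)

# Hypothetical sizes: a 7-day window with 5 dynamic and 3 static features.
model = RegionLSTMSketch(n_dynamic=5, n_static=3)
predictions = model.forward(np.ones((7, 5)), np.ones(3))
```

Projecting the static features once and reusing them at every time step keeps the sequence model small, since the LSTM only has to process the projected dynamic inputs.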

It is worth mentioning that this approach leverages and utilizes all of the features discussed in the previous sections. It learns representations that take various factors into account, from different categories of mobility and activities to socio-economic information, to make accurate short-term predictions while reducing the need for lengthy historical data on the pandemic outcomes. There are many occasions in which accurate and reliable historical data on the pandemic is not available due to a variety of reasons (e.g., a problem in the reporting scheme), which motivates approaches with less dependency on it.

The results on our regional dataset in terms of feature importance from the principal component analysis indicate the following features contribute to the overall representation significantly:

• Restaurant businesses, namely the contribution to the state economy and the count of food and beverage locations. Even though we only have access to state-level data, its importance can be intuitively argued as it reflects on the counties that the state includes. This is due to the fact that the status of restaurants plays an essential role in such pandemics.
• The influenza activity level is another critical feature in the analysis. Given the similarity of symptoms between Influenza and COVID-19 infection, monitoring Influenza activity is very helpful for COVID-19 pandemic understanding.
• Diversity index, which signifies the probability of two randomly selected persons belonging to different races from a population, also plays a crucial role in representing the regions.
• The changes in the mortality rate that is not associated with COVID-19 are beneficial as well. This is also intuitively arguable as it can be thought of as a measure of mortality related sensitivity for the regions.

Figure 2 shows how the projected points scatter after the PCA as well. The results indicate that 55 PCA components are required to retain over 98% of the variance of the dataset, and Figure 3 shows the progress of covering the variance by adding components.

The results of correlation analyses help empirically and quantitatively validate many of the relationships mentioned in the known hypotheses regarding the COVID-19 outbreak. The Pearson correlation of −28.67% with the p-value of 0.046 indicates a significant relationship between the percentage of food businesses in the state economy and the average cumulative death count in its counties. Another example is the value of the Spearman correlation coefficients between the different types of commute to work associated with each county and the values of the pandemic-related events. From Table IV, it is apparent that there is a positive relationship between the proportion of public transit as a method of commute to work and the spread of COVID-19 in the region. Another example is the Pearson correlation between the ratio of different races in regions and the pandemic outcomes. It is known that COVID-19 is affecting the African American community disproportionately [38]. Accordingly, the values in Table V show a higher correlation between the ratio of African Americans and the severity of COVID-19 outcomes.

Fig. 3. Cumulative covered variance by using PCA components sorted by their informativeness. The cumulative amount of variance covered by using up to a certain number of PCA components. This is assuming that they are sorted by their corresponding eigenvalue, meaning that the first component contributes more to variance coverage than the ones selected after it.

The collection of datasets in this work provides a sufficient number of records for enabling the efficient use of Artificial Intelligence for spatio-temporal representation learning. We show this by training instances of our proposed DoubleWindowLSTM architecture on the two main short-term tasks regarding epidemic modeling, namely, new daily death and case counts. In our dataset, we considered the US COVID-19 information from March 1st, 2020 to July 22nd, 2020, in which the July data is used for our evaluations, and the rest is leveraged for training and cross-validation. The objective with which the proposed architecture was trained is a multi-step weighted Mean Squared Error (MSE) loss, which helps to minimize a notion of distance between the predictions and the target ground-truth while encouraging more focus (by assigning larger weights) on the windows that exhibit larger values. The associated thresholds are empirically tuned and set prior to the training procedure. The learning curves for both experiments indicate clear convergence in Figure 4.
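A multi-step weighted MSE of the kind described above can be sketched as follows. The weighting rule is an assumption made for illustration: a window receives a larger weight when its largest ground-truth value reaches a threshold. The function name, the threshold, and the weight values are hypothetical and are not the tuned values of this work.

```python
def weighted_multistep_mse(pred_windows, target_windows, threshold=10.0, large_weight=5.0):
    """Weighted average of per-window MSEs over multi-step prediction windows.

    Each window is a sequence of values for consecutive days. Windows whose
    ground truth reaches `threshold` are weighted by `large_weight`, others by 1.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for pred, target in zip(pred_windows, target_windows):
        mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
        weight = large_weight if max(target) >= threshold else 1.0
        weighted_sum += weight * mse
        weight_total += weight
    return weighted_sum / weight_total

# Two hypothetical two-day windows: a quiet county and a county with many new cases.
loss = weighted_multistep_mse(
    pred_windows=[[0.0, 0.0], [20.0, 20.0]],
    target_windows=[[1.0, 1.0], [22.0, 22.0]],
)
```

Here the second window has a per-window MSE of 4 and a weight of 5, so it dominates the loss, which is the intended effect of emphasizing windows with larger values.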

To quantitatively evaluate the performance, we have reported the Root Mean Square Error (RMSE) for the prediction of new daily deaths and cases due to COVID-19 in Table VI. For comparison, we have used the ARIMA model as well, with the parameters set according to the work in [20], which fine-tuned this scheme for forecasting the dynamics of COVID-19 cases in Europe. We have also found the best ARIMA model in each scenario according to Augmented Dickey-Fuller (ADF) tests and based on the Akaike information criterion (AIC), and reported the results denoted by ARIMA*. To compare with other works in this area, we had to aggregate our county-level findings to form estimators for state-level prediction. From the results reported in Table VII, it is interesting to observe that the aggregated estimator based on our model achieves strong evaluation results comparable to the models that achieve the highest scores, while clearly outperforming the other two models that are inherently county-level, namely, the works in [16] and [21]. The predictions for several regions exhibiting different severities are shown in Figure 5. These results can help the reader in a qualitative assessment of the model performances, in which the outputs of our approach demonstrate high stability and follow the trajectory of the ground-truth with precision.
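The evaluation steps above, namely computing RMSE and aggregating county-level predictions into state-level estimators, can be illustrated with a short sketch. Aggregation by summing the daily county predictions within each state is an assumption made for illustration, and the county names, state mapping, and values are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def aggregate_to_state(county_predictions, county_to_state):
    """Sum daily county-level predictions into one series per state."""
    state_series = {}
    for county, series in county_predictions.items():
        state = county_to_state[county]
        if state not in state_series:
            state_series[state] = list(series)
        else:
            state_series[state] = [a + b for a, b in zip(state_series[state], series)]
    return state_series

# Hypothetical two-day predictions of new daily cases for three counties.
county_predictions = {"County A": [1, 2], "County B": [3, 4], "County C": [5, 6]}
county_to_state = {"County A": "CA", "County B": "CA", "County C": "WA"}

state_predictions = aggregate_to_state(county_predictions, county_to_state)
ca_error = rmse(state_predictions["CA"], [4, 8])  # compared against hypothetical ground truth
```

Because the state-level series is a sum of county-level series, county errors of opposite sign can partially cancel, so county-level and state-level RMSE values are not directly comparable.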

The primary objective of this work is focused on leveraging regional representations for accurate short-term predictive modeling of the epidemic with minimal use of historical data. It is plausible to assume that the features chosen in this work, which reflect on different characteristics of a region, include valuable information for efficient prediction of pandemic events. The static features include various socio-economic and demographical properties associated with a region and its population. Combined with the dynamic set of features such as influenza activity level and mobility patterns, this information was leveraged along with a short track of pandemic time-series for predictive modeling. We do not claim that the data points coming from this domain are statistically sufficient for the pandemic event prediction tasks; however, empirical results indicate that they can be effectively utilized for these objectives. There are occurrences outside of this domain that can impact the outcomes (e.g., the initial impact of a large number of infected people arriving in a specific location is not initially captured by our scheme). Nevertheless, the results indicate that the data points coming solely from this work's domain can help in the effective knowledge extraction regarding the current and future values of pandemic-related time-series. The result section elaborated on the statistical findings and introduced a measure of feature importance. In addition, a neural network architecture that has a long short-term memory configured recurrent neural network in its core was introduced to serve as a new baseline for COVID-19 event prediction.

Since the beginning of the COVID-19 outbreak, there have been works focusing on gathering information or performing statistical analysis related to this epidemic. This work is focused on learning and analysis of the high-resolution spatiotemporal representation of urban areas. We provide a collection of datasets and select a large number of features to reflect on various demographics, socio-economics, mobility, and pandemic information. We have used statistical analysis techniques to investigate the relationships between individual features and the epidemic, while also considering the contribution of such features to the overall representation power. We have also proposed a deep learning framework to validate the idea that such region-based representations can be leveraged to obtain accurate predictions of the epidemic trajectories while using only a minimal amount of historical data on the outbreak events (e.g., number of cases). Even though our model is trained with the objective of providing county-level predictions, we have aggregated these county-level predictions and used the resulting state-level estimators to evaluate the loss on the most recent data. In Table 6, we have compared these results with the information on the similar performance measure of the eight COVID-19 prediction works that perform state-level inference. It can be seen that our framework provides a simple solution which outperforms the other county-level methodologies (namely, [16] and [21]) on this task.

Clearly defined policies, enforced at the proper time, are important for alleviating the adverse impacts of a pandemic in different areas. One of the important applications of this work is in providing researchers and agencies with a more in-depth understanding of the co-occurrence of idiosyncratic patterns associated with regions and the predicted pattern of the outbreak. This information can be used to assist policymakers, for example, in making the details of their decisions, such as the intensity and length of lockdowns, more fine-grained and attuned to regional needs. The ability to predict pandemic-related occurrences (e.g., number of deaths, cases, and recoveries) is another valuable application of this work. This knowledge will provide hospitals and healthcare facilities with targeted information to help with the efficient allocation of their resources. Another important application of this work arises when accurate and reliable historical data on the epidemic events is not available. For example, if previous reports on the number of cases and deaths due to the pandemic are found to be unreliable, such a finding will have little effect on our solution, since it depends less on historical epidemic data than models that place such data at the core of their analyses.

This study has several limitations that should be discussed. The initial notion of feature informativeness, which was discussed in the earlier sections of this article, mainly concerns the contribution of features to the variance in representing regions and areas. Given the nature of this study, combining this notion with the relationship between the features and the pandemic, and incorporating more in-depth prior domain knowledge, can lead to a better definition of feature importance. Our methodology provides a means to use region-based representations to obtain predictions with less reliance on the historical epidemic data. Nevertheless, generalizing the network architecture in this work and providing access to more extended and reliable historical data, if possible, can be an improvement and is worthwhile as a potential future direction. Utilizing attention-based methodologies and other interpretation techniques with the pre-trained weights is also a well-suited future direction to better understand what the models learn.

In this study, we gathered a collection of datasets on a wide range of features associated with US regions. Our approach then used various statistical techniques and machine learning to measure the relationship between these regional representations and the pandemic time-series events and to perform predictive modeling with minimal use of historical data on the epidemic. Both quantitative and qualitative evaluations were used in assessing the efficacy of our design, which renders it suitable for applications in various areas related to pandemic understanding and control. This is crucial since the information on the patterns and predictions related to an outbreak plays a critical role in elaborate preparations for the pandemic, such as improving the allocation of resources in healthcare systems that would otherwise be overwhelmed by an unexpected number of cases.

It is important for a predictive modeling approach to pandemics to be useful when the epidemic is in its early stages. To evaluate the performance of our approach in this setting, we have also performed experiments on the early stages of the COVID-19 pandemic. This particular dataset covers the date range from March 1st, 2020 to May 5th, 2020. Using a k-fold validation approach, the performance of the model is evaluated and reported in Table ? ?. It is shown that the network performs significantly better than ARIMA*, the details of which were discussed in the article. Please note that ARIMA-based models have shown success in predicting COVID-19 events in the literature.
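As a rough illustration of k-fold validation, the sketch below partitions a set of county identifiers into k folds, each of which serves once as the validation set. Whether the folds in this work are formed over counties, time windows, or individual records is not stated here, so splitting by county is an assumption, as are the function name `k_fold_splits`, the placeholder identifiers, and the choice of k = 5.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, validation) lists for k roughly equal folds."""
    shuffled = list(items)
    # A fixed seed keeps the partition reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation


# Hypothetical county identifiers.
counties = [f"county_{i}" for i in range(10)]
for fold_index, (train, validation) in enumerate(k_fold_splits(counties, k=5)):
    # A model would be fitted on `train` and scored on `validation` here.
    print(fold_index, len(train), len(validation))
```

The per-fold scores would then be averaged to give a single reported figure.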

TABLE IX: This table shows the average daily root mean square error for the DWLSTM model compared to the ARIMA* predictions. The evaluations are done using a dataset that contains only the early stages of the COVID-19 outbreak in the US. The objective in the following experiments was to predict the new daily confirmed COVID-19

In the first appendix, the performance of the model on the two main tasks regarding COVID-19 predictions and simulations was demonstrated. In addition, Table X shows the performance of the model on the task of predicting the normalized cumulative death counts attributed to the pandemic for each county. Table X also shows how the performance level varies with the length of the prediction window. The results suggest that in the early stages, since the available data is limited, choosing smaller windows helps performance. However, the results in the article show that as more data becomes available, the performance on longer windows can be significantly improved.
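The dependence of the error on the prediction window length can be made concrete with a small sketch. It assumes one interpretation of "average daily root mean square error": the RMSE is computed across counties for each day of the prediction window and then averaged over the days. The function names, county labels, and values are hypothetical.

```python
from math import sqrt


def average_daily_rmse(predicted, observed):
    """Average over days of the per-day RMSE computed across counties.

    `predicted` and `observed` map a county label to a list of daily
    values covering the same prediction window.
    """
    counties = sorted(predicted)
    window = len(predicted[counties[0]])
    daily_rmse = []
    for day in range(window):
        squared = [(predicted[c][day] - observed[c][day]) ** 2 for c in counties]
        daily_rmse.append(sqrt(sum(squared) / len(squared)))
    return sum(daily_rmse) / window


def truncate(series_by_county, window):
    """Keep only the first `window` days of every county series."""
    return {c: s[:window] for c, s in series_by_county.items()}


# Hypothetical three-day predictions and observations for two counties.
predicted = {"A": [1.0, 2.0, 3.0], "B": [2.0, 2.0, 2.0]}
observed = {"A": [1.0, 4.0, 3.0], "B": [2.0, 2.0, 5.0]}

# Evaluate the same predictions under increasing window lengths.
scores = {
    w: average_daily_rmse(truncate(predicted, w), truncate(observed, w))
    for w in (1, 2, 3)
}
print(scores)
```

In this toy data the error grows with the window length because the later days are predicted less accurately, which mirrors the early-stage behavior described above.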

As an experiment to show the impact of the highly affected areas on training the machine learning model in our approach, we removed the counties of New York state from the dataset and report the results in Table ? ?. The results indicate that, in terms of quantitative assessment, the absence of the highly affected areas causes a significant drop in the loss values. However, the qualitative analysis showed that the models do not perform well in the case of rising values, as the amount of information available on such cases to train the network on is fairly limited. This causes both families of models to be biased toward predictions that underestimate the target values.
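Removing one state's counties from a county-level dataset can be sketched as a simple filter. The sketch assumes that each record carries a five-digit county FIPS code in a field named `county_fips`, which is a hypothetical name, and relies on 36 being the state FIPS code of New York; the records themselves are made up for illustration.

```python
NEW_YORK_STATE_FIPS = "36"


def exclude_state(records, state_fips):
    """Return the records whose county FIPS code is outside the given state."""
    return [r for r in records if not r["county_fips"].startswith(state_fips)]


# Hypothetical records; only the FIPS prefix matters for the filter.
records = [
    {"county_fips": "36061", "new_cases": 120},
    {"county_fips": "36047", "new_cases": 95},
    {"county_fips": "06037", "new_cases": 40},
    {"county_fips": "17031", "new_cases": 33},
]

without_new_york = exclude_state(records, NEW_YORK_STATE_FIPS)
print([r["county_fips"] for r in without_new_york])
```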

Virus taxonomy

Bat coronaviruses in china

Viral metagenomics revealed sendai virus and coronavirus infection of malayan pangolins (manis javanica)

Mental health and coping with stress during covid-19 pandemic

Olivia health analytics platform

Private kits: Safepaths; privacy-by-design covid19 solutions using gps+bluetooth for citizens and public health officials

Apps gone rogue: Maintaining personal privacy in an epidemic

Covid-19/coronavirus real time updates with credible sources in us and canada

Covidnet: To bring the data transparency in era of covid-19

Novel coronavirus 2019 dataset

Covid-19 open research dataset challenge (cord-19)

A county-level dataset for informing the united states' response to covid-19

Comparing and integrating us covid-19 daily data from multiple sources: A county-level dataset with local characteristics

Using reports of symptoms and diagnoses on social media to predict covid-19 case counts in mainland china: Observational infoveillance study

Predicting mortality risk in patients with covid-19 using artificial intelligence to help medical decision-making

Spatiotemporal dynamics, nowcasting and forecasting of covid-19 in the united states

Covid-19 simulator

Learning to forecast and forecasting to learn from the covid-19 pandemic

Fast and accurate forecasting of covid-19 deaths using the sikja model

Arima-based forecasting of the dynamics of confirmed covid-19 cases for selected european countries

Initial simulation of sars-cov2 spread and intervention effects in the continental us

Us census demographical data

Us mortality rates by county

Us county-level mortality

Us county-level trends in mortality rates for major causes of death

Diversity index of us counties

Us drought monitor

United states droughts by county

County presidential election

Icu beds by county in the us

Us household income statistics

A weekly summary of us covid-19 hospitalization data

Laboratory-confirmed covid-19 associated hospitalizations

State statistics

Us data for download

Principal component analysis

Deep learning and the information bottleneck principle

4 reasons coronavirus is hitting black communities so hard

Covid-19 data in the united states

Us facts dataset