title: Generated Time-series Prediction Data of COVID-19's Daily Infections in Brazil by Using Recurrent Neural Networks
authors: Hawas, Mohamed
date: 2020-08-19
journal: Data Brief
DOI: 10.1016/j.dib.2020.106175

In light of the COVID-19 pandemic that has struck the world since the end of 2019, many endeavors have been carried out to overcome this crisis. Taking uncertainty into consideration as a feature of forecasting, this data article introduces long-term time-series predictions of the virus's daily infections in Brazil, produced by training forecasting models on limited raw data (30 time-steps and 40 time-steps alternatives). The primary reuse potential of this forecasting data is to enable decision-makers to develop action plans against the pandemic, and to help researchers working in infection prevention and control to: 1) explore the use of limited data in predicting infections; 2) develop a reinforcement learning model on top of this data lake, which can perform an online game between the trained models to generate a new model capable of predicting future true data. The prediction data was generated by training 4200 recurrent neural networks (54 to 84 days validation periods) on raw data from Johns Hopkins University's online repository, to pave the way for generating reliable extended long-term predictions.

• This data is useful because it provides researchers with a solid foundation for predicting COVID-19's daily infections while depending on very limited data (30 time-steps and 40 time-steps alternatives). This facilitates the prediction process by reducing the crucial need for the big data that is usually required to train deep recurrent neural networks.
• Government institutions and infection control units can utilize and filter this data to estimate and develop the required action plans. Researchers can also use it in spatial models related to the geographical distribution of the pandemic and the increase in numbers.
• The data generated by the models represents a data lake that can help researchers build a reinforcement learning model that learns how to classify and select the fittest models against upcoming infection rates. More importantly, a reinforcement learning model that adjusts and combines the best weights of these trained models can be developed to construct a new prediction model for extended long-term prediction purposes.
• Real numbers of infected people are higher than reported, as there is a worldwide limited capacity to test more people; therefore, this data helps in modelling real numbers more accurately in studies that estimate infection prevalence.
• This data can be very useful for social scientists who aim at developing analysis frameworks that compare and study the social structure and socio-economic conditions in different countries and cultures, and the relation of these factors to the prevalence of the pandemic.

In the past five months, many non-medical data articles have focused on critical aspects of transition-related analysis and data of the COVID-19 case, such as infection data collection, filtering, geographical mapping of infections [1], [2], and forecasting.
In that sense, forecasting studies have utilized various approaches, such as exponential smoothing models [3], estimation of the daily reproduction number [4], Susceptible-Infectious-Recovered-Dead (SIRD) models [5], ARIMA models [6], [7], and nonlinear autoregressive artificial neural network (NARANN) models [7]. Furthermore, several approaches depended on using more or different data, such as the SEIRQ model [8], which involved using seven categories of data, one of which is the number of infected people. Similarly, a research study for predicting cases in China depended on extra data from the SARS and MERS diseases fitted on an exponential growth model [9]. Likewise, an improved version of the Susceptible-Exposed-Infectious-Recovered (SEIR) model was used with extra data about the intervention and quarantine strategies against the pandemic [10]. This data article, on the other hand, focuses on long-term forecasting of the daily infections in Brazil by using limited data of infection numbers only, on a recurrent neural network structure that uses the Gated Recurrent Unit (GRU) mechanism, which is similar to the Long Short-Term Memory (LSTM) mechanism. The choice of the 30- and 40-day time windows was made experimentally, as explained in the methods section. Although there is an accompanying uncertainty in the generated predictions, the evaluation of the models' performance over various durations (not less than 54 days, as shown in the metadata in Table 1) shows the possibility of achieving long-term predictions by using limited data. In that sense, the generated predictions show polynomial trendlines between orders two and four for the case of Brazil, rather than developing a non-stopping exponential or power trend that grows indefinitely. These RNNs are boosted by an online adversarial linear regression evaluation function, which performs a day-by-day correction of the models' generated data over the whole duration. The predictions are made until 2020-10-01; this date was selected based on the news of an early vaccine release by September 2020 by the British-Swedish pharmaceutical company AstraZeneca [11], among many other pharmaceutical firms such as Moderna, which announced manufacturing 100 million doses by the 3rd quarter of 2020 [12]. Similarly, the selection of Brazil is based on the importance of predicting the infection numbers in one of the world's most populous countries, which is currently ranked 2nd in the world regarding the number of total infections with COVID-19 [13] (2020-07-13). Moreover, choosing the second-ranked country prioritizes having no external factor that could substantially influence the pandemic's behavior, which is possible in the case of the USA (ranked 1st [13]), where massive demonstrations occurred during June and July 2020 [14], and this would require different measures to neutralize the influence. In that sense, 4200 deep recurrent neural networks (RNNs) were built with the Keras library in a Python environment by using limited data (30 time-steps and 40 time-steps) from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [13], which includes infection numbers from January 22, 2020, to several dates indicated in the metadata in Table 1. These trained models are categorized as follows: 3173 in deterministic mode, 20 in non-deterministic mode, 1001 in non-deterministic mode for validation, 1 model as a control group for validation by using the data of India, and 5 models for showing the performance of different time-steps.
The csv file that includes the training data was accessed, processed, cropped, and divided without scaling into training/testing/evaluation sets. Subsequently, the long-term prediction data in this article was generated using the trained models. The deterministic setup is explained in the methods section; the metadata, inference settings, and training settings are located in csv and json files, which can be accessed at the data repository. Overall, the data in this article consists of models, prediction tables, graphs, settings tables, and a Python notebook. The data is split into two datasets, which are uploaded to two online data repositories. Dataset one includes the model files, predictions, graphs, and metadata, and Dataset two includes the code as one Jupyter Notebook file in the Python programming language. Dataset one is divided into three folders: 1) (Deterministic mode), 2) (Non-deterministic mode), 3) (Technical validation), and one compressed zip file that includes all files and folders in the dataset.

Table 1. Metadata of the training and evaluation periods.
Start date for training data: 07-04-2020; 07-04-2020; 07-04-2020; 08-03-2020; 07-04-2020
End date for training data: 06-05-2020; 06-05-2020; 06-05-2020; 06-04-2020; 06-05-2020
Start date for evaluation data: 07-05-2020; 07-05-2020; 07-05-2020; 07-04-2020; 07-05-2020
End date for evaluation data: 29-06-2020; 29-06-2020; 29-06-2020; 13-06-2020; 29-06-2020; 11-07-2020
Duration of evaluation data: 54 days; 54 days; 54 days; 68 days; 84 days; 66 days
Start date for training process: 28-06-2020; 28-06-2020; 03-07-2020; 13-06-2020; 12-07-2020
End date for training process: 30-06-2020; 01-07-2020; 03-07-2020; 13-06-2020; 12-07-2020

In Dataset one, the Deterministic folder contains two folders: 1) 30 time-steps, and 2) 40 time-steps. The Non-deterministic and Technical validation folders contain data for 30 time-steps. The generic structure of each time-steps subfolder contains the following four folders:
1. (Predictions) folder, including prediction tables (to 2020-10-01) in (.csv) extension. Each csv file includes the predicted daily infections table generated by one trained model. Each table includes three columns: 1) date, 2) model prediction, 3) evaluated prediction. The folder includes an extended evaluation for the best model till 2020-07-11.
2. (Graphs) folder, including prediction graphs (to 2020-10-01) in (.pdf) extension. Each graph describes the model's performance against the true data. The graphs mainly include four highlighted trendlines: 1) prediction for the period of the time-steps, 2) prediction for the whole validation period, 3) prediction for four steps only, and 4) prediction to the target date. The folder includes an extended evaluation for the best model till 2020-07-11.
3. (Settings) folder, including settings files in (.json) extension. This folder includes all the settings that were used for training the models. The settings folder includes a (Metadata) folder, which comprises four tables in (.csv) extension: 1) (dates_info_to_2020-10-01.csv), which includes the dates of the training and prediction processes; 2) (models_accuracy_settings_to_2020-10-01.csv), which includes the models' settings and accuracy; 3) (gen_data_info_to_2020-10-01.csv), which includes counts of the generated data; 4) (best_model_accuracy_settings_to_2020-10-01_.csv), which includes the settings and accuracy of the best model.
4. (Trained models) folder, including model files in (.h5) extension. These files are the trained models and they can be used for inference through the predict function in the Python notebook.
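For reuse, a trained model file and its corresponding prediction table can be opened directly. The following is a minimal sketch, assuming the standard Keras and pandas APIs; the file names are hypothetical placeholders for files from the (Trained models) and (Predictions) folders, not actual file names in the repository.

```python
# Minimal reuse sketch (not the dataset's own notebook): load one trained
# model and one prediction table. File names are hypothetical placeholders.
import pandas as pd
from tensorflow import keras

model = keras.models.load_model("model_trained_example.h5")  # placeholder name
model.summary()                                              # inspect the architecture

predictions = pd.read_csv("predictions_example.csv")         # placeholder name
# Expected columns per the data description: date, model prediction,
# evaluated prediction.
print(predictions.head())
```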
The Technical validation folder includes an additional folder named (Control group) for the model of India, as well as an additional folder that includes the data for the performance of different time-steps, till 2020-06-29. The following tables and figures give a brief overview of the results. First, Table 2 shows the 10 best performing models in deterministic mode 1, Figure 1 shows the graph of the third-best performing model, which does not show a non-stopping exponential growth, and Table 3 shows the settings of this model. Figure 2 shows different trends in this mode. Similarly, Table 4 shows the 10 best performing models in deterministic mode 2, Figure 3 shows the graph of the second-best performing model in this mode, which shows a polynomial growth (the first and second models shown in Table 4 are identical in performance, both showing a non-stopping exponential growth, thus they are ranked together as first), and Table 5 shows the settings of this model. Figure 4 shows different trends in this mode. Thirdly, Table 6 shows the 10 best performing models in the non-deterministic mode, Figure 5 shows the graph of the second-best performing model, which shows a polynomial growth, and Table 7 shows the settings of the best performing model in the non-deterministic mode. Figure 6 shows different trends in this mode.

To ensure the reliability of the data, technical validation of the data was performed. The csv file of daily COVID-19 infection numbers was accessed from Johns Hopkins University's COVID-19 Data Repository, processed, cropped, and divided into training/testing/evaluation sets. Predominantly, allowing a relatively large evaluation period (not less than 54 days) gave a better understanding of the models' generalizability without new data; this is particularly important when tackling time-sensitive prediction processes with limited data as a basis for decision-making. The evaluation periods are shown in the metadata in Table 1. In addition, generating various models within a range of settings variations serves the exploration of the most efficient settings, which would benefit future applications and potential reuse of the dataset. In that sense, after carefully inspecting the trendlines in the generated graphs, the data shows that a certain degree of uncertainty is reflected in the generated models while maintaining a logical forecast of the future that does not involve a non-stopping exponential growth trend, regardless of the 85% accuracy of the validation model. The 85% accurate validation model in Figure 7 shows a non-stopping exponential growth trend that fits the data over 68 days till 2020-06-13, with fast growth in the time-series predictions beyond this date. This rapid growth amounts to a failing long-term prediction of 108,095,368 total infected people over the 110 days from 2020-06-14 till 2020-10-01, which is almost half of Brazil's population [15]. The trend associated with the true data, on the other hand, approximates a polynomial trendline. Moreover, when testing this accurate model against new true data till 2020-06-29, the accuracy drops to a maximum of 68% (84 days evaluation), while maintaining the same steep exponential growth trendline.
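The trendline comparison described above (polynomial orders two to four versus a non-stopping exponential trend) can be reproduced numerically on any prediction table. Below is a minimal sketch assuming only NumPy; the series used is a placeholder with a polynomial shape, not data from the repository.

```python
# Minimal sketch of comparing polynomial and exponential trendline fits to a
# predicted daily-infections series (placeholder data, not repository output).
import numpy as np

days = np.arange(1, 101)                      # 100 prediction days
daily = 1000 + 50 * days + 0.8 * days**2      # placeholder series, polynomial shape

# Polynomial trendlines of order 2 to 4.
for order in (2, 3, 4):
    coeffs = np.polyfit(days, daily, order)
    fitted = np.polyval(coeffs, days)
    ss_res = np.sum((daily - fitted) ** 2)
    ss_tot = np.sum((daily - daily.mean()) ** 2)
    print(f"order {order}: R^2 = {1 - ss_res / ss_tot:.4f}")

# Exponential trendline, fitted as a line in log space: log(y) = a + b*t.
b, a = np.polyfit(days, np.log(daily), 1)
fitted_exp = np.exp(a + b * days)
ss_res = np.sum((daily - fitted_exp) ** 2)
ss_tot = np.sum((daily - daily.mean()) ** 2)
print(f"exponential: R^2 = {1 - ss_res / ss_tot:.4f}")
```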
The period between 2020-05-31 and 2020-06-29 (30 time-steps), as indicated in Table 6, shows that the model adopts a fast growth pattern that cannot maintain logical long-term forecasting. The fast growth pattern is repeated on dates before 2020-05-31 and after 2020-06-29. This pattern worked as a guideline for identifying logical models that might be less accurate than the 85% of this validation model, but are still more applicable than the validation model. The major difference between this validation model and the other models in the dataset is the cropping date of the data. This cropping point filters the data fed to the validation model to start from 2020-03-08 and end at 2020-04-06. The error, a non-stopping exponential growth, can be reproduced by setting the crop-point of the data to day 80 since 2020-01-22. Overall, this validation model provides three main indicators: 1) around 3 months of validation indicate that the exponential growth pattern can fit true data on a long-term basis (more than the time-steps used as input to the model, e.g. 15 or 20 time-steps, and not more than a month), but it becomes unreliable when it comes to extended long-term predictions; this suggests that a polynomial trendline of order 2 to 4 is more likely to describe the data. 2) Crop-points influence the accuracy and the growth pattern of the trendlines in the models, which is the reason behind testing many crop-points. 3) The exponential growth generated by the models can be controlled and reduced by adjusting the crop-point to day 110. Figure 7 shows the prediction graph for one of the validation models. The complete validation dataset includes 1001 models; by taking 358 models as a sample, which gives a 95% confidence level and a 4.15% margin of error, Table 9 shows that long-term prediction is achievable during 68 days despite the exponential growth afterward, as we can be positive that 77%, 81%, 64%, and 1% of the models can achieve more than 50%, 60%, 70%, and 80% accuracy, respectively, by using the same settings. Nevertheless, using such settings allowed a non-stopping exponential growth pattern.
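The sample size of 358 models quoted above is consistent with the standard finite-population sample-size formula at a 95% confidence level and a 4.15% margin of error. A quick check, assuming the usual z = 1.96 and the most conservative proportion p = 0.5:

```python
# Quick check of the 358-model sample size for the 1001-model validation set,
# using the standard finite-population sample-size formula.
N = 1001          # population: validation models
z = 1.96          # z-score for a 95% confidence level
p = 0.5           # most conservative proportion
e = 0.0415        # 4.15% margin of error

n0 = (z**2) * p * (1 - p) / e**2   # infinite-population sample size (~557.6)
n = n0 / (1 + (n0 - 1) / N)        # finite-population correction
print(round(n))                    # -> 358
```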
On the other hand, the deterministic and non-deterministic datasets have fallen into the second and third categories (the models scored an accuracy of 50% to less than 60%, and of 60% to less than 70%). However, two factors have a great influence over the results. The first factor involves the uncontrolled randomness that was eliminated to allow limited reproducibility throughout the training session; this might have limited the achievement of higher accuracy. As a result, the desired polynomial curve was achievable at the expense of higher accuracy, due to the controlled randomness. Consequently, the controlled randomness mostly provides the phases of the logistic function: the initial lag phase (slow growth), followed by near-exponential growth, the transitional phase (the slowdown), the saturation phase, and then the maturity phase (a plateau or stationary phase), where the growth stops. In that sense, the second factor that influenced the results is the crop-point of the data (a source of randomness itself), as the numbers of the input data (the daily infection numbers) contain an inherent randomness that could lead to a pattern that translates into higher accuracy. This particularly means that there is a certain correlation between the numbers that allows better forecasting at a certain crop-point and worse predictions at another crop-point while using the same settings. However, as this validation dataset is non-deterministic, meaning that there is a certain degree of randomness, the most controlling factor is the crop-point, since the other three datasets were tested with similar settings except for the crop-point (the two deterministic datasets have shown results similar to the non-deterministic dataset in terms of growth). This highlights that the crop-point, as a source of randomness, exceeds in influence the initial randomness created by the neural network to initialize its weights. Consistently, this pattern or correlation can be overcome by allowing a higher level of randomness and much greater variations of settings, but this can be computationally impractical and time-consuming. Therefore, the objective of creating a data lake with many variations in data points and trends, in order to develop a leading hybrid reinforcement learning model on top of this data lake, is more feasible in terms of speed and efficiency, as these variations can ease the process of assigning or clustering the models into reward and penalty classes in the deep reinforcement learning model.

Further validation has been performed over the three best models in the three modes reported above (deterministic mode 1, deterministic mode 2, and the non-deterministic mode); Table 10 shows the updated performance till 2020-07-11 (columns: accuracy till 2020-06-29 and accuracy till 2020-07-11).

Table 10. Evaluating performance till 2020-07-11. Source: Author

The changes in performance are limited, but they reflect the unstable frequency that is inherent in the true data. This unstable frequency can have a natural cause related to the pandemic, or it can reflect difficulties due to high numbers of infections and their implications for the health care system. Overall, this proves the need for highly varied predictions for the case of Brazil, as in this dataset, to get a clearer picture of the situation. On the other hand, India was chosen as a control group for validation, to compare against Brazil in terms of the model's accuracy by using the same RNN architecture and the same crop-point. In the case of India, a model was capable of achieving 95% accuracy till 2020-07-11 (61 days); the complete report of the accuracy achieved by using different settings is shown in Table 11.

Table 11. Evaluating performance of India as a control group till 2020-07-11. Source: Author

The performance of the validation over India shows that the exponential trendline is reliable in long-term predictions that seek accurate performance for periods that do not exceed two months, which can be very useful to decision-makers; however, it is possibly not credible for longer periods. These accuracy values particularly mean that the inherent randomness in the true input numbers is influencing the results. As well, the unique case and situation of every country are reflected in the results of its models. However, this also means that extended long-term predictions (more than two months) can become a possibility while using limited data, which contributes to validating the accuracy of the models in Brazil's case. Lastly, several options were tested before starting the wide-scale training and saving of the models, and they influenced the training:
1. Although scaling is a normal procedure in feature engineering for RNN applications, it was found through several validations that it reduced the accuracy.
2. The code can generate overlapping time-series sequences to enlarge the training dataset; however, more accurate models were obtained by using the smaller data.
3. It was found that shuffling the data during the training of the models helped the training to generalize the results.
4. Filtering the raw training data that is fed to the models proved to be essential to avoid misleading the models with the 'zero' instances that appeared early, since 22-01-2020.
5. The non-deterministic mode can, within a certain range, provide higher accuracy due to randomness; however, the patterns of the trendlines are similar between the two modes.

The models were built in the Python programming language with the Keras deep learning library on the Google Colab platform, and they depend only on numerical data of daily infections in each country; no other data has been added to the models except for the time-series input. Mainly, to prepare the data and perform feature engineering, the (gen_rnn_data) function divides the raw dataset in a csv file (the training and evaluation data) into training and testing sets, with options to scale the data and to generate overlapping time-series data to enlarge the dataset. These two options were found to be not useful during the training of the models, so they were not used. No feature engineering was implemented except for cropping the data, to create the validation dataset and to avoid the very small numbers (zero, one, or two) that appeared from the starting point of the data (22-01-2020) till March. The cropping points are indicated in Table 1. The architecture of the models is based on predicting 15/20 time-steps (Y) by using 15/20 time-steps for training (X), which is equivalent to using 30/40 time-steps in total for training and testing the models. The choice of these time windows in the case of Brazil was determined experimentally: it was found that smaller time-steps such as 5 (5 for each of X and Y) and 10 (10 for each of X and Y) produced less generalizability, which can also be the case with higher values such as 25. Generally, the accuracy of all models was calculated using three R-squared values (r2, the coefficient of determination): 1) the model's accuracy over the time-steps, 2) the model's accuracy over the whole evaluation period, and 3) the sum of both.
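As an illustration of this windowing and evaluation scheme, the following is a minimal sketch (not the dataset's (gen_rnn_data) or evaluation code) that slices a daily-infections series into 15-step input/output windows and computes an R-squared score, assuming scikit-learn is available; the series and prediction are placeholders.

```python
# Minimal sketch of 15-in / 15-out windowing and R-squared evaluation
# (illustrative only; not the dataset's gen_rnn_data implementation).
import numpy as np
from sklearn.metrics import r2_score

def make_windows(series, n_in=15, n_out=15):
    """Slice a 1-D daily series into (X, Y) windows of n_in and n_out steps."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        Y.append(series[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(Y)

daily = np.arange(100, 160, dtype=float)   # placeholder daily infections (60 days)
X, Y = make_windows(daily)                 # X: (31, 15), Y: (31, 15)
print(X.shape, Y.shape)

# Accuracy as the coefficient of determination between true and predicted steps.
y_true = Y[0]
y_pred = y_true + np.random.normal(0, 1, size=y_true.shape)  # placeholder prediction
print(r2_score(y_true, y_pred))
```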
The main functions used to train the models are:
• (train_rnn): A recurrent neural network was chosen as the main architecture of the models, as it can handle complex relations in time-series forecasting problems. For that, by using the Keras deep learning library, the (train_rnn) function was developed to construct this RNN, choosing one of two RNN types, 1) LSTM (Long Short-Term Memory) or 2) GRU (Gated Recurrent Unit), along with a convolutional layer within a composite autoencoder neural network. The choice of the GRU-CNN composite autoencoder is based on two points: 1) GRU can provide faster convergence and higher performance than LSTM, due to the reduced number of parameters and the simpler architecture, and 2) the boosted performance of a hybrid CNN-RNN architecture, which outperforms an RNN alone. The latter can be explained by both the CNN's capability of feature learning, identifying the important features in the input sequence as the first layer, and the RNN's capacity to detect temporal dependencies in the input, which enables efficient forecasting of multivariate time-sequences. These two architectures were merged without batch normalization, which is part of the basic block of CNNs and FCNs in certain applications such as computer vision and time-series classification, as it removes (normalizes) the noise in the data, which in return could destroy the dependencies or relations between the inputs to the GRU autoencoder. The hybrid architecture ends with a dense layer and uses a linear activation function for generating numbers, rather than the categorical activation functions such as softmax or sigmoid used in traditional CNNs and FCNs configured for classification tasks (a minimal sketch of such an architecture follows this list).
• (predict): After training a model, this recursive function can specify a threshold to include several time-step predictions, as shown in Figure 8 (e.g. to keep just the first 2 time-step predictions out of the 15). It then removes the rest, appends the selected predictions to the end of the previous input sequence that generated this output while removing the same number of values from the beginning of the input, and uses the updated input to re-predict the next 15 steps. This loop can be repeated depending on how many time-steps we want to generate. The function allows us to recognize the inherent randomness in the data, to explore how the model behaves at every threshold and up to which limit it is accurate, so that we can later select the best threshold that fits the unseen evaluation data (see the recursion sketch after this list).
• (adversarial_evaluation): This function was developed to take the output of the model and train it against the daily updated actual numbers, to evaluate the predicted numbers as near as possible to the actual numbers and increase the generalizability of the models.
• (predict_select_model): This function can be used to test the performance of the models against many thresholds and against each other; it then returns the best model with the best settings.
• (recursive_train_predict): This function is the user's interface for the training and prediction process.
• Helper functions, including: 1) (eval_predictions), used to evaluate the performance of the models; 2) (plot_predictions), used to plot the graphs; 3) (train_ml), used within the (adversarial_evaluation) function to perform the linear regression operations; 4) (save_tables), used to save the settings and accuracies of the models into tables of different formats and lengths; 5) (get_settings), used to extract and process the settings of the models; 6) (save_data_structure), used to save the folders and files correctly; 7) (select_models_acc), used to filter models based on an accuracy threshold; 8) (select_best_models_acc), used to filter out the best models; and 9) (random_seed_changer), used to change and select a specific random seed across the libraries used in the Python environment, to ensure the reproducibility of the results and of the models' weights during the training session.
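As referenced in the (train_rnn) item above, the following is a minimal Keras sketch of a CNN-GRU composite autoencoder of this kind: a convolutional first layer, a GRU encoder-decoder without batch normalization, and a linear dense head mapping 15 input time-steps to 15 output time-steps. It is an illustration under those assumptions, not the dataset's (train_rnn) implementation, and the layer sizes are arbitrary.

```python
# Minimal sketch of a CNN-GRU composite autoencoder for 15-in / 15-out
# forecasting (illustrative; not the dataset's train_rnn implementation).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_gru_autoencoder(n_steps_in=15, n_steps_out=15, n_features=1, units=64):
    model = keras.Sequential([
        # Convolutional first layer: learns local features of the input window.
        layers.Conv1D(filters=32, kernel_size=3, activation="relu",
                      input_shape=(n_steps_in, n_features)),
        # GRU encoder compresses the sequence into a fixed-length state.
        layers.GRU(units),
        # Repeat the state for each output step, then decode with another GRU.
        layers.RepeatVector(n_steps_out),
        layers.GRU(units, return_sequences=True),
        # Linear dense head, since the task is regression on infection counts.
        layers.TimeDistributed(layers.Dense(1, activation="linear")),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_cnn_gru_autoencoder()
X = np.random.rand(8, 15, 1)   # placeholder input windows
Y = np.random.rand(8, 15, 1)   # placeholder target windows
model.fit(X, Y, epochs=2, verbose=0)
```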
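Similarly, the threshold-based recursion described for the (predict) function can be sketched as follows, assuming a Keras-style model whose predict call returns 15 steps at a time; again this is an illustrative approximation, not the notebook's own (predict) function.

```python
# Illustrative sketch of threshold-based recursive forecasting: keep only the
# first `threshold` of the 15 predicted steps, slide the input window forward
# by that amount, and re-predict until `horizon` steps have been generated.
import numpy as np

def recursive_forecast(model, last_window, horizon, threshold=2, n_steps_out=15):
    window = np.array(last_window, dtype=float)   # shape: (n_steps_in,)
    generated = []
    while len(generated) < horizon:
        x = window.reshape(1, len(window), 1)     # (batch, steps, features)
        pred = model.predict(x, verbose=0).reshape(-1)[:n_steps_out]
        kept = pred[:threshold]                   # keep only the first steps
        generated.extend(kept.tolist())
        # Append the kept steps and drop the same number from the front.
        window = np.concatenate([window[len(kept):], kept])
    return np.array(generated[:horizon])

# Usage (hypothetical): forecast = recursive_forecast(model, last_15_days,
#                                                     horizon=90, threshold=2)
```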
A note on the (random_seed_changer) function: we should not confuse deterministic settings with replicability, as the former does not necessarily lead to the latter. Generally, there are several sources of randomness, since different hardware and different versions of the software libraries used in deep learning can produce randomness even when deterministic settings are used. There are four main sources of randomness: 1) the GPU's numerical operations are mostly non-deterministic, so a single-threaded CPU was used; 2) the programming environment and the initialization of the neural network can be random, therefore specific random values (seeds) were specified during the training session; 3) different versions of the deep learning libraries can give different results with the same settings, which was addressed by using the "2.3.0-tf" version; 4) training on different hardware can introduce randomness, thus the models were trained on Google's Colab platform in the time windows indicated in Table 1. This deterministic setup serves the procedural objective of the dataset, which is to report the settings used to produce all models trained in a training session while using the same software and hardware. This allows future work to evaluate statistical significance on the basis that no factors influenced the training except the changed settings of the models, which is essential later for the uniform training of a reinforcement learning (RL) model. It also provides further control over the stochastic process in the neural network, by using the stochastic data that was regulated by deterministic settings during its training process as training data for an RL model, which is the major objective of the dataset.
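As a sketch of the kind of seed fixing this setup relies on (illustrative only; not the dataset's (random_seed_changer) function), the standard Python, NumPy, and TensorFlow seeding calls look like this:

```python
# Illustrative seed fixing across the Python, NumPy and TensorFlow random
# number generators, so repeated runs initialize the network identically.
import os
import random
import numpy as np
import tensorflow as tf

def seed_everything(seed: int = 42) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash-based operations
    random.seed(seed)                         # Python's built-in RNG
    np.random.seed(seed)                      # NumPy RNG
    tf.random.set_seed(seed)                  # TensorFlow / Keras initialization

seed_everything(42)
```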
** The 20 models in the non-deterministic mode were all trained in one session. The Google Drive File Stream service, which syncs files automatically from the training session on Google Colab to Google Drive, created the designated folder on 3 July but recorded the correct creation date (3 July) for only 9 of the 20 files, along with an incorrect modified date of 25 June for these 9 files, which even shows as 24 June when downloaded. This error can be noticed in the compressed zip file. However, the code already included a working second layer of protection against such errors: there is an internal collective settings dictionary file, created on 3 July, that states exactly all the settings used at the beginning of the training process for each model, before the individual settings files are generated. The dictionary clarifies that the actual creation of the settings used to initialize each training session of the 9 models occurred on 3 July, as this dictionary-creation code is responsible for generating the universally unique identifiers used as the models' naming convention.

The author declares that he has no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

References
[1] Epidemiological data from the COVID-19 outbreak, real-time case information.
[2] An interactive web-based dashboard to track COVID-19 in real time.
[3] Forecasting the novel coronavirus COVID-19.
[4] COVID-19 outbreak reproduction number estimations and forecasting in Marche, Italy.
[5] Data-based analysis, modelling and forecasting of the COVID-19 outbreak.
[6] Application of the ARIMA model on the COVID-2019 epidemic dataset.
[7] Forecasting the prevalence of COVID-19 outbreak in Egypt using nonlinear autoregressive artificial neural networks.
[8] Evaluation and prediction of the COVID-19 variations at different input population and quarantine strategies, a case study in Guangdong province, China.
[9] A data-driven analysis in the early phase of the outbreak.
[10] Prediction of the COVID-19 spread in African countries and implications for prevention and control: A case study in South Africa.
[11] AstraZeneca advances response to global COVID-19 challenge as it receives first commitments for Oxford's potential new vaccine.
[12] Moderna and Catalent Announce Collaboration for Fill-Finish Manufacturing of Moderna's COVID-19 Vaccine Candidate.
[13] COVID-19/time_series_covid19_confirmed_global.csv at master CSSEGISandData/COVID-19.
[14] Black Lives Matter May Be the Largest Movement in U.S. History.
[15] Population Projection.