key: cord-0793796-5rrlj4l6 authors: Xu, Lu; Magar, Rishikesh; Farimani, Amir Barati title: Forecasting COVID-19 new cases using deep learning methods date: 2022-02-23 journal: Comput Biol Med DOI: 10.1016/j.compbiomed.2022.105342 sha: 4989aac6e884db544ce03aa7c14db1059319bdca doc_id: 793796 cord_uid: 5rrlj4l6

Nearly two years after the first identification of the SARS-CoV-2 virus, the surge in cases driven by virus mutations is a cause of grave public health concern across the globe. As a result of this health crisis, predicting the transmission pattern of the virus is one of the most vital tasks for preparing for and controlling the pandemic. In addition to mathematical models, machine learning tools, especially deep learning models, have been developed with great success to forecast the trend in the number of patients affected by SARS-CoV-2. In this paper, three deep learning models, namely CNN, LSTM and CNN-LSTM, have been developed to predict the number of COVID-19 cases for Brazil, India and Russia. We also compare the performance of our models with previously developed deep learning models and notice significant improvements in prediction performance. Although our models have been used only for forecasting cases in these three countries, they can be easily applied to datasets of other countries. Among the models developed in this work, the LSTM model has the highest forecasting performance and shows an improvement in forecasting accuracy compared with some existing models. The research will enable accurate forecasting of COVID-19 cases and support the global fight against the pandemic.

The rampant spread of the COVID-19 pandemic has resulted in huge economic losses, loss of human life and disruption of normal public life across the globe (Jacobsen, 2020). According to the World Health Organization (WHO), over 200 million people have been infected by the SARS-CoV-2 virus worldwide ("World Health Statistics 2021," 2021).
The virus is known to transmit between people through respiratory routes during human mobility (Wang and Liu, 2021), increasing its transmissibility and making the general public susceptible. This correlation between human mobility and transmissibility of the virus has led governments across the world to impose measures such as mandatory face coverings, social distancing, closing public transportation, schools and restaurants, and avoiding gatherings (Durán-Polanco and Siller, 2021). The enforcement of such policies has helped in arresting the spread of the virus, yet its highly contagious nature, coupled with the evolution of dangerous mutations, has continued to ravage public health. With the increasing number of patients, medical supplies often fall short of demand, burdening the health care systems and professionals in many countries (Miethke-Morais et al., 2021). Thus, understanding the spread and reliably forecasting the trends is one of the most crucial elements in preventing the spread of the pandemic, particularly in countries with a large population like India. Reliable forecasts of the COVID-19 spread can help predict outbreaks and increase the preparedness of governments in tackling the pandemic. Moreover, accurate forecasting can provide feedback on whether an undertaken policy is effective in alleviating the stress on the healthcare system of that country. It also allows governments to evaluate mitigation strategies and regulate policies based on the forecasts for the areas of concern. For example, by applying mathematical models such as the SIR and SEIR models, researchers have successfully predicted the reproduction parameter of COVID-19 in Indonesia for the early prevention of the pandemic, reinforcing the need for reliable forecasting models (Annas et al., 2020). Recently, machine learning models have been extensively used for forecasting and can be especially useful for pandemic planning.
In this study, we develop a deep learning approach to forecast the pandemic trend for three countries: Brazil, India and Russia. These are among the top-10 most heavily affected countries worldwide and have been widely studied by healthcare experts. We implement three different deep learning models, namely a simple convolutional neural network (CNN), long short-term memory (LSTM) and convolutional neural network-long short-term memory (CNN-LSTM), to predict the number of cases and forecast the spread of COVID-19. The prediction performances of the three models are evaluated using the mean absolute error (MAE), R² score and explained variance (EV) score. The LSTM and CNN-LSTM models perform comparably and have the lowest MAE for the countries that we consider in our study. Moreover, the LSTM model we developed outperforms some of the previously developed models (Zeroual et al., 2020), and hence we use it for forecasting the COVID-19 cases a week into the future. Using our model, we also reliably forecast the number of cases for the next 14 days, outside the training and test datasets. Our ML models incorporate additional features, such as the different governmental policies, yielding a more informed deep learning-based forecasting model than previous works. Our models contribute to the variety of tools available for COVID-19 forecasting, and we believe that they can help improve pandemic preparedness and tackle the pandemic more effectively.

Machine learning models have been successfully used to understand various aspects of the pandemic: designing antibodies, detecting whether a patient is infected by SARS-CoV-2 from medical image datasets, notably chest X-rays (Toğaçar et al., 2020), modeling and understanding mutations (Mullick et al., 2021; Y. Wang et al., 2020), and forecasting the trends of the pandemic.
In addition, some short-term forecasting methods have been applied to predict the development of the pandemic. These include the α-Sutte Indicator, which is widely used to predict stock prices based on the previous days' data (Attanayake and Perera, 2020); SutteARIMA, which averages the forecasting results of the α-Sutte Indicator and ARIMA (Ahmar and Boj, 2020); and Holt-Winters, which can capture three important aspects of time-series data: the average, trend and seasonality (Sharma and Nigam, 2020). In this work, we focus on forecasting the pandemic trends for Brazil, India and Russia because these countries have been widely studied, which allows us to compare our results with those of previous studies. Previous machine learning forecasting studies pertinent to these countries are noted in this section.

Brazil, being one of the countries most heavily affected by the pandemic, has been widely studied by researchers. Ribeiro et al. used autoregressive integrated moving average (ARIMA), cubist regression, random forest, ridge regression, SVR and stacking-ensemble learning to analyze the cumulative confirmed cases in Brazil (Ribeiro et al., 2020). Comparing the forecasting performance, they concluded that SVR performs the best, with an error of less than 6.9%. Another study, trained on limited data of 30 days and 40 days, respectively, predicted the rate of spread in Brazil using the Gated Recurrent Unit (GRU) (Hawas, 2020). The highest accuracy of 85% was achieved with the time-step of 30 days using the validation data from 4/7/2020 to 6/13/2020. However, the accuracy drops markedly (to a maximum of 68%) as the prediction period increases, indicating that the model behaves relatively poorly over a long time range.

Apart from Brazil and India, another country that has been widely studied is Russia. Wang et al.
developed an LSTM model to forecast trends of the pandemic 150 days ahead using the daily new confirmed cases in Russia, Peru and Iran (P. Wang et al., 2020). In another study, a Bayesian model was applied to investigate the effects of lockdowns on COVID-19 transmission using data from March 1 to June 29, 2020 in the top five affected countries (India, Brazil, Russia, USA and UK). It was demonstrated that the pace of the outbreak would significantly increase in Brazil, India and Russia once the lockdowns were loosened (Feroze, 2020).

The Center for Systems Science and Engineering (CSSE) at Johns Hopkins University has aggregated COVID-19 case data from 22 January 2020 to date for around 210 countries across the world (CSSEGISandData, 2021). In our study, we analyze the data of three highly impacted countries: Brazil, India and Russia. The trend of the cumulative number of cases for these countries is shown in Fig. S1. To account for the delay between COVID testing, the reporting of results and the updating of cases on the portal, we compute a smoothing 7-day average (Fig. S2) and assign it to the days on which 0 cases were reported. This way, we ensure that the data is stable and that days with no reported cases are eliminated. As input features, we consider governmental measures such as face coverings, restrictions on gatherings, closing public transportation and staying at home, along with the overall stringency index, which ranges from 0 to 1000. We use data from the Our World in Data server, as shown in Table 1 ("Our World in Data," 2021). The input data are normalized as:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ refer to the minimum and maximum of the input data. Apart from the features listed in Table 1, we also use the previous day's data as a feature for the model.

After the data preprocessing, three deep learning models (CNN, LSTM and CNN-LSTM) are implemented. The performances of the three models are compared, and the best-performing model is selected to forecast the cases for the following 7 days. The flowchart of this work is illustrated in Fig. 1.
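The preprocessing steps described above (filling zero-report days with a 7-day average, min-max normalization and adding the previous day's value as a feature) can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and it assumes a trailing window over the preceding days, since the text does not say whether the 7-day average is trailing or centered.

```python
from typing import List, Tuple


def fill_zero_days(daily: List[float], window: int = 7) -> List[float]:
    """Replace zero-report days with the mean of the preceding `window` days.

    Assumption: a trailing window over already-filled values; a zero on the
    very first day is left unchanged because there is no history to average.
    """
    filled = [float(v) for v in daily]
    for i, value in enumerate(filled):
        if value == 0 and i > 0:
            recent = filled[max(0, i - window):i]
            filled[i] = sum(recent) / len(recent)
    return filled


def min_max_scale(values: List[float]) -> List[float]:
    """Scale values with (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def with_previous_day(values: List[float]) -> List[Tuple[float, float]]:
    """Pair each day's value (target) with the previous day's value (feature)."""
    return [(values[i - 1], values[i]) for i in range(1, len(values))]


if __name__ == "__main__":
    raw = [10, 20, 0, 40]  # toy daily counts, not real data
    cleaned = fill_zero_days(raw)
    scaled = min_max_scale(cleaned)
    print(with_previous_day(scaled))
```

In practice the minimum and maximum would normally be taken from the training period only, so that the test period does not leak into the scaling.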
Table 2. We use ReLU as the activation function for the non-linear transformation, and two fully connected layers are implemented at the end of the model (Nair and Hinton, 2010). To train the CNN model, we use the data from January 1 to July 13, 2021. We train the model for 500 epochs and observe convergence, as the loss does not decrease substantially when we train for more than 300 epochs. The details of the hyperparameter optimization are provided in the SI (Table S1 and Figure S4), and the final architectural parameters are presented in Table 2. The plot of the CNN training loss vs. the number of epochs is shown in Fig. 3. We notice that the training loss for Brazil is higher than that for the other two countries. This may be because the number of cases in Brazil fluctuates more than in Russia and India.

In addition to CNN, multiple studies have used the long short-term memory (Hochreiter and Schmidhuber, 1997) Table S2 and Figure S4.

We also investigate the CNN-LSTM model, which takes advantage of both the CNN and LSTM models: the CNN part extracts important features from the data, and the LSTM part learns sequence patterns in the time-series data. Specifically, the CNN first extracts features from the training set through convolutional and pooling layers and generates an embedding. This embedding is then fed as input to the LSTM, which, with its ability to capture time dependencies in the input data, predicts the number of cases. The architecture of the CNN-LSTM model is shown in Fig. 2, and detailed parameters for the model are available in Table 2. We train the CNN-LSTM model for 300 epochs. The training loss curve for the model is shown in Fig. 3.
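Once the three models are trained, the study scores them with MAE, the R² score and the explained variance (EV) score. As a rough, dependency-free sketch of how these scores can be computed from a list of actual and predicted values: the function names are hypothetical, and the EV formula used here, 1 - Var(y - ŷ) / Var(y), is the commonly used definition and is assumed rather than taken from the paper.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def _variance(values: Sequence[float]) -> float:
    """Population variance (divides by n)."""
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = _mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def explained_variance(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Assumed definition: 1 - Var(y - y_hat) / Var(y)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - _variance(residuals) / _variance(y_true)


if __name__ == "__main__":
    actual = [1.0, 2.0, 3.0, 4.0]     # toy values, not real case counts
    predicted = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by exactly +1
    print(mean_absolute_error(actual, predicted))
    print(r2_score(actual, predicted))
    print(explained_variance(actual, predicted))
```

The toy example shows why reporting more than one score is useful: a constant offset in the predictions lowers R² and raises MAE, while the EV score, which ignores the mean of the residuals, stays at its maximum.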
To compare the performance of the three models quantitatively, two evaluation metrics, the mean absolute error (MAE) and the coefficient of determination ($R^2$), are calculated as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$

where $y_i$ is the actual number of cases, $\hat{y}_i$ is the predicted number of cases, $\bar{y}$ is the mean of the actual cases and $n$ is the number of samples. In addition, we use the explained variance (EV) as a metric to evaluate the performance of the models. The model with the lowest MAE and the highest R² and EV scores is considered the best architecture and is used to forecast the COVID-19 transmission. The results of the three models are presented in detail in the following sections.

We analyze the prediction performance of the three deep learning models on data from three countries: India, Brazil and Russia. The models are trained on data from 1 January to 13 July 2021 and evaluated on test data from 14 July to 31 July 2021. The predictions of all the models on the test data are shown in Fig. 4. The prediction performances are evaluated quantitatively using the metrics (MAE, R², EV), and the models are also validated in detail. It is observed that the R² of LSTM and CNN-LSTM is relatively high for cumulative cases prediction, especially for India and Russia (near 1), indicating that the predicted cases closely follow the trend of the true cases (Fig. 5(a)). Similarly, we evaluate the EV score (Fig. 5(b)) and observe that the EV for the LSTM and CNN-LSTM models is usually higher. Although the MAE for CNN-LSTM on the prediction set is somewhat lower, the MAE for LSTM is not very different from that of CNN-LSTM for Brazil and Russia, and LSTM also has a higher R² score for cumulative cases prediction than CNN-LSTM for India and Russia, indicating its strong performance.

Fig. 5. (a) R² score calculated for all models. We use the total cases predicted by the model vs. the actual total cases for calculating the R² score. (b)
We also calculate the explained variance score for all the models for the three different countries.

To help governments prevent and control the pandemic, we calculate the MAE for cases in Brazil when each of the governmental measures is, in turn, left out of the model. According to the results shown in Table S6, the order of importance is restrictions on gatherings, face coverings, closure of public transportation and staying at home. This suggests that governments should place particular emphasis on restricting gatherings to mitigate the spread of COVID-19.

To illustrate the reliability of our models, we have compared our results with those of previous studies using the same data from January 22, 2020 to July 17, 2020 (Zeroual et al., 2020).

In summary, our LSTM model can successfully forecast the trend of the cumulative cases and predict the daily new cases for countries with relatively stable transmission. However, for countries with a rapidly changing number of cases, the model may have difficulty capturing the most recent trends. In this case, we may need to train on a larger quantity of data to achieve more accurate results. To ensure that our analysis is exhaustive, we also performed a similar forecasting analysis for CNN-LSTM, the results of which are shown in Figure S5. The forecasting performance of the LSTM model is slightly superior to that of CNN-LSTM.

In this study, we have implemented three deep learning models and compared their performance in forecasting the COVID-19 cases for three countries: Brazil, India and Russia. All three models successfully capture the transmission trend in each country. We observe that the LSTM model has the best performance based on the evaluation metrics MAE, R² and EV.
We would also like to note that our model shows an improvement, with a reduced error, compared with previous studies that use deep learning for predicting SARS-CoV-2 cases. Using the best-performing LSTM model, we then forecast the COVID-19 cases for 7 days outside the training and test datasets for the three countries in the study. Developing such models can be crucial in pandemic planning and in helping to tackle COVID-19 more effectively. In addition to the studied countries, the proposed models and training strategies can be applied to other countries and can help in assessing the effectiveness of the policies that are being imposed to curb the spread of the virus. In the future, we aim to integrate additional information about the types of SARS-CoV-2 variants, vaccination status of citizens, healthcare infrastructure, etc. as features to further improve the model capacity and performance.

Will COVID-19 confirmed cases in the USA reach 3 million? A forecasting approach by using SutteARIMA Method
Stability analysis and numerical simulation of SEIR model for pandemic COVID-19 spread in Indonesia
Forecasting COVID-19 Cases Using Alpha-Sutte Indicator: A Comparison with Autoregressive Integrated Moving Average (ARIMA) Method
COVID-19 Data Repository by the
Meteorological and human mobility data on predicting COVID-19 cases by a novel hybrid decomposition method with anomaly detection analysis: A case study in the capitals of Brazil
Comparative study of machine learning methods for COVID-19 transmission forecasting
Artificial Neural Networks architectures for stock price prediction: comparisons and applications
Crowd management COVID-19
Forecasting the patterns of COVID-19 and causal impacts of lockdown in top five affected countries using Bayesian Structural Time Series Models
Transfer Learning for COVID-19 cases and deaths forecast using LSTM network
Generated time-series prediction data of COVID-19′s daily infections in Brazil by using recurrent neural networks
Long Short-Term Memory
Will COVID-19 generate global preparedness?
Projecting the criticality of COVID-19 transmission in India using GIS and machine learning methods
Time series prediction of COVID-19 transmission in America using LSTM and XGBoost algorithms. Results Phys. 27, 104462
Potential neutralizing antibodies discovered for novel corona virus using machine learning
COVID-19-related hospital cost-outcome analysis: The impact of clinical and demographic factors
Understanding mutation hotspots for the SARS-CoV-2 spike protein using Shannon Entropy and K-means clustering
Rectified Linear Units Improve Restricted Boltzmann Machines 8.
Our World in Data
A Comparison: Prediction of Death and Infected COVID-19 Cases in 5th International Conference on Computer Science and Computational Intelligence
Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil
Modeling and Forecasting of COVID-19 Growth Curve in India
Implementation of stacking based ARIMA model for prediction of Covid-19 cases in India
COVID-19 detection using deep learning models to exploit Social Mimic Optimization and structured chest X-ray images using fuzzy color and stacking approaches
Predicting the time period of extension of lockdown due to increase in rate of COVID-19 cases in India using machine learning
Time series prediction for the epidemic trends of COVID-19 using the improved LSTM deep learning method: Case studies in Russia, Peru and Iran
On the Critical Role of Human Feces and Public Toilets in the Transmission of COVID-19: Evidence from China
Bio-informed Protein Sequence Generation for Multi-class Virus Mutation Prediction
World Health Statistics 2021: A visual summary
Multi-hour and multi-site air quality index forecasting in Beijing using CNN, LSTM, CNN-LSTM, and spatiotemporal clustering
Deep learning methods for forecasting COVID-19 time-Series data: A Comparative study
A Novel Coronavirus from Patients with Pneumonia in China, 2019 | NEJM [WWW Document]

This work is supported by the start-up fund provided by CMU Mechanical Engineering.