key: cord-0948054-nl91dn9t authors: Hridoy, A.-E. E.; Naim, M.; Emon, N. U.; Tipo, I. H.; Alam, S.; Al Mamun, A.; Islam, M. S. title: Forecasting COVID-19 Dynamics and Endpoint in Bangladesh: A Data-driven Approach date: 2020-06-28 journal: nan DOI: 10.1101/2020.06.26.20140905 sha: ca91b8154c89d402370ca21bc799475fcfde6dc9 doc_id: 948054 cord_uid: nl91dn9t

On December 31, 2019, the World Health Organization (WHO) was informed that atypical pneumonia-like cases had emerged in Wuhan City, Hubei province, China. WHO identified the cause as a novel coronavirus and declared a global pandemic on March 11, 2020. At the time of writing, COVID-19 has claimed more than 440 thousand lives worldwide and has pushed the global economy and social life to the edge of an abyss unprecedented in living memory. The confirmed cases in Bangladesh have surpassed 100 thousand, with more than 1,343 deaths, causing serious concern among policymakers and health professionals; thus, prediction models are necessary to forecast the possible number of future cases. To this end, this paper presents two data-driven estimation methods, Long Short-Term Memory (LSTM) networks and Logistic Curve fitting, to predict the possible number of COVID-19 cases in Bangladesh for the upcoming months. The Logistic Curve results suggest that Bangladesh passed the inflection point around 28-30 May 2020, that a plausible end date is 2 January 2021, and that the total number of infected people is expected to be between 187 thousand and 193 thousand, assuming that stringent policies remain in place. The Logistic Curve also suggests that Bangladesh would reach peak COVID-19 cases at the end of August with more than 185 thousand total confirmed cases, and that around 6,000 daily new cases may be observed. Our findings recommend that containment strategies be implemented immediately to reduce the transmission and epidemic rate of COVID-19 in the upcoming days.
Coronavirus disease 2019 (COVID-19), the ongoing worldwide pandemic, is a highly contagious disease that predominantly causes respiratory complications ranging from cough to acute respiratory distress syndrome. At the time of writing, it has claimed more than 446,000 lives and spread to 213 countries around the world. Atypical pneumonia cases were identified in Wuhan City, Hubei province, China, from 8 December 2019 (Zhu et al., 2020); the disease later broke out within China and continued to spread all over the world, causing panic worldwide. Since then, it has posed unprecedented challenges to global health, and identifying its behavior has become a challenging task for the scientific community. In the following months, the World Health Organization (WHO) declared it a global pandemic due to its highly contagious behavior (Wang, Wang, Ye, & Liu, 2020). The virus that causes COVID-19, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), named by the International Committee on Taxonomy of Viruses (ICTV), belongs to the genus Betacoronavirus; it is a positive-sense single-stranded RNA (+ssRNA) virus and thus can mutate frequently to cope with changing environments (Kaneko, Nimmerjahn, & Ravetch, 2006). The mean incubation period of COVID-19 is estimated at 6.5 days (range 2-14 days), during which patients remain asymptomatic or experience only mild symptoms and thus spread the virus silently (Lai, Shih, Ko, Tang, & Hsueh, 2020; Rothan & Byrareddy, 2020). As of now, there is no proven vaccine for COVID-19, which adds to its seriousness. Symptoms of COVID-19 vary from person to person; however, common symptoms include respiratory stress, fever, cough, shortness of breath, and breathing difficulties (Yang et al., 2020). The case-fatality rate of COVID-19 is estimated at 2.3%, but the rate is higher for older people with comorbidities (Novel, 2020).
This preprint (which was not certified by peer review) was posted June 28, 2020 and is made available under a CC-BY-ND 4.0 International license; the copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. https://doi.org/10.1101/2020.06.26.20140905

WHO (2020) recommended preventive measures for COVID-19 that include maintaining social distancing, washing hands frequently, and avoiding touching the mouth, nose, and face. The first three cases of COVID-19 in Bangladesh, originating from Italy, were reported on 8 March 2020. The disease spread to most districts of the country within a month. After the confirmation of the first three cases, the government of Bangladesh declared a 10-day lockdown from 26 March 2020 to 4 April 2020. The lockdown was later extended several times until 30 May 2020, after which zone-based lockdown strategies were adopted. In addition, all educational institutions have been closed since 17 March 2020. The government of Bangladesh suspended all national and international flights from 30 March 2020 and reopened international flights on a small scale from 16 June 2020. At the time of writing, the confirmed cases in Bangladesh have surpassed 100 thousand, with a death toll of more than 1,343. The administration is struggling to accommodate COVID-19 patients. Therefore, predictive models can be a valuable tool for forecasting cases in the upcoming days (Tobías, 2020). COVID-19 data follow a non-linear function; thus, linear methods might fail to capture the COVID-19 dynamics (Chimmula & Zhang, 2020).
Several statistical methods, such as Auto-Regressive Integrated Moving Average (ARIMA), Moving Average (MA), and Auto-Regressive (AR) methods, have been used to understand the COVID-19 dynamics (Benvenuto, Giovanetti, Vassallo, Angeletti, & Ciccozzi, 2020; Dehesh, Mardani-Fard, & Dehesh, 2020), but these methods did not fit COVID-19 data well. To overcome such barriers, we adopted data-driven deep-learning-based LSTM networks and Logistic Curve fitting methods to predict likely future cases and deaths. The proposed methods approximately fit the real data; thus, policymakers can prepare for the surge of COVID-19 in the upcoming days. Besides, early prediction using mathematical models would help government officials and the health administration of Bangladesh to prepare beforehand and take precautionary measures to reduce the casualties of COVID-19.

Ahmed, Hossain, Haque, and Hossain (2020), using SIR models, projected the final infection size in Bangladesh under eight different scenarios. al Azad and Hussain (2020), using the SIR model, estimated that the basic reproduction number ($R_0$) of COVID-19 in Bangladesh as of May is 2.88. However, a review of the literature shows that data-driven approaches to predicting COVID-19 dynamics in Bangladesh are limited. Therefore, this study adopted a deep learning method (LSTM) and the Logistic Curve model to predict likely future cases of COVID-19, its peak time, and the plausible end date of the COVID-19 pandemic in Bangladesh. Our analysis may help policymakers and health professionals to take necessary measures against COVID-19 in the upcoming months. The layout of this article is as follows. In section 2, similar studies are presented. In section 3, methods and models are explained.
In section 4, results are presented and discussed, and in section 5, concluding remarks are presented.

Several works using machine learning approaches have been carried out globally since the beginning of the COVID-19 pandemic; some of them are mentioned below. Chimmula and Zhang (2020), using an LSTM model, predicted that the possible ending point of the COVID-19 outbreak in Canada would be around June 2020. Kafieh et al. (2020) used different machine learning approaches to predict the future number of cases in Iran and found that M-LSTM was the most accurate model in their study.

The COVID-19 data used in this research were collected from Johns Hopkins University's GitHub repository (https://github.com/CSSEGISandData/COVID-19). The dataset includes the number of total confirmed cases, fatalities, and recovered patients at the end of each day. For the LSTM models, this study used the COVID-19 data of Bangladesh from 8 March 2020 to 13 June 2020. The deep learning models were trained with 98 daily observations of total confirmed, recovered, and death cases. The dataset was divided into a 97% training set and a 3% test set. For the Logistic Curve fitting model, data from 8 March 2020 to 18 June 2020 were used.

A sequence of data points recorded at regular intervals of time is called a time series (TS). In regression-based predictive modeling, it is desirable that the statistical properties of the observations remain consistent over time. In TS terminology, this consistency is referred to as the time series being stationary, which indicates that it has a constant mean
and variance with respect to time. These are important characteristics of a TS; however, they can easily be violated by the presence of trend, seasonality, and error in the TS. A TS is said to have a trend when its level rises or falls persistently over time, and seasonality when a certain pattern repeats over regular intervals of time. A non-stationary time series has trends or seasonal effects; as a result, its statistical properties change over time. Time series datasets of COVID-19 may have non-stationary patterns due to lockdown, social distancing, etc. Thus, it is important to know the nature of a TS before applying forecasting methods to it.

This study adopted the Augmented Dickey-Fuller (ADF) test (Cheung & Lai, 1995) to check the stationarity of the TS dataset. ADF is a unit root test whose null hypothesis is that the series has a unit root. The results are interpreted through the p-value of the test. If the p-value is below the chosen threshold (1% or 5%), we can reject the null hypothesis (i.e., the series does not have a unit root and can be regarded as stationary); otherwise, a p-value above the threshold suggests that we fail to reject the null hypothesis (i.e., the series has a unit root and can be regarded as non-stationary). The ADF test in our study suggests that we fail to reject the null hypothesis; the data have a unit root and are non-stationary.

A recurrent neural network (RNN) is a type of artificial neural network used in temporal domains to learn sequential patterns. Recurrent LSTM networks can address the limitations of traditional time series forecasting techniques by adapting to the nonlinearities of the given COVID-19 dataset and can produce state-of-the-art results on temporal data (Chimmula & Zhang, 2020).
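As an aside on the unit-root check described earlier in this section, the idea can be illustrated with a minimal sketch. It is not the procedure used in this study: it implements the plain Dickey-Fuller regression with a constant and no lagged-difference terms (the "augmented" part is omitted), it compares the t-statistic with an approximate large-sample 5% critical value of -2.86 instead of computing a p-value, and both input series are synthetic. The function name `dickey_fuller_t` and the two example series are assumptions made for illustration.

```python
import math
import random


def dickey_fuller_t(series):
    """t-statistic of gamma in: diff(y)_t = alpha + gamma * y_(t-1) + e_t."""
    lagged = series[:-1]
    diffs = [nxt - cur for cur, nxt in zip(series[:-1], series[1:])]
    n = len(diffs)
    mean_x = sum(lagged) / n
    mean_d = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in zip(lagged, diffs))
    gamma = sxd / sxx                       # OLS slope on the lagged level
    alpha = mean_d - gamma * mean_x         # OLS intercept
    rss = sum((d - alpha - gamma * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_err


# Approximate large-sample 5% critical value for the constant-only case.
CRITICAL_5PCT = -2.86

# Synthetic, cumulative-style series with a growing level (98 points).
cumulative = [float(t * t) for t in range(98)]

# Synthetic white noise, which is stationary by construction.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]

t_cumulative = dickey_fuller_t(cumulative)
t_noise = dickey_fuller_t(noise)

for name, t_stat in (("cumulative", t_cumulative), ("noise", t_noise)):
    verdict = "reject unit root" if t_stat < CRITICAL_5PCT else "fail to reject"
    print(f"{name}: t = {t_stat:.2f} -> {verdict}")
```

A statistic below the critical value rejects the unit-root null, so the white-noise series is classified as stationary while the growing series is not. For real data, a library routine such as `adfuller` in `statsmodels.tsa.stattools` would be the more usual choice, since it handles lag selection and p-values.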
The Long Short-Term Memory (LSTM) model (M. Zhang, Geng, & Chen, 2020; Q. Zhang, Gao, Liu, & Zheng, 2020) is an advancement of the recurrent neural network. RNNs suffer from the vanishing gradient problem, meaning that the networks cannot learn from long data sequences. To overcome this barrier, the LSTM was proposed by Hochreiter and Schmidhuber (1997). Each LSTM block operates at a different time step and passes its output to the next block until the final LSTM block generates the sequential output (Figure 2). The basic structure of an LSTM consists of four gates: the input gate, forget gate, control gate, and output gate. The input gate decides which information can be transferred from the previous cell to the current cell. The input gate is defined as

$$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) $$

The forget gate decides whether the information from the previous memory should be stored or not. The forget gate is defined as

$$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) $$

The control gate controls the update of the cell and is defined as

$$ \tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) $$

$$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$

Finally, the output gate updates the hidden layer $h_{t-1}$ and also updates the output, and is given by the following equations:

$$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) $$

$$ h_t = o_t * \tanh(C_t) $$

In the above equations, $x_t$ is the input; $W$ and $b$ are the corresponding weight matrices and biases, respectively; $C_{t-1}$ and $C_t$ are the previous and current block memory, respectively; and $h_{t-1}$ and $h_t$ are the outputs of the previous and current blocks, respectively.
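The gate descriptions above can be sketched as a single forward step of one LSTM block. The sketch assumes the standard LSTM formulation (sigmoid input, forget, and output gates, and a tanh candidate for the control gate); it is not the authors' implementation, and the helper names (`affine`, `lstm_step`), the parameter layout, and the all-zero weights are hypothetical choices that make the arithmetic easy to check by hand.

```python
import math


def sigmoid(v):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, bias, vector):
    """Return W @ vector + bias for a matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM block update; params maps a gate name to (W, b)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_(t-1), x_t]
    i_t = [sigmoid(v) for v in affine(*params["input"], z)]
    f_t = [sigmoid(v) for v in affine(*params["forget"], z)]
    g_t = [math.tanh(v) for v in affine(*params["control"], z)]
    o_t = [sigmoid(v) for v in affine(*params["output"], z)]
    # New block memory: keep part of the old memory, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # New block output: gated, squashed view of the memory.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical block with one hidden unit and one input feature. With all
# weights and biases at zero, every sigmoid gate equals 0.5 and the tanh
# candidate equals 0, so the new memory is exactly half of the old one.
gate_names = ("input", "forget", "control", "output")
params = {name: ([[0.0, 0.0]], [0.0]) for name in gate_names}

h_t, c_t = lstm_step(x_t=[5.0], h_prev=[0.0], c_prev=[1.0], params=params)
print(h_t, c_t)
```

Chaining `lstm_step` over successive days, feeding each returned `h_t` and `c_t` into the next call, gives the sequence of blocks described in the text.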
Moreover, $\tanh$ is the hyperbolic tangent function, which scales values into the range -1 to 1, and $\sigma$ is the sigmoid activation function, which gives an output between 0 and 1.

The Logistic Curve model is widely used to describe the growth of a population. An infectious disease outbreak like COVID-19 can be seen as the growth of the population of a pathogen; thus, a logistic model seems reasonable for modeling its progression (Bertolaccini & Spaggiari, 2020). The most generic expression of a logistic function is given below:

$$ f(x, a, b, c) = \frac{c}{1 + e^{-(x - b)/a}} $$

where $x$ is the time and $a$, $b$, $c$ are parameters: $a$ refers to the infection speed, $b$ is the day on which the maximum number of infections occurred, and $c$ is the total number of recorded infected people at the infection's end.

Evaluating model accuracy is a crucial part of building a machine learning model, as it indicates how well the model predicts. There are many types of evaluation metrics, and they vary according to the model type. It is wise to evaluate a model with multiple metrics to ensure the correct and optimal operation of the model. In our study, the following evaluation metrics were considered. The MSE metric is the average of the squared differences between the true and predicted values over the dataset. The equation of MSE is given below:

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

RMSE is the square root of MSE and is defined by

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

The $R^2$, or coefficient of determination, metric represents how well the predicted values fit the true values of the dataset. Its value, from 0 to 1, is interpreted as a percentage, where a higher value means a better model.
The equation is given below:

$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

In the above equations, $y_i$ is the $i$th observed value, and $\hat{y}_i$ and $\bar{y}$ are the predicted value and the mean value of $y$, respectively. The MAPE metric measures the size of the error in percentage terms relative to the actual values and is defined as

$$ \mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| $$

where $y_i$ is the actual value and $\hat{y}_i$ is the corresponding estimated value for the $i$th sample from all $n$ available samples.

In the following section, we present the experimental results for the validation and analysis of the models proposed in this study. We hope that our approaches and predictive models will shed some light on the COVID-19 dynamics in Bangladesh, help raise awareness of undesired circumstances ahead, and support necessary measures being taken in advance to suppress the COVID-19 pandemic in Bangladesh.

This study used the COVID-19 data of Bangladesh from 8 March 2020 (when the first case of COVID-19 was registered in Bangladesh) to 13 June 2020. The deep learning models were trained with 98 daily observations of total confirmed, recovered, and death cases. The dataset was divided into a 97% training set (from 8 March 2020 to 10 June 2020) and a 3% test set (from 11 June 2020 to 13 June 2020). For Logistic Curve fitting, data from 8 March 2020 to 18 June 2020 were considered. The performance of the LSTM models was evaluated with mean square error (MSE), root mean square error (RMSE), and mean absolute percentage error (MAPE). Before selecting the best-performing LSTM model,
we considered three types of LSTM models (i.e., Vanilla, Stacked, and Bidirectional) for univariate time series forecasting. We ran each model on the total confirmed, recovered, and deceased data, and then selected the best-performing model based on the performance metrics. The performance results of the different LSTM models are given in Table 1. The bidirectional LSTM performed better than the others; its RMSE is 20.16, its $R^2$ score is 0.75, and its accuracy is 98.75%. The LSTM results suggest that, in the near future, the total number of confirmed cases will continue to grow exponentially. Thus, strict lockdown and social distancing are necessary to reduce the transmission of COVID-19. The data predicted by the LSTM model are very close to the actual data.

Traditional infectious disease modeling uses differential equations based on the dynamics of the disease, in which many factors are involved; thus, it often leads to over-fitting. It is observed that the Logistic Curve fits the COVID-19 data of Bangladesh very well. The parameters of the Logistic Curve are obtained by using a non-linear optimization algorithm, and the results are shown in Table 2. The COVID-19 data of Bangladesh from 8 March 2020 to 18 June 2020 were considered for Logistic Curve fitting.
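The fitting step can be illustrated with a small, self-contained sketch. The text does not name the non-linear optimization algorithm, so the approach below is a stand-in chosen for simplicity: for each candidate final size c on a grid, the logistic curve c / (1 + exp(-(x - b) / a)) is linearised as ln(c / y - 1) = -(x - b) / a, the parameters a and b are obtained by ordinary least squares, and the candidate with the smallest squared error is kept. The data are synthetic, generated from hypothetical parameter values (a = 12, b = 82, c = 190,000) over 103 days; they are not the values of Table 2.

```python
import math


def logistic(x, a, b, c):
    """Logistic curve c / (1 + exp(-(x - b) / a))."""
    return c / (1.0 + math.exp(-(x - b) / a))


def fit_line(xs, zs):
    """Ordinary least squares for z = m * x + k; returns (m, k)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    m = sxz / sxx
    return m, mean_z - m * mean_x


def fit_logistic(xs, ys, steps=2000, max_ratio=3.0):
    """Grid search over c with a linearised fit of a and b; returns a, b, c, R^2."""
    y_max = max(ys)
    best = None
    for k in range(1, steps + 1):
        c = y_max * (1.0 + (max_ratio - 1.0) * k / steps)  # candidate c > max(y)
        zs = [math.log(c / y - 1.0) for y in ys]           # logit-style transform
        m, intercept = fit_line(xs, zs)
        if m >= 0:
            continue  # a growing curve needs a negative slope in this space
        a = -1.0 / m
        b = intercept * a
        sse = sum((y - logistic(x, a, b, c)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    sse, a, b, c = best
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    return a, b, c, 1.0 - sse / sst


# Synthetic cumulative counts for day 0 .. day 102 (hypothetical parameters).
days = list(range(103))
cases = [logistic(d, 12.0, 82.0, 190000.0) for d in days]

a_fit, b_fit, c_fit, r2 = fit_logistic(days, cases)
print(f"a = {a_fit:.2f}, b = {b_fit:.2f}, c = {c_fit:.0f}, R^2 = {r2:.4f}")
```

On noisy real counts the linearised fit weights early, small values heavily, so a dedicated non-linear least-squares routine (for example `scipy.optimize.curve_fit`) is usually preferable; the grid search is used here only to keep the example dependency-free.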
As of 18 June 2020, the total number of confirmed COVID-19 cases in Bangladesh is 102,292, ranking it 17th among the most infected countries in the world. The number of recorded cases increased dramatically 40-50 days after the first confirmed cases, marking the point from which the number of infected cases started following an exponential trend. It is observed that the Logistic Curve fitting model (LG) approximates the total number of confirmed cases very well. The fitting parameters are included in Table 2, where a is the infection speed, b is the day on which the maximum number of infections occurred, c is the total number of confirmed infected people at the infection's end, and $R^2$ is the goodness of fit for the total confirmed cases.

Logistic Curve fitting on the total number of confirmed cases.

The logistic curve in Figure 4 suggests that Bangladesh is approaching its peak. Since the data approximately fit the curve, this study suggests that, assuming people maintain social distancing and the government of Bangladesh keeps its various interventions in place, there will be 135-140 thousand total confirmed cases at the end of June 2020. At the end of July 2020, there will be 178-180 thousand total confirmed cases, and at the end of August 2020, there will be more than 186 thousand confirmed cases, at which point the peak will be reached. Figure 4 also suggests that the curve will flatten at the
beginning of September 2020 if a strict lockdown is in place. The plausible end of the infection is at the beginning of January 2021, with total confirmed cases between 187 and 193 thousand. The turning point, or inflection point, is the point where the exponential growth of the transmission rate starts to decrease; more formally, it is where the curve's concavity changes from upward to downward. It is the midpoint of the spread of an infectious disease. Regarding the inflection point, Figures 4 and 5 show that the Logistic Curve predicts that the highest peak will be around the end of August 2020, with more than 186 thousand confirmed cases, where daily new cases may surpass 6,000. However, it is worth noting that the transmission of COVID-19 depends on several exogenous factors such as lockdown and social distancing; thus, this prediction might differ slightly from reality.

The Logistic model was also used to predict the total death toll in Bangladesh in the upcoming days.

Logistic Curve fitting on the total number of death cases.

It is observed that the Logistic model approximates the total death toll very well. The fitting parameters are included in Table 2, where a is the death speed, b is the day on which the maximum number of deaths occurred, c is the total number of deaths at the infection's end, and $R^2$ is the goodness of fit for the total deaths. The LG curve suggests that at the end of
June 2020, the death toll will be around 200, and at the end of July 2020, it will be more than 3,150. The death toll will reach its maximum around the end of August 2020, with more than 3,600 total deaths. Since the total number of deaths depends on the health care capacity of a country, we suspect that this number will differ from reality.

In this study, data-driven forecasting/estimation methods were adopted to estimate the possible total confirmed and deceased cases in the upcoming months. The rate of transmission in Bangladesh is still following exponential growth despite several intervention strategies taken by the government of Bangladesh. To the best of our knowledge, this is the first study to model COVID-19 transmission in Bangladesh using a deep learning approach. It is worth noting that, despite the small dataset, our model gave promising accuracy. Thus, these models will help policymakers to understand the future COVID-19 trajectory in Bangladesh, implement interventions, and estimate their impact. Our LSTM models can give valuable insights into COVID-19 transmission in Bangladesh and possibly predict future outbreaks. In addition, we adopted the Logistic Curve fitting model to estimate confirmed cases and the death toll. However, it is worth noting that the predictions and conclusions drawn from the results hold under the preconditions that control measures for COVID-19 in Bangladesh are stable and reliable and that the virus will not mutate in the future. The conclusions were drawn from fitting 103 days of data, during 65 days of which (25 March 2020 to 30 June 2020) a nation-wide lockdown was in place. The predictions will improve as new data become available in the upcoming days.
Modeling and Analysis of The Early-Growth Dynamics of COVID-19 Transmission
Application of the ARIMA model on the COVID-2019 epidemic dataset
The hearth of mathematical and statistical modelling during the Coronavirus pandemic
Lag order and critical values of the augmented Dickey-Fuller test
Time series forecasting of COVID-19 transmission in Canada using LSTM networks
Forecasting of covid-19 confirmed cases in different countries with arima models
How to make more from exposure data? An integrated machine learning pipeline to predict pathogen exposure
Long short-term memory
Artificial intelligence forecasting of covid-19 in china
COVID-19 in Iran: A Deeper Look Into The Future
Anti-inflammatory activity of immunoglobulin G resulting from Fc sialylation. Science
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and corona virus disease-2019 (COVID-19): the epidemic and the challenges
The reproductive number of COVID-19 is higher compared to SARS coronavirus
The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19) in China. Zhonghua liu xing bing xue za zhi = Zhonghua liuxingbingxue zazhi
Impact of control strategies on COVID-19 pandemic and the SIR model based forecasting in Bangladesh
Real-time forecasts of the COVID-19 epidemic in China from February 5th to February 24th, 2020. Infectious Disease Modelling
The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
Evaluation of the lockdowns for the SARS-CoV-2 epidemic in Italy and Spain after one month follow up
Prediction for the spread of COVID-19 in India and effectiveness of preventive measures
Predicting the ultimate outcome of the COVID-19 outbreak in Italy
A review of the 2019 Novel Coronavirus (COVID-19) based on current evidence
Real-time estimation and prediction of mortality caused by COVID-19 with patient information based algorithm
Influenza-like illness prediction using a long short-term memory deep learning model with multiple open data sources
Semi-Supervised Bidirectional Long Short-Term Memory and Conditional Random Fields Model for Named-Entity Recognition Using Embeddings from Language Models Representations
Public Environment Emotion Prediction Model Using LSTM
Epidemic growth and reproduction number for the novel coronavirus disease (COVID-19) outbreak on the Diamond Princess cruise ship from