key: cord-1040790-0ozqja6w authors: Singh, D. K. K.; Kumar, S.; Dixit, P.; Bajpai, D. M. K. title: Kalman Filter Based Short Term Prediction Model for COVID-19 Spread date: 2020-06-03 journal: nan DOI: 10.1101/2020.05.30.20117416 sha: 73924069b04d70bd1ef34a879ed7001f63c1111c doc_id: 1040790 cord_uid: 0ozqja6w COVID-19 has emerged as global medical emergency in recent decades. The spread scenario of this pandemic has shown many variations. Keeping all this in mind, this article is written after various studies and analysis on the latest data on COVID-19 spread, which also includes the demographic and environmental factors. After gathering data from various resources, all data are integrated and passed into different Machine Learning Models to check the fit. Ensemble Learning Technique,Random Forest, gives a good evaluation score on the test data. Through this technique, various important factors are recognised and their contribution to the spread is analysed. Also, linear relationship between various features is plotted through heatmap of Pearson Correlation matrix. Finally, Kalman Filter is used to estimate future spread of COVID19, which shows good result on test data. The inferences from Random Forest feature importance and Pearson Correlation gives many similarities and some dissimilarities, and these techniques successfully identify the different contributing factors. The Kalman Filter gives a satisfying result for short term estimation, but not so good performance for long term forecasting. Overall, the analysis, plots, inferences and forecast are satisfying and can help a lot in fighting the spread of the virus. name refers to the characteristic appearance of virions by electron microscopy. This is due to the 'viral' spikes which are proteins on the surface of the virus. It causes respiratory throat infections in humans. COVID-19 is large spherical particle with surface projections. The average diameter is approximately 120 nanometer (nm) [1] . Envelope diameter is 80 nm while spikes are 20 nm long. The viral envelope consists of a lipid out layer where the membrane, envelope and spike structural proteins are anchored. Inside the envelope, there is the nucleocapsid, which is formed of proteins, single stranded RNA genome. COVID-19 contains SSRNA [2] . The genome size ranges from 26.4-31.7 ab. It also contains "polybasic cleavage site" which increases pathogenicity and transmissibility. The spike protein is responsible for allowing the virus to attach and fuse with the membrane of a host cell. COVID-19 has sufficient affinity to the receptor angiotensin converting enzyme on human cells to use them as mechanism of cell entry [3] . The cell's trans-membrane protease serine cuts open the protein and exposing peptide after a COVID-19 virion attaches to a target cell. The virion then releases RNA into the cell and forces the cell to produce and disseminate copies of the virus [4] . The RNA genome then attaches to host cells ribosome for translation. The basic reproductive number (RO) of the virus has been estimated to be between 1.4 to 3.9. The infection transmits mainly through droplets released from infected person or surfaces containing these infected droplets [5] . The detection and controlling of any infection based disease outbreaks have been major concern in public health [2] . The data collection from different sources such as health departments, emergency department, weather information etc plays vital role in decision making to control the epidemic. It has been well established in literature that data sources contains important information that helps to current public health status [6, 7] . Now, it becomes important that world health authorities and the global population remain vigilant against a resurgence of virus. Situational Information in social media during COVID-19 has been studied [8] . Markov switching model has been used to detect the disease outbreak [9] . Prediction model of HIV incidence has been proposed using neural network [10] . There are many methods exiting in literature to early detection of different diseases [11, 12, 13] . In case of any epidemic, the early prediction play vital role to control the epidemic. The Government agencies as well as public health organizations will prepare early according to the prediction. There are many prediction algorithms available in the literature for different types of data [2, 9] . Kalman filter is one of the popular filter to study of multivariable systems, highly fluctuated data, time varying systems and also suitable to forecast random processes [14] . Yang and Zhang used Kalman filter for prediction of stock price [15] . They have used Changbasihan as a test case to predict the stock price [15] . The extended Kalman filter in nonlinear domain has been studied by Iqbal et al [16] . Kalman filter is very useful in the field of Robotics [19] . It is extensively used to optimize the robotic movements, tracking of robots and their localization [19] . Kalman filter is also used in supply chain as abstraction [17] . It is also useful in manufacturing process to improve the capability of overall process [18] . Another useful application of Kalman filter was reported in literature to estimate parameters of train that is coating on a flat track [19] . The structural and replication mechanism followed by COVID-19 have motivated the study of demogramic and environmental factors affecting its spread. The present manuscript has included many such factors for the study e.g. Minimum Temperature, Maximum Temperature, Humidity, Rain Fall etc. The effect of these factors has been studied thoroughly on the spread rate. The effect is also studied on death rate and active cases rate. The Kalman filter tries to predict the state ∈ of a discrete-time dependent process that is controlled by linear stochastic difference equation: With a measurement ∈ which is = + The random variable represents the process noise and represents the measurement noise. These both variables are assumed to be independent of each other with normal probablity Here, Q is the process noise covariance matrix and R is the measurement noise covariance matrix. A is matrix which establishes relationship between the state at the previous time step and the state at the current time step, when the driving function or process noise is absent. B is 1matrix which establishes relationship between the optional control input ∈ 1 and the state x. G is matrix which establishes relationship between the state and the measurement . . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) We can narrow the focus to the specific equations and their use in this version of the filter. The Kalman filter does estimation of a process by a kind of feedback control: the filter predicts the process state at some time and then accepts feedback in the form of measurements. Hence, the Kalman filter algorithm can be divided into two parts: Here,~is the predicted mean on time step i before seeing the measurement and is the estimated mean on time step i after seeing the measurement. ~i s the predicted covariance on time step i before seeing the measurement and is the estimated covariance on time step i after seeing the measurement. is mean of the measurement on time step i. is the measurement residual on time step i. is the measurement prediction covariance on time step i. reflects how much prediction needs correction on time step i. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Perform measurement updates as: Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. We have used Pearson Correlation Coefficient for analysing the relationship between the generated features and the number of confirmed cases. Pearson correlation coefficient is a statistic which lies between -1 to +1, and measures linear correlation between any two variables Z and V. +1 means total positive linear correlation, -1 mean total negative linear correlation and 0 means no correlation. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. . Pearson's correlation coefficient can be presented as ratio of the covariance of the two variables Z and V and the product of their standard deviations. Here, n is sample size. and are individual sample points indexed with i. ¯ and ¯are sample mean. and are sample standard deviation. Decision Trees are an important type of algorithm for predictive modelling machine learning. The classical decision tree algorithms have been around for decades and modern variations like random forest are among the most powerful techniques available. A decision tree is created by dividing up the input space. A greedy approach is used to divide the space called recursive binary splitting. This is a numerical procedure where all the values are lined up and different split points are tried and tested using a cost function. The split with the best cost is selected. All input variables and all possible split points are evaluated and chosen in a greedy manner. The cost function that is minimized to choose split points, for regression predictive modelling problems, as in our case, is the sum squared error across all training samples that fall within the rectangle: Here, y is the output for the training sample and prediction is the predicted output for the rectangle. The problem with decision trees are that they are sensitive to the specific data they are trained on. Changing the training data changes the structure of the tree and hence the predictions differ. They are computationally expensive to train, with high chance of overfitting. A Random Forest is a bagging technique, where the trees are run in parallel, without any interaction. It operates by constructing a multitude of decision trees at training time and outputting mean prediction of the individual trees. A random forest is a meta-estimator which aggregates many decision trees. Here, the number of features that can be split on at each node is limited. It ensures that the ensemble model doesn't rely too much on any individual feature. Each tree draws a random sample from the original dataset which prevents over fitting. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. Every decision tree of the forest chooses a subset of the feature set and produces a result accordingly. By analysing the result of every such decoupled decision tree, it can be decided which feature has more ability to guide us to the required result. The more a feature contributes towards minimizing the error, the more it's importance. The minimization effect for a feature can be calculated by taking mean of the error values from the trees in which it appears. It can also be said that features that appear on higher levels of the tree are more important as they contribute to relatively higher information gain. It can be said that random forests provide better results for feature importance as it includes results of several decision trees and thus is more generalised than other approaches. Steps followed: It provides the different attributes like, Province/State, Country/Region, Confirmed COVID_19 cases, Death, and Cured. We collected the data between 31-01-2020 to 22-04-2020 for above study. The data of weathers are also collected for study of their effects. The maximum temperature, minimum temperature, humidity etc was fetched from available API in python. Pandas and Matplotlib in Python is used to plot these statistics. Kalman future forecast algorithm was used to predict the future growth of number of cases in India as whole, as well as Indian state wise. Kalman algorithm requires Time-Series Data as input hence some data pre-processing was done on the collected data. Pandas in Python are used to pre-process these .csv files. The authors for this manuscript collected the data of the temperature and humidity from fetched https://www.worldweatheronline.com/ API in python. All the major COVID-19 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. . https://doi.org/10.1101/2020.05.30.20117416 doi: medRxiv preprint hotspots cities were considered from each states and an average of their data was allotted to the corresponding state. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. Covid-19 from all these states for our analysis as well as prediction purpose. Training of the proposed model has been done by using the data between January 30 and May 09, 2020. Figure 1 shows the spread scenario in these states during this period. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. . https://doi.org/10.1101/2020.05.30.20117416 doi: medRxiv preprint positive correlation with growth in 1 day, growth in 3 days, growth in 5 days, and growth in 7 days. Total confirmed cases are highly dependent on the cases reported in previous 7 days. Hence, it is supporting the well exist idea of COVID-19 spread chain which is dependent on number of previous cases [22, 23] . . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. . https://doi.org/10.1101/2020.05.30.20117416 doi: medRxiv preprint importance for the prediction. It has also been seen from fig 3 that maximum temperature is having very less importance as compared to minimum temperature. Humidity is also playing crucial role in spread and is well noted from the figure. It has been pointed out from fig. 2 and 3 that humidity is having negative correlation with confirmed cases and prediction model also. It has higher importance for prediction than temperature. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. The time-series dataset prepared is then used for future forecast of COVID-19 cases; statewise first then for India. Firstly, our Kalman Filter is applied for the training till May 01, 2020. Validation of our prediction model has been performed by using the dataset fro May 02-09, 2020. Predicted data has been compared with the real data for same time period. Table 3 shows the average mean error reported for different states of India in prediction. This error is absolute difference between predicted values and real data. It can be noted that mean . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. . Figure 4 shows the validation results of prediction model with state wise data. Some random states have been chosen to show the graphical results. All state data has been shown in table 3 . . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. . https://doi.org/10.1101/2020.05.30.20117416 doi: medRxiv preprint Figure 5 shows the results obtained by prediction model for next 30 days cases of COVID-19 for different states in India. It has been observed that Kalman filter based prediction model shows higher deviation from real data for long term prediction. Kalman filter based prediction is more accurate for short term prediction. This phenomenon is well supported by Pearson coefficient and random forest based study. Both the studies are showing that confirmed cases have strong positive correlation as well as high importance for historical spread data. Hence, any error in prediction for a single day will be propagated and will produce the larger error after few days. Hence, Kalman filter based prediction model is good for short term prediction i.e. Daily and Weekly. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 3, 2020. . https://doi.org/10.1101/2020.05.30.20117416 doi: medRxiv preprint (e) The present manuscript presented a prediction model based on Kalman filter. The correlations between different features of COVID-19 spread have been studied. It has been found that previous spread data has strong positive correlation with the prediction. The importance of different features in prediction model has also been studied in the present manuscript. It has been noted that historical spread scenario has large impact on the current spread. Hence, it can be concluded that COVID-19 spread is following a chain. Hence, to reduce the spread this chain has to be breaked. The proposed prediction model is providing encouraging results for the short term prediction. It has been noted that for long term prediction, Kalman filter based proposed model is showing large mean average error. Hence, it can be conclude that proposed prediction model is good for short term prediction i.e. daily and weekly. The proposed prediction model can be updated to accommodate long term and medium term time series prediction in future. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 3, 2020. . Learning https://medium.com/@Synced/how-random-forest-algorithm-works-in-machine-learning3c0fe15b6674 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 3, 2020. . https://doi.org/10.1101/2020.05.30.20117416 doi: medRxiv preprint Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle Early prediction of the 2019 novel corona virus outbreak in the mainland china based on simple mathematical model Sars-cov-2 viral load in upper respiratory specimens of infected patients Covid-19 epidemic editorial Novel Corona virus (2019-nCoV) Advice for the public Estimating the asymptomatic proportion of coronavirus disease 2019(covid-19) cases on board the diamond princess cruise ship, yokohama, japan. Euro surveillance Herd immunity estimating the level required to halt the covid-19 epidemics in affected countries Characterizing the Propagation of Situational Information in Social Media During COVID-19 Epidemic: A Case Study on Weibo Prospective Infectious Disease Outbreak Detection Using Markov Switching Models Study on Prediction Model of HIV Incidence Based on GRU Neural Network Optimized by MHPSO Mesh free alternate directional implicit method based three dimensional super-diffusive model for benign brain tumor segmentation Mesh Free based Variational Level Set Evolution for Breast Region Segmentation and Abnormality Detection using Mammograms Fractional Order Savitzky-Golay Differentiator based Approach for Mammogram Enhancement The Kalman Filter: An algorithm for making sense of fused sensor insight Gps/imu data fusion using multisensory kalman filtering: introduction of contextual aspects Applications of an Extended Kalman filter in nonlinear mechanics An extended Kalman filter for collaborative supply chains Kalman filtering for manufacturing processes Estimating train parameters with an unscented kalman filter Study of Non-Pharmacological Interventions on COVID-19 Spread