key: cord-035113-bhnv2qbi
authors: Senapati, Apurbalal; Nag, Amitava; Mondal, Arunendu; Maji, Soumen
title: A novel framework for COVID-19 case prediction through piecewise regression in India
date: 2020-11-10
journal: Int J Inf Technol
DOI: 10.1007/s41870-020-00552-3
doc_id: 35113
cord_uid: bhnv2qbi

The outbreak of COVID-19 has created a disastrous situation in more than 200 countries around the world. Predicting the future trend of the disease in different countries can therefore be useful for managing the outbreak. Several data-driven studies have predicted COVID-19 cases, using features of past data to predict future values. In this study, a machine learning (ML)-guided linear regression model is used to address different COVID-19 related issues. The linear regression model is fitted to the dataset to model the total number of positive cases and the number of recoveries for several states in India: Maharashtra, West Bengal, Kerala, Delhi and Assam. Analysis of the current COVID-19 data shows that the daily number of infections first follows a linear trend and then increases exponentially. This property has been incorporated into our prediction, and piecewise linear regression is the model best suited to capture it. The experimental results show the superiority of the proposed scheme and, to the best of our knowledge, this is a new approach to the prediction of COVID-19.

The pandemic coronavirus disease (COVID-19) has spread all over the world, and every country is trying in its own way to contain the disease. On March 11, 2020, WHO declared the outbreak of COVID-19 a global pandemic. Some countries, such as Taiwan, South Korea, Vietnam, New Zealand, Germany and France, have successfully controlled the disease, but many countries, including India, the USA and Brazil, are still struggling to suppress the pandemic.
After a lockdown lasting until 31st May 2020, the Government of India started to unlock the country in a phased manner, because a developing country like India could not bear the financial losses of a prolonged shutdown of the entire country. The period from 1st June 2020 to 30th June 2020 was declared unlock-1, 1st July 2020 to 31st July 2020 unlock-2, and 1st August 2020 to 31st August 2020 unlock-3. In all these unlock phases, the Government of India provided guidelines to the general public as well as to different organizations on how to conduct business, transportation, festivals and other social and commercial activities. A visual representation of the pandemic, along with the time frame and the important decisions of the Indian Government, is given in Fig. 1. To date, whatever actions, precautions and decisions the Government of India has taken, the spread of COVID-19 across the states of the country could not be stopped, which has created panic among millions of people all over the country. In almost all states across India, new cases as well as deaths are increasing exponentially. Because the number of patients is increasing while healthcare-related infrastructure is limited, some states started to impose partial or full lockdowns from the middle of July, during unlock phase-2. This increased disruption of normal life has affected the financial health of the country, and additional resource requirements to combat COVID-19 need to be anticipated. As a result, predicting the future development trend of the pandemic has become a very popular research topic in recent days. Various mathematical and statistical forecasting tools [1] [2] [3] [4] [5] have been applied to generate short-term and long-term forecasts.
Multiple machine learning tools, such as multiple regression analysis [6], a progressive partial derivative linear regression model [7], and a hybrid approach combining the autoregressive integrated moving average model with a wavelet-based forecasting model [8], have also been applied to make these predictions. In this study, different COVID-19 related issues are addressed with the help of a piecewise regression model, namely the total number of positive cases and the number of recoveries for several states in India: Maharashtra, West Bengal, Kerala, Delhi and Assam. Our observation from the COVID-19 data is that it follows a linear trend for a few days, after which the pattern becomes non-linear. This property has been incorporated through piecewise linear regression, which is the model best suited to capture this locally linear behaviour. Moreover, the proposed scheme is also used to predict the number of total confirmed cases, active positive cases, and recoveries. We have chosen these five states for the following reasons. Maharashtra is the most affected state, with the maximum number of positive cases, while West Bengal is among the ten most affected states in India. The situation in Delhi and Kerala is somewhat different from that of other states: in Delhi the number of cases has been decreasing in recent days, whereas in Kerala, after the daily infection rate had been brought under control, the number of positive cases per day is increasing again. We have also included Assam in our study because, in northeast India, Assam is the state where the number of positive cases per day is increasing at an alarming rate, although initially Assam was very successful in combating the spread of the disease by taking corrective and preventive measures in advance. However, no state has yet succeeded in eliminating COVID-19 completely. Therefore, it is important to detect the epidemic pattern in order to plan early action to combat COVID-19.
Machine learning (ML) can help us infer useful knowledge from massive past epidemic data. Several researchers [8] [9] [10] [11] [12] have adopted machine learning based approaches to analyze COVID-19 cases. ML is a subdomain of artificial intelligence (AI) [13, 14]. It can intelligently solve a variety of problems by learning from recorded information. Generally, ML algorithms are classified into supervised learning and unsupervised learning [15]. Regression analysis, a part of supervised learning, is a powerful statistical tool that uses previous experience or example data to make predictions through a learning strategy. Based on the various learning strategies, there are many regression models, such as linear regression, logistic regression, multiple regression and piecewise/stepwise regression [16] [17] [18]. In this paper, we use a piecewise linear regression approach to predict and track the spread of the virus. The proposed scheme can be a valuable tool for early alerts in the battle against COVID-19. The main contributions of this paper are given below:

The rest of the paper is organized as follows. An overview of the doubling exponential model and the piecewise regression approach is presented in Section 2. Section 3 provides the experimental results and discusses in detail the trend of COVID-19 cases. Finally, Section 4 presents conclusions and future work.

In this section, we discuss in detail our proposed scheme, a linear regression based model for predicting the number of total confirmed cases, active positive cases, and recoveries. First, infection spreading is discussed, followed by the linear regression model used in the proposed work. In the proposed scheme, different types of data for various states, such as data for confirmed cases, active positive cases, and recoveries, have been collected.
We define infection spreading using the concept of the exponential growth function, in particular the doubling exponential. We first describe the doubling exponential in brief and then show how infection spreading is derived from it. So far, many mathematical models have characterized early epidemic growth as following an exponential curve. Some of them characterize the exponential growth by the doubling time, that is, the time taken for the number of infections to double from a given day. We have also adopted the doubling concept, but with a different interpretation. Hence, for our model we consider an exponential function $y(t)$, the number of positive cases at time $t$. Here, instead of finding the number of infections at time $t$, we find the doubling time from the given $y(t)$. In our experiment, we define the doubling time as the number of days taken for the current count to double. Mathematically, if $n$ is the positive case count at time $t$, then $(2 \times n)$ is the positive case count at time $(t + t_d)$, where $t_d$ is the doubling time (see Fig. 2).

Figure 2 appears to show a highly spreading situation initially. In practice this did not happen; the graph shows it only because of the low initial values. The positive case count starts at one, then doubles to two $(2 \times 1)$, to four $(2 \times 2)$, and so on. This does not indicate high infection; it is a boundary effect of the small starting values. Next, the figure shows the infection spreading reaching its highest level, i.e. the peak, after which the spreading gradually falls and then gradually rises again. This indicates that a second phase of infection spreading is under way in the state of Kerala. The most notable case is Delhi, where the spreading curve is gradually diminishing. On the other hand, for the state of West Bengal the spreading is tending to rise. Our prediction model uses piecewise linear regression, which is a special case of linear regression.
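Returning briefly to the doubling-time definition used above for infection spreading, the following minimal Python sketch shows one way it could be computed from a series of cumulative counts. The function name `doubling_times`, the rule "first later day on which the count is at least twice the current count", and the sample counts are illustrative assumptions, not the authors' implementation or data.

```python
from typing import List, Optional, Sequence


def doubling_times(cases: Sequence[int]) -> List[Optional[int]]:
    """For each day t, return t_d, the number of days until the cumulative
    count first reaches at least 2 * cases[t]; None if that never happens
    within the series (or if the count on day t is zero)."""
    result: List[Optional[int]] = []
    for t, n in enumerate(cases):
        t_d: Optional[int] = None
        if n > 0:
            for later in range(t + 1, len(cases)):
                if cases[later] >= 2 * n:   # count has doubled by this day
                    t_d = later - t
                    break
        result.append(t_d)
    return result


# Hypothetical cumulative counts, for illustration only.
cumulative = [1, 2, 4, 7, 10, 14, 20, 26]
print(doubling_times(cumulative))
```

In the first days the small counts double within a single day, which mirrors the boundary effect described for Fig. 2. Days near the end of the series have no doubling time yet because the doubled count has not been observed.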
Sometimes data do not follow a linear pattern, as shown in Fig. 3a. If we nevertheless try to model them using linear regression, the fit will be poor, and using such a model for prediction results in a high error. In that situation a single line is simply not enough to fit the data, and piecewise linear regression overcomes this limitation, as depicted in Fig. 3b. When the data set follows different linear trends over different partitions of the data, we should model the regression function in several pieces. Each piece is a linear regression corresponding to one partition, and whether the pieces are connected depends on the data and the problem. When they are connected, the connecting points are known as breakpoints, i.e. the points where the slope changes. For two pieces with intercepts $a_1, a_2$ and slopes $b_1, b_2$, the model can be written as

$$y = a_1 + b_1 x \quad \text{for } x \le p, \qquad y = a_2 + b_2 x \quad \text{for } x > p.$$

The point at $x = p$ is the joining point of the two lines, i.e., a breakpoint. We assume that the regression function is continuous at the breakpoint, so the two values of $y$ must be equal at the breakpoint (when $x = p$), i.e., we have the relation

$$a_1 + b_1 p = a_2 + b_2 p.$$

The same concept can be extended to multiple breakpoints, depending on the data. To implement this model for a given data set, the main challenge is to partition the data set for the piecewise regression. In other words, the problem is to find the breakpoints in the data. In our experiment, we do this by computing the slope between each pair of consecutive points, i.e., if there are $n$ points, there are $(n - 1)$ such slopes. Wherever these slopes change abruptly, we take that point as a breakpoint. This is done heuristically, by observing the slopes.

This section presents the experimental results in detail and discusses possible future trends. For our predictions, we used training data up to 10th September 2020.
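The slope-based partitioning and piecewise fitting described for the proposed model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the numeric `threshold` standing in for the visual judgement of an "abrupt" slope change, the unit spacing of days on the x-axis, and the sample data are all hypothetical. Each segment is fitted independently by ordinary least squares, with neighbouring segments sharing the breakpoint sample. Continuity at the breakpoint then holds exactly only when the data are exactly piecewise linear, and a constrained fit would be needed to enforce it in general.

```python
from typing import List, Sequence, Tuple

Piece = Tuple[int, int, float, float]  # (start index, end index, intercept, slope)


def find_breakpoints(y: Sequence[float], threshold: float) -> List[int]:
    """n points give n - 1 consecutive slopes; flag the point where two
    neighbouring slopes differ by more than `threshold` (an assumed cutoff)."""
    slopes = [y[i + 1] - y[i] for i in range(len(y) - 1)]
    return [i + 1 for i in range(len(slopes) - 1)
            if abs(slopes[i + 1] - slopes[i]) > threshold]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_piecewise(y: Sequence[float], breakpoints: Sequence[int]) -> List[Piece]:
    """Fit one line per segment; neighbouring segments share the breakpoint."""
    edges = [0] + list(breakpoints) + [len(y) - 1]
    pieces: List[Piece] = []
    for start, end in zip(edges[:-1], edges[1:]):
        xs = list(range(start, end + 1))
        intercept, slope = fit_line(xs, y[start:end + 1])
        pieces.append((start, end, intercept, slope))
    return pieces


def predict(pieces: Sequence[Piece], x: float) -> float:
    """Evaluate the piece covering x; extrapolate the last piece beyond the data."""
    for _, end, intercept, slope in pieces:
        if x <= end:
            return intercept + slope * x
    _, _, intercept, slope = pieces[-1]
    return intercept + slope * x


# Hypothetical daily counts whose slope changes from 1 to 3 at day 5.
cases = [0, 1, 2, 3, 4, 5, 8, 11, 14, 17]
breaks = find_breakpoints(cases, threshold=1.0)
pieces = fit_piecewise(cases, breaks)
print(breaks, pieces, predict(pieces, len(cases)))
```

On noisy real data a fixed threshold may flag many neighbouring points, so some smoothing or a minimum segment length would likely be needed before fitting.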
The prediction results for confirmed and cured cases of COVID-19 up to 30-10-2020 are listed in Tables 1 and 2, respectively. The source of the dataset used in this paper for the COVID-19 outbreak in India is Kaggle [19]. The cumulative confirmed cases and daily confirmed cases of COVID-19 for the selected states in India are shown in Fig. 4. Cumulative confirmed and daily confirmed COVID-19 cases have increased exponentially in Maharashtra, and according to our model they will increase further in the near future. However, the daily confirmed cases in Delhi are decreasing and are predicted to follow the same trend in the upcoming days. Our study also indicates that Kerala, West Bengal and Assam will continue to follow their current trends in the upcoming month. Figure 6 shows the cumulative confirmed cases and daily confirmed cases of COVID-19 for the five states separately.

Figure 5 shows the cumulative recovery cases and daily recovery cases of COVID-19 for the selected states in India. The predicted results for cumulative recovery cases are also shown in Table 2. As seen above, the maximum cumulative and daily positive active cases were found in the state of Maharashtra, and the same trend is observed in Maharashtra for the current and predicted recovery cases. On some days, Delhi has also reported the maximum number of daily recoveries. Our study also indicates that recoveries in Kerala, West Bengal and Assam will continue to follow their current trends in the upcoming month.

A comparison is carried out with an existing system based on a regression model using a sixth-degree polynomial [8]. For compatibility, we used the same training data as that system and predicted the same dates that it predicted. In [8], the training data run from 1st March to 11th April 2020 and predictions are made for the next seven days, i.e. from 12th April to 18th April 2020.
On the other hand, our proposed approach uses training data from 1st March 2020 to 10th September 2020 and predicts confirmed and cured cases until 30th October 2020. Moreover, the proposed model also addresses cured cases, which are not covered in [8]. Table 3 and Fig. 7 give the details of the comparison.

The magnitude of relative error (MRE) and the mean absolute percentage error (MAPE) are widely used to measure the accuracy of regression based prediction models [20, 21], and the performance of the proposed scheme is evaluated using these two measures. The MRE is defined as

$$\mathrm{MRE}_i = \frac{|x_i - y_i|}{x_i}$$

and the MAPE is

$$\mathrm{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \frac{|x_i - y_i|}{x_i}$$

where $x_i$ is the actual value, $y_i$ is the predicted value and $N$ is the total number of observations. Table 4 presents the values of MRE and MAPE for both positive and recovery COVID-19 cases. A lower MRE value indicates that the data are scattered close to the regression line. Correspondingly, a small value of MAPE denotes that the differences between the actual and predicted values are small. Therefore, it can be concluded from Table 4 that the actual and predicted values are very close.

The COVID-19 outbreak has become a disaster for several nations. However, the recovery rate of COVID-19 in India is more than 88%. In this study, we have proposed a piecewise linear regression based machine learning approach for predicting the positive cases and recovery cases of five different states in India. The main novelty of the proposed scheme is that we apply piecewise linear regression instead of simple linear regression. As a result, the proposed scheme produces accurate predictions for both cases.
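For reference, the two error measures used in the evaluation above can be computed as in the following minimal Python sketch. It assumes the conventional definitions, per-observation MRE as $|x_i - y_i| / x_i$ and MAPE as the mean of these values expressed in percent, and it assumes that all actual values are non-zero. The function names and the sample numbers are hypothetical and are not taken from Table 4.

```python
from typing import List, Sequence


def mre(actual: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """Magnitude of relative error per observation: |x_i - y_i| / x_i.
    Assumes every actual value x_i is non-zero."""
    return [abs(x - y) / x for x, y in zip(actual, predicted)]


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error: 100 / N times the sum of the MRE values."""
    errors = mre(actual, predicted)
    return 100.0 * sum(errors) / len(errors)


# Hypothetical actual and predicted case counts, for illustration only.
actual = [100, 200, 400]
predicted = [110, 190, 400]
print(mre(actual, predicted))
print(mape(actual, predicted))
```

A prediction that matches the actual count contributes an MRE of zero, so smaller values of both measures correspond to predictions that lie closer to the observed data.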
Hence, it may be concluded that our model could also be applied to other COVID-19 parameters in any state or country. In the future, we will focus on developing various ML- and DL-based models to improve performance in combating COVID-19 as well as other possible pandemics. The main challenge in implementing piecewise linear regression is finding the points at which to partition the data. In this paper, the partition has been done heuristically by observing the slopes between points, and we have used the data of the past 7 days to predict the next day. In future, our aim is to find optimal partitioning points such that the error is minimized.

1. Xing X (2020) Early transmission dynamics in Wuhan, China, of novel corona virus infected pneumonia
2. Now casting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
3. Analysis and forecast of COVID-19 spreading in China, Italy and France
4. Early dynamics of transmission and control of COVID-19: a mathematical modelling study
5. Preliminary estimation of the novel coronavirus disease (COVID-19) cases in Iran: a modelling analysis based on overseas cases and air travel data
6. Prediction of the number of deaths in India due to SARS-CoV-2 at 5-6 weeks
7. Partial derivative nonlinear global pandemic machine learning prediction of covid 19
8. Data analysis of COVID-2019 epidemic using machine learning methods: a case study of India
9. Nemati Nazafarin (2020) Machine-learning approaches in COVID-19 survival analysis and discharge-time likelihood prediction using clinical data
10. COVID-19 epidemic analysis using machine learning and deep learning algorithms
11. Din Ud et al (2020) Machine learning based approaches for detecting COVID-19 using clinical text data
12. Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: a data-driven analysis
13. A machine learning based method to detect epilepsy
14. Deep learning: evolution and expansion
15. A review of supervised and unsupervised machine learning techniques for suspicious behavior recognition in intelligent surveillance system
16. Piecewise linear regression based on plane clustering
17. Regression based FPGA power estimation tool (FPETool) for embedded multiplier block
18. Comparison and analysis of logistic regression, Naive Bayes and KNN machine learning algorithms for credit card fraud detection
19. COVID-19 in India, dataset on novel corona virus disease 2019 in India
20. Alternatives to accuracy and bias metrics based on percentage errors for radiation belt modeling applications. No. LA-UR-16-24592
21. An IoT-based system to evaluate indoor air pollutants using grey relational analysis