title: Predictive Analysis of COVID-19 Time-series Data from Johns Hopkins University
authors: Javid, Alireza M.; Liang, Xinyue; Venkitaraman, Arun; Chatterjee, Saikat
date: 2020-05-07

We provide a predictive analysis of the spread of COVID-19, the disease caused by the SARS-CoV-2 virus, using the dataset made publicly available online by the Johns Hopkins University. Our main objective is to provide predictions of the number of infected people for different countries over the next 14 days. The predictive analysis is done using time-series data transformed to a logarithmic scale. We use two well-known methods for prediction: polynomial regression and neural networks. As the number of training samples for each country is limited, we use a single-layer neural network called the extreme learning machine (ELM) to avoid over-fitting. Due to the non-stationary nature of the time-series, a sliding-window approach is used to provide a more accurate prediction.

The COVID-19 pandemic has led to a massive global crisis, caused by its rapid spread and severe fatality rate, especially among those with weak immune systems. In this work, we use the available COVID-19 time-series of infected cases to build models for predicting the number of cases in the near future. In particular, given the time-series up to a particular day, we make predictions for the number of cases in the next $\tau$ days, where $\tau \in \{1, 3, 7, 14\}$; that is, we predict for the next day, after 3 days, after 7 days, and after 14 days. Our analysis is based on the time-series data made publicly available on the COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at the Johns Hopkins University (JHU) (https://systems.jhu.edu/research/public-health/ncov/) [1].

Let $y_n$ denote the number of confirmed cases on the $n$-th day of the time-series after the start of the outbreak. Then we have the following setup:
• The input consists of the last $n$ samples of the time-series, $\mathbf{y}_n \triangleq [y_1, y_2, \cdots, y_n]$.
• The predicted output is $t_n = \hat{y}_{n+\tau}$, with $\tau \in \{1, 3, 7, 14\}$.
• Due to the non-stationary nature of the time-series data, a sliding window of size $w$ is used over $\mathbf{y}_n$ to make the prediction, and $w$ is found via cross-validation.
• The predictive function $f(\cdot)$ is modeled either by a polynomial or by a neural network, and is used to make the prediction $\hat{y}_{n+\tau} = f(\mathbf{y}_n)$.

The dataset from JHU contains the cumulative number of cases reported daily for different countries. We base our analysis on the 12 countries listed in Table I. For each country, we consider the time-series $\mathbf{y}_n$ starting from the day on which the first case was reported. Given the current day index $n$, we predict the number of cases for day $n+\tau$ by taking as input the number of cases reported for the past $w$ days, that is, for days $n-w+1$ to $n$. We use purely data-driven prediction approaches without considering any other aspect, for example, models of infectious disease spread [2]. We apply two approaches to analyze the data and make predictions, or in other words, to learn the function $f$:
• Polynomial model approach: the simplest curve-fitting (approximation) model, in which the number of cases is approximated locally by a polynomial; $f$ is a polynomial.
• Neural network approach: a supervised learning approach that uses training data in the form of input-output pairs to learn a predictive model; $f$ is a neural network.
We describe each approach in detail in the following subsections. A minimal sketch of the sliding-window construction used by both approaches is given below.
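To make the sliding-window setup concrete, here is a minimal Python sketch of how the input-output pairs can be built from a cumulative-case series. The helper name make_windows, the toy numbers, and the base-10 log transform (the paper states a logarithmic scale but this excerpt does not give the base) are our illustrative assumptions, not code from the paper.

```python
import numpy as np

def make_windows(y, w, tau):
    """Build sliding-window pairs: each input is w consecutive samples
    [y_{n-w+1}, ..., y_n]; each target is the value tau days ahead, y_{n+tau}."""
    X, t = [], []
    for n in range(w - 1, len(y) - tau):
        X.append(y[n - w + 1 : n + 1])
        t.append(y[n + tau])
    return np.array(X), np.array(t)

# Toy cumulative-case series on a log scale (illustrative numbers only).
y = np.log10(np.array([1, 2, 4, 7, 12, 20, 33, 54, 88, 140], dtype=float))
X, t = make_windows(y, w=4, tau=1)
print(X.shape, t.shape)  # (6, 4) (6,)
```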
A. Polynomial model

1) Model: We model the expected value of $y_n$ as a third-degree polynomial function of the day number $n$:
$$f(n) = p_0 + p_1 n + p_2 n^2 + p_3 n^3.$$
The set of coefficients $\{p_0, p_1, p_2, p_3\}$ is learned using the available training data. Given the highly non-stationary nature of the time-series, we consider local polynomial approximations of the signal over a window of $w$ days, instead of using all the data to estimate a single polynomial $f(\cdot)$ for the entire time-series. Thus, at the $n$-th day, we learn the corresponding polynomial $f(\cdot)$ using $\mathbf{y}_{n,w} \triangleq [y_{n-w+1}, \cdots, y_{n-1}, y_n]$.

2) How the model is used: Once the polynomial is determined, we use it to predict for the $(n+\tau)$-th day as
$$\hat{y}_{n+\tau} = f(n+\tau).$$
For every prediction day, we construct the corresponding polynomial function $f(\cdot)$ using $\mathbf{y}_{n,w}$, the most recent input data of size $w$. The appropriate window size $w$ is found through cross-validation.

B. Neural networks

1) Model: We use the extreme learning machine (ELM) as the neural network model to avoid overfitting the training data. As the length of the time-series for each country is limited, the number of training samples for the neural network is quite small, which can lead to severe overfitting in large-scale neural networks such as deep neural networks (DNNs), convolutional neural networks (CNNs), etc. [3], [4]. ELM, on the other hand, is a single-layer neural network that uses random weights in its first hidden layer [5]. The use of random weights has gained popularity due to its simplicity and effectiveness in training [6]-[8]. We now briefly describe ELM.

Consider a dataset containing $N$ pairs of $P$-dimensional input data $\mathbf{x} \in \mathbb{R}^P$ and corresponding $Q$-dimensional target vectors $\mathbf{t} \in \mathbb{R}^Q$: $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{t}_n)\}_{n=1}^{N}$. We construct the feature vector as $\mathbf{z}_n = g(\mathbf{W}\mathbf{x}_n) \in \mathbb{R}^h$, where
• the weight matrix $\mathbf{W} \in \mathbb{R}^{h \times P}$ is an instance of a normal distribution,
• $h$ is the number of hidden neurons, and
• the activation function $g(\cdot)$ is the rectified linear unit (ReLU).
To predict the target, we use a linear projection of the feature vector $\mathbf{z}_n$ onto the target. Let the predicted target for the $n$-th sample be $\mathbf{O}\mathbf{z}_n$, where $\mathbf{O} \in \mathbb{R}^{Q \times h}$. Using $\ell_2$-norm regularization, we find the optimal solution of the following convex optimization problem:
$$\mathbf{O}^\star = \arg\min_{\mathbf{O}} \sum_{n=1}^{N} \|\mathbf{t}_n - \mathbf{O}\mathbf{z}_n\|_2^2 + \lambda \|\mathbf{O}\|_F^2,$$
where $\|\cdot\|_F$ denotes the Frobenius norm. Once the matrix $\mathbf{O}^\star$ is learned, the prediction for any new input $\mathbf{x}$ is given by
$$\hat{\mathbf{t}} = \mathbf{O}^\star g(\mathbf{W}\mathbf{x}).$$

2) How the model is used: When using ELM to predict the number of cases, we define $\mathbf{x}_n = [y_{n-w+1}, \ldots, y_{n-1}, y_n]^\top$ and $t_n = y_{n+\tau}$; note that $\mathbf{x}_n \in \mathbb{R}^w$ and $t_n \in \mathbb{R}$. For a fixed $\tau \in \{1, 3, 7, 14\}$, we use cross-validation to find the proper window size $w$, the number of hidden neurons $h$, and the regularization hyperparameter $\lambda$. Minimal sketches of the two models and of this hyperparameter search are given below.
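First, a minimal sketch of the local cubic fit of Subsection A, assuming numpy's polyfit as the least-squares solver; indexing the days relative to the window is our implementation choice.

```python
import numpy as np

def poly_predict(y_window, tau, degree=3):
    """Fit a degree-3 polynomial to the window [y_{n-w+1}, ..., y_n]
    (days indexed 0..w-1) and extrapolate to day w-1+tau.
    Requires len(y_window) > degree for a well-posed fit."""
    w = len(y_window)
    coeffs = np.polyfit(np.arange(w), y_window, degree)  # [p3, p2, p1, p0]
    return np.polyval(coeffs, w - 1 + tau)
```

Refitting this polynomial on the latest window each day gives the 'Poly' predictions; additionally re-selecting $w$ each day gives what the paper later calls 'Poly time-varying'.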
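Next, a sketch of the ELM of Subsection B, assuming the standard closed-form ridge solution for the read-out matrix $\mathbf{O}$; the class structure and the seeding are our choices, not the authors' code.

```python
import numpy as np

class ELM:
    """Single-hidden-layer network: random weights W, learned read-out O."""

    def __init__(self, h, lam, seed=0):
        self.h, self.lam = h, lam
        self.rng = np.random.default_rng(seed)
        self.W = None
        self.O = None

    def fit(self, X, T):
        # X: N x P inputs, T: N x Q targets.
        P = X.shape[1]
        self.W = self.rng.standard_normal((self.h, P))  # W drawn from N(0, 1)
        Z = np.maximum(0.0, X @ self.W.T)               # ReLU features, N x h
        # Closed form of O* = argmin_O sum_n ||t_n - O z_n||^2 + lam ||O||_F^2
        A = Z.T @ Z + self.lam * np.eye(self.h)         # h x h
        self.O = np.linalg.solve(A, Z.T @ T).T          # Q x h
        return self

    def predict(self, X):
        return np.maximum(0.0, X @ self.W.T) @ self.O.T
```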
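Finally, a hedged sketch of the cross-validation that selects $(w, h, \lambda)$, reusing make_windows and ELM from the sketches above. The grids and the hold-out of the most recent 5 targets are our illustrative assumptions, not the authors' exact protocol; re-running this selection every day as new samples arrive yields the 'time-varying' variants used in the experiments below.

```python
import numpy as np

def select_hyperparameters(y, tau, grid_w=(5, 10, 15), grid_h=(10, 25, 50),
                           grid_lam=(1e-2, 1e-1, 1.0), n_val=5):
    """Grid-search (w, h, lam), validating on the most recent n_val targets."""
    best, best_err = None, np.inf
    for w in grid_w:
        X, t = make_windows(y, w, tau)      # helper from the earlier sketch
        if len(t) <= n_val:
            continue
        X_tr, t_tr = X[:-n_val], t[:-n_val]
        X_va, t_va = X[-n_val:], t[-n_val:]
        for h in grid_h:
            for lam in grid_lam:
                model = ELM(h, lam).fit(X_tr, t_tr[:, None])
                pred = model.predict(X_va).ravel()
                err = np.mean(np.abs(pred - t_va) / np.abs(t_va)) * 100
                if err < best_err:
                    best, best_err = (w, h, lam), err
    return best
```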
In this subsection, we make predictions based on the time-series data that is available up to today, May 4, 2020, for $\tau \in \{1, 3, 7\}$. We estimate the number of cases for the last 31 days for the countries in Table I. For each value of $\tau \in \{1, 3, 7\}$, we compare the estimated number of cases $\hat{y}_{n+\tau}$ with the true value $y_{n+\tau}$ and report the estimation error as a percentage, i.e.,
$$e_{n+\tau} = \frac{|y_{n+\tau} - \hat{y}_{n+\tau}|}{y_{n+\tau}} \times 100.$$
We carry out two sets of experiments for each of the two approaches (polynomial and ELM) to examine their sensitivity to newly arriving training samples. In the first set of experiments, we run cross-validation once to fix the hyperparameters, without using the samples of the time-series observed as we proceed through the 31-day span. In the second set of experiments, we run cross-validation daily as new samples of the time-series are observed. In the latter setup, the window size $w$ varies over time, since the optimal hyperparameters are re-estimated as we proceed through time. We refer to these setups as 'ELM time-varying' and 'Poly time-varying' in the rest of the manuscript.

We first show the reported and estimated numbers of infection cases for Sweden using ELM time-varying for different values of $\tau$ in Figure 1. For each $\tau$, we estimate the number of cases up to $\tau$ days beyond the last day for which JHU data is available. In our later experiments, we show that ELM time-varying is typically more accurate than the other three methods (Poly, Poly time-varying, and ELM). This better accuracy is consistent with the non-stationary behavior of the time-series data, or in other words, with the best model parameters changing over time. Hence, the result of ELM time-varying is shown explicitly for Sweden. According to our experimental results, we predict that a total of 23039, 23873, and 26184 people will be infected in Sweden on May 5, May 7, and May 11, 2020, respectively.

Histograms of the error percentage of the four methods are shown in Figure 2 for different values of $\tau$. The histograms are computed using a nonparametric kernel-smoothing distribution over the past 31 days for all 12 countries. The daily error percentage for each country in Table I is shown in Figures 7-15. Note that the reported error percentage of ELM is averaged over 100 Monte Carlo trials. The average and standard deviation of the error over the 31 days are reported (in percentage) in the legend of each figure for all four methods. It can be seen that daily cross-validation is crucial for preserving consistent performance throughout the pandemic, resulting in more accurate estimates. In other words, the variations of the time-series as $n$ increases are significant enough to change the statistics of the training and validation sets, which, in turn, leads to different optimal hyperparameters as the length of the time-series grows. It can also be seen that ELM time-varying provides a more accurate estimate, especially for large values of $\tau$. Therefore, for the rest of the experiments, we focus only on ELM time-varying as our favored approach. Another interesting observation is that the performance of ELM time-varying improves as $n$ increases. This observation agrees with the general principle that neural networks typically perform better as more data become available.

We report the average error percentage of ELM time-varying over the last 10 days of the time-series in Table II. We see that the estimation error increases as $\tau$ increases. For $\tau = 7$, ELM time-varying works well for most of the countries; it does not perform well for France and India. This poor estimation for a few countries could be due to a significant amount of noise in the time-series data, possibly even caused by inaccurately reported daily cases.

In this subsection, we repeat the prediction based on the time-series data that is available up to today, May 12, 2020, for $\tau \in \{1, 3, 7\}$. In Subsection IV-A, we predicted the total number of cases in Sweden on May 5, May 7, and May 11, 2020. The reported numbers of cases on these days for Sweden turned out to be 23216, 24623, and 26670, respectively, which is within a similar range of error to that reported in Table II. We show the reported and estimated numbers of infection cases for Sweden using ELM time-varying for different values of $\tau$ in Figure 3.
For each τ , we estimate the number of cases up to τ days after which JHU data is collected. According to our experimental result, we predict that a total of 27737, 28522, and 30841 people will be infected in Sweden on May 13, May 15, and May 19, 2020, respectively. Histograms of error percentage of the four methods are shown in Figure 4 for different values of τ . These experiments verify that ELM time-varying is the most consistent approach as the length of the time-series increases from May 4 to May 12. We report the average error percentage of ELM timevarying over the last 10 days of the time-series in Table III . We see that as τ increases the estimation error increases. When τ = 7, ELM time-varying works well for all of the countries except India, even though the number of training samples has increased compared to Subsection IV-A. In this subsection, we repeat the prediction based on the time-series data which is available until today May 20, 2020, for τ ∈ {1, 7, 14}. In Subsection IV-B, we predicted the total number of cases in Sweden on May 13, May 15, and May 19, 2020. The reported number of cases on these days for Sweden turned out to be 27909, 29207, and 30799, respectively, which is in the similar range of prediction error that is reported in Table III . We increase the prediction range τ in this subsection and we show the reported and estimated number of infection cases for Sweden by using ELM time-varying for τ = 1, 7, and 14 in Figure 5 . For each τ , we estimate the number of cases up to τ days after which JHU data is collected. According to our experimental result, we predict that a total of 32032, 34702, and 37188 people will be infected in Sweden on May 21, May 27, and June 3, 2020, respectively. Histograms of error percentage of the four methods are shown in Figure 6 for different values of τ . These experiments verify that ELM time-varying is the most consistent approach as the length of the time-series increases from May 12 to May 20. We report the average error percentage of ELM timevarying over the last 10 days of the time-series in Table IV . We see that as τ increases the estimation error increases. When τ = 7, ELM time-varying works well for all of the countries so we increase the prediction range to 14 days. We observe that ELM time-varying fails to provide an accurate estimate for several countries such as France, India, Iran, and USA. This experiment shows that long-term prediction of the spread COVID-19 can be investigated as an open problem. However, by observing Tables II-IV, we expect that the performance of ELM time-varying to improve in the future as the number of training samples increases during the pandemic. We studied the estimation capabilities of two well-known approaches to deal with the spread of the COVID-19 pandemic. We showed that a small-sized neural network such as ELM provides a more consistent estimation compared to polynomial regression counterpart. We found that a daily update of the model hyperparameters is of paramount importance to achieve a stable prediction performance. The proposed models currently use the only samples of the time-series data to predict the future number of cases. A potential future direction to improve the estimation accuracy is to incorporate constraints such as infectious disease spread model, non-pharmaceutical interventions, and authority policies [2]. 
[3] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in Advances in Neural Information Processing Systems, 2013.

[Figures 7-15: daily error percentage of the last 31 days for each of the 12 countries, comparing ELM, ELM time-varying, Poly, and Poly time-varying; each legend reports the average and standard deviation of the error percentage over the 31 days for all four methods.]