key: cord-031940-bbord079
authors: Ye, Tingqing; Yang, Xiangfeng
title: Analysis and prediction of confirmed COVID-19 cases in China with uncertain time series
date: 2020-09-16
journal: Fuzzy Optim Decis Making
DOI: 10.1007/s10700-020-09339-4
sha:
doc_id: 31940
cord_uid: bbord079

This paper presents an uncertain time series model to analyse and predict the evolution of confirmed COVID-19 cases in China, excluding imported cases. Compared with the classical time series model, the uncertain time series model describes the COVID-19 epidemic better because it uses an uncertain hypothesis test to filter out outliers. This improvement is reflected in two observations. One is that the estimated variance of the disturbance term in the uncertain time series model is more appropriate and acceptable than that in the classical time series model. The other is that the disturbance term of the classical time series model cannot be regarded as a random variable and should instead be regarded as an uncertain variable.

COVID-19 has become a pandemic and a public health emergency. As of March 23, 2020, a total of 67,801 confirmed cases and 3,160 deaths had been reported in mainland China. Respiratory droplets and human-to-human contact are the major routes of transmission of COVID-19. Based on initial confirmed COVID-19 cases, how can we model its evolution? There exist two mathematical methods: probabilistic statistics and uncertain statistics. The difference between them is that the former works in the framework of probability theory, while the latter works in the framework of uncertainty theory. Probability theory is a "multiplication" mathematical system, and uncertainty theory is a "minimum" mathematical system. The origin of uncertainty theory can be traced to the pioneering work of Liu (2007), who established uncertainty theory based on the normality, duality, and subadditivity axioms.
In addition, to address the product uncertain measure, Liu (2009) introduced the fourth axiom of uncertainty theory, the product axiom. Since then, uncertainty theory has become a well-developed mathematical system and has been applied in many fields. This paper adopts an uncertain statistics methodology to interpret and analyse confirmed COVID-19 cases in China.

Uncertain regression analysis, as a field of uncertain statistics, was first presented by Yao and Liu (2018), who applied the least squares method to estimate the parameters of the uncertain regression model. Later, Lio and Liu (2018) investigated the residual and confidence intervals in an uncertain regression model. As an extension, the multivariate uncertain regression model was studied by Song and Fu (2018), Ye and Liu (2020a), and Zhang et al. (2020). In addition, some researchers analysed other uncertain regression models, such as the uncertain Verhulst-Pearl model by Liu (2019b), the uncertain Gompertz regression model by Hu and Gao (2020), the uncertain revised regression model by Fang and Hong (2020), and the uncertain Chapman-Richards growth model by Liu and Jia (2020). Other scholars have suggested various estimation approaches, such as the least absolute deviations estimation by Liu and Yang (2020b), Tukey's biweight estimation by Chen (2020), and the uncertain maximum likelihood estimation by Lio and Liu (2020). To test the fitness of estimation in an uncertain regression model, Ye and Liu (2020b) introduced an uncertain hypothesis test. It is worth mentioning that Liu (2020) investigated an uncertain growth model and applied it to depict the cumulative number of COVID-19 infections in China.

Uncertain time series analysis, another field of uncertain statistics, was first proposed by Yang and Liu (2019), who considered time series data as uncertain variables and implemented an uncertain autoregressive model.
They used a least squares method to estimate the unknown coefficients of the uncertain autoregressive model. Subsequently, a least absolute deviations method was proposed, and Zhao et al. (2020) derived the analytic solution of the uncertain autoregressive model based on the principle of least squares. Yang and Ni (2020) introduced another uncertain time series model, the uncertain moving average model, and converted the 1-order uncertain moving average model into an uncertain autoregressive model. Liu and Yang (2020a) offered a cross-validation process to determine the order of the uncertain autoregressive model. Tang (2020) extended the uncertain autoregressive model to the vector case and developed an uncertain vector autoregressive model. In this study, we aim to implement an uncertain time series model to analyse and predict the evolution of confirmed COVID-19 cases.

The rest of this paper is organized as follows. Section 2 employs a classical time series analysis to predict the cumulative number of confirmed COVID-19 cases in China, but the result is not ideal. Section 3 introduces uncertain time series analysis, including least squares estimations, residual analysis, the uncertain hypothesis test, the forecast value, and the confidence interval. Section 4 applies uncertain time series analysis to confirmed COVID-19 cases to obtain the uncertain time series model without outliers. Section 5 compares the results produced by classical time series analysis and uncertain time series analysis. Finally, Sect. 6 provides the conclusions of this paper.

Table 1 The cumulative number of confirmed COVID-19 cases excluding imported cases in China from February 13 to March 23, 2020
63,851 72,436 76,288 78,064 79,824 80,389 80,668 80,725 80,738 80,739
66,492 74,185 76,936 78,497 80,026 80,

In this section, classical time series analysis is applied to modeling the cumulative number of confirmed COVID-19 cases by local transmission in China.
We collected the data on the cumulative number of confirmed COVID-19 cases in China, excluding imported cases, from February 13 to March 23, 2020 (see Table 1) from the website of the National Health Commission of China (National Health 2020). Although the Chinese government has been publishing data since January 20, the confirmed cases before February 13 are not real-time data because laboratory testing capacity for COVID-19 was limited. To model the data in Table 1, we denote the cumulative number of confirmed COVID-19 cases in China from February 13 to March 23, 2020 by $X_1, X_2, \ldots, X_{40}$ (Fig. 1). In other words, for each $t$ ($t = 1, 2, \ldots, 40$), $X_t$ represents the cumulative number of confirmed COVID-19 cases in China on day $t$ after February 12, 2020. For example, $X_1$ is the cumulative number on February 13, 2020, and $X_{40}$ is the cumulative number on March 23, 2020.

According to the classical method, we first perform a stationarity test. Unfortunately, the data $X_1, X_2, \ldots, X_{40}$ do not pass the stationarity test. To obtain stationary data, we take the 1-order difference of $X_t$ and write

$$Y_t = X_{t+1} - X_t$$

for $t = 1, 2, \ldots, 39$. However, the data $Y_1, Y_2, \ldots, Y_{39}$ still do not pass the stationarity test. Thus, we denote the 2-order difference of $X_t$ by

$$Z_t = Y_{t+1} - Y_t$$

for $t = 1, 2, \ldots, 38$ (Fig. 3). Fortunately, the data $Z_1, Z_2, \ldots, Z_{38}$ now pass the stationarity test and the white noise test.

Let us consider an ARMA($p$, $q$) model. Based on the BIC criterion, we have $(p, q) = (4, 0)$. By using MATLAB, we obtain the fitted ARMA(4,0) model (2). In the ARMA(4,0) model (2), the estimated variance of the disturbance term is $201.41^2$. It is too large to be accepted.
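The two differencing steps used earlier in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `difference_twice` and the short input series are hypothetical and are not the data of Table 1.

```python
import numpy as np

def difference_twice(x):
    """Return the 1-order and 2-order differences of a series.

    y[t] = x[t+1] - x[t] and z[t] = y[t+1] - y[t] (0-based indices),
    so a series of length n yields n - 1 and n - 2 values.
    """
    x = np.asarray(x, dtype=float)
    y = np.diff(x)        # 1-order difference
    z = np.diff(x, n=2)   # 2-order difference, same as np.diff(y)
    return y, z

# Hypothetical cumulative counts, used only to show the shapes involved.
y, z = difference_twice([10, 13, 19, 28, 40])
print(y)  # 4 values from 5 observations
print(z)  # 3 values from 5 observations
```

With the 40 observations $X_1, \ldots, X_{40}$, the same call would return 39 first differences and 38 second differences, matching the index ranges given above.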
In addition, a residual plot in classical time series analysis is usually required to be a null plot, which has no separated points and describes a random variable with constant mean and variance. However, the residual plot for the ARMA(4,0) model (2) shown in Fig. 4 does not look like a null plot. In this case, if we still use classical time series analysis, then the distribution function is not close enough to the frequency we get. Thus the disturbance term cannot be regarded as a random variable, and classical time series analysis is not appropriate for predicting the future cumulative number of confirmed COVID-19 cases in China. That is why we use uncertain time series analysis in the next sections.

In this section, we introduce uncertain time series analysis, including least squares estimations, residual analysis, the uncertain hypothesis test, the forecast value, and the confidence interval. Since we chose the ARMA(4,0) model in the classical time series analysis, we only introduce the uncertain autoregressive model for convenience of calculation and comparison.

Yang and Liu (2019) proposed the uncertain autoregressive model

$$Z_t = a_0 + \sum_{i=1}^{k} a_i Z_{t-i} + \varepsilon_t, \tag{3}$$

where $a_0, a_1, \ldots, a_k$ are unknown parameters, $k$ is the order of the autoregressive model, and $\{\varepsilon_1, \varepsilon_2, \ldots\}$ is a sequence of disturbance terms which are iid uncertain variables with a common normal uncertainty distribution $N(e, \sigma)$, in which $e$ and $\sigma$ are unknown parameters. The uncertainty distribution of $N(e, \sigma)$ is

$$\Phi(x) = \left(1 + \exp\left(\frac{\pi (e - x)}{\sqrt{3}\,\sigma}\right)\right)^{-1}, \quad x \in \Re.$$

To be consistent in notation with the classical time series model, we write the uncertain autoregressive model (3) as an uncertain ARMA($k$, 0) model.

When $Z_1, Z_2, \ldots, Z_n$ are observed, Yang and Liu (2019) defined the least squares estimations of $a_0, a_1, \ldots, a_k$ in the uncertain ARMA($k$, 0) model (3) as the solution of the minimization problem

$$\min_{a_0, a_1, \ldots, a_k} \sum_{t=k+1}^{n} \left(Z_t - a_0 - \sum_{i=1}^{k} a_i Z_{t-i}\right)^2. \tag{4}$$

Denote the optimal solutions of Eq. (4) by $\hat a_0, \hat a_1, \ldots, \hat a_k$.
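The minimization problem in Eq. (4) is an ordinary linear least squares problem in the coefficients, so it can be solved with a standard solver. The following is a minimal sketch, assuming NumPy; the function name `fit_uncertain_ar` and the toy series are hypothetical, and the code is not the authors' implementation.

```python
import numpy as np

def fit_uncertain_ar(z, k):
    """Least squares estimates (a_0, a_1, ..., a_k) for an order-k model.

    Each row of the design matrix corresponds to one time index t and
    holds [1, z[t-1], ..., z[t-k]]; the target is z[t].
    """
    z = np.asarray(z, dtype=float)
    n = len(z)
    columns = [np.ones(n - k)]
    for i in range(1, k + 1):
        columns.append(z[k - i:n - i])   # lag-i values aligned with z[k:]
    design = np.column_stack(columns)
    target = z[k:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Toy series generated without noise from z_t = 2 + 0.5 * z_{t-1}.
series = [0.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])
print(fit_uncertain_ar(series, 1))  # estimates of a_0 and a_1
```

On a noise-free toy series the solver recovers the generating coefficients up to rounding, which is a quick sanity check before applying the same routine to observed data.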
The fitted ARMA($k$, 0) model is then

$$Z_t = \hat a_0 + \sum_{i=1}^{k} \hat a_i Z_{t-i} + \varepsilon_t,$$

and for each index $t$ ($t = k+1, k+2, \ldots, n$) the residual is

$$\hat\varepsilon_t = Z_t - \hat a_0 - \sum_{i=1}^{k} \hat a_i Z_{t-i}. \tag{5}$$

We use the average of the residuals, i.e.,

$$\hat e = \frac{1}{n-k} \sum_{t=k+1}^{n} \hat\varepsilon_t, \tag{6}$$

to estimate the expected value $e$ of the disturbance term, and

$$\hat\sigma^2 = \frac{1}{n-k} \sum_{t=k+1}^{n} (\hat\varepsilon_t - \hat e)^2 \tag{7}$$

to estimate the variance $\sigma^2$ of the disturbance term.

The uncertain hypothesis test, first proposed by Ye and Liu (2020b), is a mathematical tool that uses uncertainty theory to judge whether a hypothesis is correct according to the observed data. Next, the uncertain hypothesis test is employed to test the appropriateness of the estimations of the unknown parameters in the uncertain ARMA($k$, 0) model. In Sect. 3.2, we obtained the estimations $\hat e$ and $\hat\sigma$. Naturally, we are interested in testing whether these estimations are appropriate. Note that $\hat e$ and $\hat\sigma$ are estimations of the expected value $e$ and the standard deviation $\sigma$ of the disturbance term, respectively. Therefore, we test the appropriateness of $\hat e$ and $\hat\sigma$ by testing whether $e = \hat e$ and whether $\sigma = \hat\sigma$. Let us consider the following two-sided hypotheses:

$$H_0: e = \hat e \text{ and } \sigma = \hat\sigma \quad \text{versus} \quad H_1: e \neq \hat e \text{ or } \sigma \neq \hat\sigma.$$

For a given significance level $\alpha$ (e.g., 0.01), Ye and Liu (2020b) suggested the test $W$ given by Eq. (8). If $(\hat\varepsilon_{k+1}, \hat\varepsilon_{k+2}, \ldots, \hat\varepsilon_n) \in W$, then we reject $H_0$, and the estimations $\hat e$ and $\hat\sigma$ do not pass the test. Otherwise, we accept $H_0$, and the estimations $\hat e$ and $\hat\sigma$ pass the test.

Assume the estimations $\hat e$ and $\hat\sigma$ do not pass the test, i.e., $(\hat\varepsilon_{k+1}, \hat\varepsilon_{k+2}, \ldots, \hat\varepsilon_n) \in W$. For each index $t$ with $k + 1 \le t \le n$, the data $Z_t$ is called an outlier if its residual $\hat\varepsilon_t$ falls outside the two bounds of the test (8); otherwise, $Z_t$ is called a normal point. Since the outliers cause the estimations $\hat e$ and $\hat\sigma$ not to pass the test, we modify the outliers and then re-estimate the parameters with the modified data. The detailed process of data modification is as follows. For each index $t$ ($t = k + 1, k + 2, \ldots, n$), if $Z_t$ is not an outlier, then it remains unchanged. Otherwise the data modification can be broken down into four cases.
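Case I: If there is a normal point following $Z_t$ and $Z_t$ is not on the line $l(s)$ formed by the normal point nearest to $Z_t$ forward and the normal point nearest to $Z_t$ backward in position, then replace the value of $Z_t$ with the value of $l(t)$. For example, assume $Z_{k+1}$ and $Z_{k+2}$ are two outliers while $Z_k$ and $Z_{k+3}$ are not, and assume neither $Z_{k+1}$ nor $Z_{k+2}$ is on the line $l(s)$ formed by $Z_k$ and $Z_{k+3}$ (Fig. 5). Then we replace the values of $Z_{k+1}$ and $Z_{k+2}$ with

$$l(k+1) = Z_k + \tfrac{1}{3} (Z_{k+3} - Z_k), \quad l(k+2) = Z_k + \tfrac{2}{3} (Z_{k+3} - Z_k),$$

respectively.

Case II: If there is a normal point following $Z_t$ and $Z_t$ is on the line $l(s)$ formed by the normal point nearest to $Z_t$ forward and the normal point nearest to $Z_t$ backward in position, then replace the value of $Z_t$ with its fitted value. For example, assume $Z_{k+1}$ and $Z_{k+2}$ are two outliers while $Z_k$ and $Z_{k+3}$ are not, and assume $Z_{k+1}$ and $Z_{k+2}$ are both on the line $l(s)$ formed by $Z_k$ and $Z_{k+3}$ (Fig. 6). Then we replace the values of $Z_{k+1}$ and $Z_{k+2}$ with the fitted values $\hat Z_{k+1}$ and $\hat Z_{k+2}$ given by the fitted ARMA($k$, 0) model.

Case III: If there is no normal point following $Z_t$ and $Z_t$ is not on the line $l(s)$ formed by the normal point nearest to $Z_t$ forward and the normal point second nearest to $Z_t$ forward in position, then replace the value of $Z_t$ with the value of $l(t)$. For example, assume $Z_{n-1}$ and $Z_n$ are two outliers while $Z_{n-3}$ and $Z_{n-2}$ are not, and assume neither $Z_{n-1}$ nor $Z_n$ is on the line $l(s)$ formed by $Z_{n-3}$ and $Z_{n-2}$ (Fig. 7). Then we replace the values of $Z_{n-1}$ and $Z_n$ with

$$l(n-1) = Z_{n-2} + 1 \cdot (Z_{n-2} - Z_{n-3}), \quad l(n) = Z_{n-2} + 2 \cdot (Z_{n-2} - Z_{n-3}),$$

respectively.

Case IV: If there is no normal point following $Z_t$ and $Z_t$ is on the line $l(s)$ formed by the normal point nearest to $Z_t$ forward and the normal point second nearest to $Z_t$ forward in position, then replace the value of $Z_t$ with its fitted value. For example, assume $Z_{n-1}$ and $Z_n$ are two outliers while $Z_{n-3}$ and $Z_{n-2}$ are not, and assume $Z_{n-1}$ and $Z_n$ are both on the line $l(s)$ formed by $Z_{n-3}$ and $Z_{n-2}$ (Fig. 8). Then we replace the values of $Z_{n-1}$ and $Z_n$ with the fitted values $\hat Z_{n-1}$ and $\hat Z_n$ given by the fitted ARMA($k$, 0) model.

After a round of data modification, a new set of data $Z_1, Z_2, \ldots, Z_n$ is obtained. We then use the new data to recalculate $\hat a_0, \hat a_1, \ldots, \hat a_k$, $\hat\varepsilon_{k+1}, \hat\varepsilon_{k+2}, \ldots, \hat\varepsilon_n$, $\hat e$, and $\hat\sigma$ with the method proposed in Sect. 3.2 and test the estimations $\hat e$ and $\hat\sigma$ with the method in Sect. 3.3. If the estimations do not pass the test, then we continue to modify the data and repeat the process above until the estimations pass the test.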
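One round of the data modification described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the two bounds of the test are passed in by the caller, indices are 0-based, and "forward" is read as "preceding in time", as the examples with $Z_{n-3}$ and $Z_{n-2}$ suggest. The first $k$ observations have no residual and are assumed to be flagged as normal by the caller.

```python
import math

def flag_outliers(residuals, lower, upper):
    """Mark a residual as an outlier when it lies outside [lower, upper]."""
    return [not (lower <= r <= upper) for r in residuals]

def modify_outliers(z, is_outlier, fitted=None):
    """Return a copy of z with every flagged point replaced.

    z          : observations
    is_outlier : booleans, same length as z
    fitted     : optional fitted values, used when an outlier already
                 lies on the line (Cases II and IV)
    """
    z = [float(v) for v in z]
    normal = [s for s in range(len(z)) if not is_outlier[s]]
    for t in range(len(z)):
        if not is_outlier[t]:
            continue
        before = [s for s in normal if s < t]
        after = [s for s in normal if s > t]
        if before and after:        # Cases I and II: normal points on both sides
            s0, s1 = before[-1], after[0]
        elif len(before) >= 2:      # Cases III and IV: extrapolate from the last two
            s0, s1 = before[-2], before[-1]
        else:
            continue                # too few normal points, leave unchanged
        slope = (z[s1] - z[s0]) / (s1 - s0)
        line_value = z[s0] + slope * (t - s0)
        if fitted is not None and math.isclose(z[t], line_value):
            z[t] = float(fitted[t])  # already on the line: use the fitted value
        else:
            z[t] = line_value        # otherwise move the point onto the line
    return z

# Two interior outliers between the normal points at positions 1 and 4.
print(modify_outliers([0, 1, 10, -5, 4, 5],
                      [False, False, True, True, False, False]))
```

Only flagged entries are overwritten, so the line is always built from unmodified normal points, which matches the examples with $Z_k$, $Z_{k+3}$ and with $Z_{n-3}$, $Z_{n-2}$.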
Then, the final uncertain ARMA($k$, 0) model is

$$Z_t = \hat a_0 + \sum_{i=1}^{k} \hat a_i Z_{t-i} + \varepsilon_t, \quad \varepsilon_t \sim N(\hat e, \hat\sigma). \tag{9}$$

According to the final uncertain ARMA($k$, 0) model (9), the forecast value of $Z_{n+1}$ is

$$\hat Z_{n+1} = \hat a_0 + \sum_{i=1}^{k} \hat a_i Z_{n+1-i} + \hat e. \tag{10}$$

Taking $\beta$ as a confidence level (e.g., 95%), we find the minimum $c$ such that

$$\mathcal{M}\{\hat Z_{n+1} - c \le Z_{n+1} \le \hat Z_{n+1} + c\} \ge \beta. \tag{11}$$

Then the $\beta$-confidence interval of $Z_{n+1}$ is $[\hat Z_{n+1} - c, \hat Z_{n+1} + c]$.

This section condenses the method proposed in Sects. 3.1-3.5 into an algorithm that addresses the outliers found in the uncertain hypothesis test and obtains estimations that pass the test.

Step 1. (Parameter estimation) Use the data $Z_1, Z_2, \ldots, Z_n$ to compute the least squares estimations of $a_0, a_1, \ldots, a_k$ in the uncertain ARMA($k$, 0) model with Eq. (4), denoted by $\hat a_0, \hat a_1, \ldots, \hat a_k$.
Step 2. (Residual analysis) Calculate $\hat\varepsilon_{k+1}, \hat\varepsilon_{k+2}, \ldots, \hat\varepsilon_n$, $\hat e$, and $\hat\sigma^2$ with Eqs. (5)-(7).
Step 3. (Uncertain hypothesis test) Take a significance level $\alpha$ (e.g., 0.01), and construct the test $W$ with Eq. (8).
Step 4. (Data modification) If $(\hat\varepsilon_{k+1}, \hat\varepsilon_{k+2}, \ldots, \hat\varepsilon_n) \notin W$, then go to Step 5. Otherwise, start the data modification: for each $t$ ($t = k + 1, k + 2, \ldots, n$), reset the data $Z_t$ by the method proposed in Sect. 3.4 if $Z_t$ is an outlier; otherwise it remains unchanged. Then, return to Step 1.
Step 5. (Forecast value and confidence interval) Calculate the forecast value $\hat Z_{n+1}$ of $Z_{n+1}$ with Eq. (10). Take $\beta$ (e.g., 95%) as a confidence level, and find the minimum $c$ satisfying Eq. (11). Then, the $\beta$-confidence interval of $Z_{n+1}$ is $[\hat Z_{n+1} - c, \hat Z_{n+1} + c]$.

In this section, we apply uncertain time series analysis to modeling the cumulative number of confirmed COVID-19 cases in China from February 13 to March 23, 2020 with the uncertain ARMA($k$, 0) model.

To select the order $k$ in the uncertain ARMA($k$, 0) model, we employ the $\nu$-fold rolling origin cross-validation proposed by Liu and Yang (2020a), based on the data $Z_1, Z_2, \ldots, Z_{38}$. We consider a maximum order of 20 and $\nu = 26$. The corresponding average testing errors for the different orders are shown in Table 2.
From Table 2, we can see that the average testing error is minimal when $k = 5$. Thus we choose the uncertain ARMA(5, 0) model.

According to the data $Z_1, Z_2, \ldots, Z_{38}$ in Sect. 2, we can obtain the least squares estimations of $a_0, a_1, \ldots, a_5$ and the estimations $\hat e = 0.0000$ and $\hat\sigma = 78.593$, which give the uncertain ARMA(5, 0) model (12).

Taking the significance level $\alpha = 0.01$, we obtain the two bounds $-199.11$ and $199.11$ from $\Phi^{-1}$, the inverse uncertainty distribution of $N(0.0000, 78.593)$. It follows from Eq. (8) that the test is

$$W = \left\{(\hat\varepsilon_6, \hat\varepsilon_7, \ldots, \hat\varepsilon_{38}) \in \Re^{33} : -199.11 \le \hat\varepsilon_t \le 199.11,\ t = 6, 7, \ldots, 38\right\}^c.$$

Figure 9 shows the residuals in Table 3 and the two bounds ($-199.11$ and $199.11$). From Fig. 9, we can see that only $\hat\varepsilon_{16} = -286.6$ does not fall into the interval $[-199.11, 199.11]$. Thus, $Z_{16}$ is an outlier, and $(\hat\varepsilon_6, \hat\varepsilon_7, \ldots, \hat\varepsilon_{38}) \in W$, which implies that the estimations $\hat e$ and $\hat\sigma$ in the model (12) do not pass the test. Then, we reset $Z_{16}$, and the rest of the data remain unchanged. Figure 10 shows the reset data.

Based on the reset data in Fig. 10, we can compute the estimations $\hat a_0 = 2.7052$, $\hat a_1 = -0.3532$, $\hat a_2 = -0.1155$, $\hat a_3 = 0.0298$, $\hat a_4 = 0.3490$, $\hat a_5 = 0.0484$, $\hat e = 0.0000$, $\hat\sigma = 48.208$, and all residuals $\hat\varepsilon_6, \hat\varepsilon_7, \ldots, \hat\varepsilon_{38}$. It follows from Eq. (8) that the test is

$$W = \left\{(\hat\varepsilon_6, \hat\varepsilon_7, \ldots, \hat\varepsilon_{38}) \in \Re^{33} : -122.13 \le \hat\varepsilon_t \le 122.13,\ t = 6, 7, \ldots, 38\right\}^c.$$

Figure 11 shows the residuals and the two bounds ($-122.13$ and $122.13$). From Fig. 11, we again find residuals outside the interval $[-122.13, 122.13]$, so the corresponding data are reset. The rest of the data remain unchanged. Figure 12 shows the reset data.

After repeating the process 15 times as described in Sects. 4.2 and 4.3, we can compute the final estimations. Figure 13 shows the residuals and the two bounds ($-26.823$ and $26.823$). From Fig. 13, we can see that all residuals fall into the interval $[-26.823, 26.823]$, that is, $(\hat\varepsilon_6, \hat\varepsilon_7, \ldots, \hat\varepsilon_{38}) \notin W$, implying that the estimations $\hat e$ and $\hat\sigma$ pass the test.
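As a quick arithmetic check, the three pairs of bounds reported in this section can be reproduced from the inverse of the normal uncertainty distribution, $\Phi^{-1}(\alpha) = e + \frac{\sigma\sqrt{3}}{\pi} \ln\frac{\alpha}{1-\alpha}$, evaluated at 0.01 and 0.99. The sketch below only verifies the reported numbers; the function name is hypothetical, and the choice of the arguments 0.01 and 0.99 is an observation about the figures above, not a statement of how Eq. (8) defines its thresholds.

```python
import math

def inverse_normal_uncertainty(alpha, e, sigma):
    """Inverse uncertainty distribution of N(e, sigma) at level alpha."""
    return e + sigma * math.sqrt(3) / math.pi * math.log(alpha / (1 - alpha))

# Estimated standard deviations reported for the first, second, and final rounds.
for sigma_hat in (78.593, 48.208, 10.588):
    lower = inverse_normal_uncertainty(0.01, 0.0, sigma_hat)
    upper = inverse_normal_uncertainty(0.99, 0.0, sigma_hat)
    print(f"sigma = {sigma_hat}: [{lower:.2f}, {upper:.2f}]")
```

The printed intervals agree with $\pm 199.11$, $\pm 122.13$, and $\pm 26.823$ up to rounding of the reported $\hat\sigma$ values.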
Therefore, we obtain the final uncertain ARMA(5, 0) model (13). Compared with the estimated variance $201.41^2$ in the model (2) produced by the classical time series analysis, the estimated variance $10.588^2$ in the model (13) provided by the uncertain time series analysis is much smaller. In summary, the data were modified 15 times in total, and a total of 8 days of data were modified. Table 4 shows the results of the data modification. It follows from the model (13) and the method in Sect. 3.2 that the forecast value of $Z_{39}$ is $\hat Z_{39} = 1.6145$.

This section compares the results produced by the classical time series analysis and the uncertain time series analysis. In Sect. 2, we obtained the classical ARMA(4,0) model (2), and in Sect. 4 the final uncertain ARMA(5, 0) model (13). All the results are shown in Table 5.

On the one hand, from Table 5, we can see that the estimated variance obtained by the classical time series analysis, $201.41^2$, is too large to be accepted, while the estimated variance obtained by the uncertain time series analysis, $10.588^2$, makes more sense. In terms of variance, uncertain time series analysis is more appropriate than classical time series analysis for predicting the cumulative number of confirmed COVID-19 cases, because the uncertain hypothesis test can automatically filter out outliers.

On the other hand, in Sect. 2, we illustrated that the residual plot for the classical ARMA(4,0) model does not look like a null plot. In this case, if we use classical time series analysis, then the distribution function is not close enough to the frequency we get. Thus the disturbance term of the classical time series model cannot be regarded as a random variable. Liu (2019a) suggested using probability theory if the distribution function is close enough to the frequency, and uncertainty theory otherwise.
Therefore the disturbance term should be regarded as an uncertain variable instead of a random variable, and uncertain time series analysis, rather than classical time series analysis, should be used to predict future cases of COVID-19. Furthermore, the 95% confidence interval can also be obtained with the uncertain time series analysis.

COVID-19 swept the world suddenly like a rainstorm. Increasing numbers of people suffer from this disease, so it is important to study the trend of COVID-19 development. This paper first used the cumulative number of confirmed COVID-19 cases in China, excluding imported cases, from February 13 to March 23, 2020 to predict future cumulative confirmed cases by classical time series analysis. However, although the data and the model passed the white noise test and the stationarity test, the results were not ideal. This is mainly reflected in two observations. One is that the estimated variance of the disturbance term was too large to be accepted. The other is that the disturbance term could not be regarded as a random variable because the residual plot for the classical time series model did not look like a null plot. To address these two problems, we employed uncertain time series analysis, including least squares estimations, residual analysis, the uncertain hypothesis test, the forecast value, and the confidence interval. The uncertain hypothesis test, with its ability to automatically identify outliers, reduced the estimated variance of the disturbance term to an acceptable value. Thus, uncertain time series analysis was more appropriate than classical time series analysis for predicting the cumulative number of confirmed COVID-19 cases in China. In the future, uncertain time series analysis can also be applied to modeling other statistics about COVID-19 in China or the world, such as cumulative COVID-19 deaths and cumulative cured COVID-19 cases.
Tukey's biweight estimation for uncertain regression model with imprecise observations
Uncertain revised regression analysis with responses of logarithmic, square root and reciprocal transformations
Uncertain Gompertz regression model with imprecise observations
Residual and confidence interval for uncertain regression model with imprecise observations
Uncertain maximum likelihood estimation with application to uncertain regression analysis
Uncertain growth model for the cumulative number of COVID-19 infections in China. Fuzzy Optimization and Decision Making
Cross-validation for the uncertain Chapman-Richards growth model with imprecise observations
Uncertainty theory
Some research problems in uncertainty theory
Uncertain urn problems and Ellsberg experiment
Leave-p-out cross validation test for uncertain Verhulst-Pearl model with imprecise observations
Cross validation for uncertain autoregressive model
Least absolute deviations estimation for uncertain regression with imprecise observations. Fuzzy Optimization and Decision Making
Coronavirus disease (COVID-19) situation reports
Uncertain multivariable regression model
Uncertain vector autoregressive model with imprecise observations
Uncertain time series analysis with imprecise observations. Fuzzy Optimization and Decision Making
Least-squares estimation for uncertain moving average model
Least absolute deviations estimation for uncertain autoregressive model
Uncertain regression analysis: An approach for imprecise observations
Uncertain hypothesis test with application to uncertain regression analysis
Multivariate uncertain regression model with imprecise observations
Least absolute deviations for uncertain multivariate regression model
Analytic solution of uncertain autoregressive model based on principle of least squares