key: cord-0796162-um60l5ph authors: Chen, Dan title: Uncertain regression model with autoregressive time series errors date: 2021-10-21 journal: Soft comput DOI: 10.1007/s00500-021-06362-4 sha: 9a0636175486b8eff9affffbe1aa705443345779 doc_id: 796162 cord_uid: um60l5ph

The uncertain regression model is a powerful analytical tool for exploring the relationship between explanatory variables and response variables. It is usually assumed that the errors of the regression equations are independent. However, in many cases, the error terms are highly positively autocorrelated. Assuming that the errors have an autoregressive structure, this paper first proposes an uncertain regression model with autoregressive time series errors. Then, the principle of least squares is used to estimate the unknown parameters in the model. Besides, this new methodology is used to analyze and predict the cumulative number of confirmed COVID-19 cases in China. Finally, this paper gives a comparative analysis of the uncertain regression model, the difference plus uncertain autoregressive model, and the uncertain regression model with autoregressive time series errors. From the comparison, it is concluded that the uncertain regression model with autoregressive time series errors can improve the accuracy of predictions compared with the uncertain regression model.

Uncertain statistics is a set of mathematical techniques for collecting, analyzing, and interpreting data by uncertainty theory (Liu 2007). Uncertain regression analysis, as a branch of uncertain statistics, is a set of statistical techniques that use uncertainty theory to explore the relationship between explanatory variables and response variables. The study of uncertain regression analysis was started by Yao and Liu (2018), who assumed that the disturbance term is an uncertain variable instead of a stochastic variable. To make point estimates of the unknown parameters in an uncertain regression model, they suggested the least squares estimation.
Subsequently, the least absolute deviations estimation was considered, Chen (2020) investigated the Tukey biweight estimation, and Lio and Liu (2020) proposed the maximum likelihood estimation. Lio and Liu (2018) studied residuals and confidence intervals for the uncertain regression model, and Ye and Liu (2021) introduced the uncertain hypothesis test. In addition, uncertain regression analysis has been extended in various directions, such as multivariate regression analysis (Song and Fu 2018; Ye and Liu 2020; Zhang et al. 2020) and cross-validation (Liu and Jia 2020; Liu 2019).

Uncertain time series analysis, as another branch of uncertain statistics, is a set of statistical techniques that use uncertainty theory to predict future values based on previously observed values. The study of uncertain time series analysis was started by Yang and Liu (2019), who assumed that the disturbance term is an uncertain variable instead of a stochastic variable. To explore the relationship between these observations, they presented an uncertain autoregressive (UAR) model. Furthermore, they applied the principle of least squares to estimate the unknown autoregressive parameters. Then, the least absolute deviations estimation was considered, and Chen and Yang (2021a, b) investigated the ridge estimation and the maximum likelihood estimation. To determine the optimum order of the UAR model, a cross-validation method was given. In addition, some researchers studied other uncertain time series models, such as the 1-order uncertain moving average model (Yang and Ni 2021) and the uncertain vector autoregressive model (Tang 2020).

To estimate the unknown parameters in an uncertain differential equation so that it fits the observed data as closely as possible, several methods were proposed, for example, the method of moments, minimum cover estimation, least squares estimation (Sheng et al. 2020), generalized moment estimation, and maximum likelihood estimation. To obtain the unknown initial value of an uncertain differential equation based on observed data, Lio and Liu (2021) proposed an estimation method.
In recent studies, these statistical techniques were utilized for COVID-19 prediction. For example, the uncertain logistic growth model was applied to fit the cumulative number of confirmed COVID-19 cases in China. Ye and Yang (2021) applied the UAR model to analyze the second difference of the cumulative number. Following that, an uncertain SIR model was presented, and Jia and Chen (2021) proposed an uncertain SEIAR model by employing high-dimensional uncertain differential equations. Concerned about the time when COVID-19 started spreading in China, Lio and Liu (2021) inferred the zero-day of COVID-19 spread using the initial value estimation.

This paper proposes a new methodology for the analysis and prediction of time series data. Cochrane and Orcutt (1949) presented evidence showing that the error terms involved in most current formulations of economic relations are highly positively autocorrelated. They indicated that in many cases the assumption of random error terms is not a very good approximation to the truth. Assuming that the errors are an autoregressive process of finite order, Durbin (1960) proposed a two-stage procedure that yields asymptotically efficient estimates in the linear model. Similarly, in uncertain regression analysis it is an oversimplification to assume that the error terms are independent in time. To improve this situation, this paper considers a different type of model in which the errors have an autoregressive structure, i.e., the uncertain regression model with autoregressive time series errors.

The rest of this paper is organized as follows. Section 2 introduces the uncertain regression model with autoregressive time series errors in detail, including parameter estimation, residual analysis, forecast value, and confidence interval. In Sect. 3, the approach is applied to modeling the cumulative number of confirmed COVID-19 cases in China.
A comparative study of the uncertain regression model, the difference plus UAR model, and the uncertain regression model with autoregressive time series errors is presented in Sect. 4. Section 5 shows that the stochastic regression model with autoregressive time series errors is not suitable. Finally, some conclusions are drawn in Sect. 6.

The uncertain regression model with autoregressive time series errors has the form

$$Y_t = f(X_{t1}, X_{t2}, \dots, X_{tp} \mid \beta) + Z_t \tag{1}$$

where $Y_t$ is a response series, $(X_{t1}, X_{t2}, \dots, X_{tp})$ is a vector of explanatory series, $\beta$ is an unknown vector of parameters, $f(X_{t1}, X_{t2}, \dots, X_{tp} \mid \beta)$ represents the effect of $(X_{t1}, X_{t2}, \dots, X_{tp})$ on $Y_t$, and $Z_t$ is an error series for $t = 1, 2, \dots, n$. We assume that the errors of (1) follow a k-order uncertain autoregressive model, that is,

$$Z_t = a_0 + \sum_{i=1}^{k} a_i Z_{t-i} + \varepsilon_t$$

where the autoregressive coefficients $a_0, a_1, \dots, a_k$ are unknown, and $\varepsilon_t$ are uncertain disturbances (uncertain variables) for $t = k+1, k+2, \dots, n$.

Remark 1 Uncertain regression (linear or nonlinear) and uncertain autoregressive models are special cases of (1). When $(X_{t1}, X_{t2}, \dots, X_{tp})$ contains the time variable $t$, (1) is used by economists to study the trend of $Y_t$.

In the regression model with autoregressive time series errors (1), if $f$ is a linear function, i.e.,

$$f(X_{t1}, X_{t2}, \dots, X_{tp} \mid \beta) = \beta_0 + \beta_1 X_{t1} + \beta_2 X_{t2} + \dots + \beta_p X_{tp},$$

then it is called a linear regression model with autoregressive time series errors. If $f$ is a logistic function of a single explanatory series $X_t$, i.e.,

$$f(X_t \mid \beta_0, \beta_1, \beta_2) = \frac{\beta_0}{1 + \beta_1 \exp(-\beta_2 X_t)},$$

then it is called a logistic growth model with autoregressive time series errors.

Assume $(x_{t1}, x_{t2}, \dots, x_{tp}, y_t)$ are the observed data at times $t$ for $t = 1, 2, \dots, n$, respectively.
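Based on the observed data, the least squares estimate of $\beta$ in the uncertain regression model with autoregressive time series errors is the solution, $\beta^*$, of the minimization problem

$$\min_{\beta} \sum_{t=1}^{n} \bigl( y_t - f(x_{t1}, x_{t2}, \dots, x_{tp} \mid \beta) \bigr)^2 .$$

Then, for each index $t$ ($t = 1, 2, \dots, n$), the errors can be calculated as

$$z_t = y_t - f(x_{t1}, x_{t2}, \dots, x_{tp} \mid \beta^*) .$$

The errors $z_1, z_2, \dots, z_n$ will be regarded as the samples of $Z_t$. The least squares estimate of $(a_0, a_1, \dots, a_k)$ in the uncertain regression model with autoregressive time series errors is the solution, $(a_0^*, a_1^*, \dots, a_k^*)$, of the minimization problem

$$\min_{a_0, a_1, \dots, a_k} \sum_{t=k+1}^{n} \Bigl( z_t - a_0 - \sum_{i=1}^{k} a_i z_{t-i} \Bigr)^2 .$$

Thus, the fitted regression model with autoregressive time series errors is determined by

$$\begin{cases} Y_t = f(X_{t1}, X_{t2}, \dots, X_{tp} \mid \beta^*) + Z_t \\ Z_t = a_0^* + \sum_{i=1}^{k} a_i^* Z_{t-i} . \end{cases}$$

Example 1 Let $(x_{t1}, x_{t2}, \dots, x_{tp}, y_t)$ be the observed data at times $t$ for $t = 1, 2, \dots, n$, respectively. The least squares estimates of $\beta_0, \beta_1, \dots, \beta_p$ and $a_0, a_1, \dots, a_k$ in the linear regression model with autoregressive time series errors solve the minimization problems

$$\min_{\beta_0, \dots, \beta_p} \sum_{t=1}^{n} \Bigl( y_t - \beta_0 - \sum_{j=1}^{p} \beta_j x_{tj} \Bigr)^2$$

and

$$\min_{a_0, \dots, a_k} \sum_{t=k+1}^{n} \Bigl( z_t - a_0 - \sum_{i=1}^{k} a_i z_{t-i} \Bigr)^2 ,$$

respectively, where

$$z_t = y_t - \beta_0^* - \sum_{j=1}^{p} \beta_j^* x_{tj}$$

for $t = 1, 2, \dots, n$.

Example 2 Let $(x_t, y_t)$ be the observed data at times $t$ for $t = 1, 2, \dots, n$, respectively. The least squares estimates of $\beta_0, \beta_1, \beta_2$ and $a_0, a_1, \dots, a_k$ in the logistic growth model with autoregressive time series errors solve the minimization problems

$$\min_{\beta_0, \beta_1, \beta_2} \sum_{t=1}^{n} \Bigl( y_t - \frac{\beta_0}{1 + \beta_1 \exp(-\beta_2 x_t)} \Bigr)^2$$

and

$$\min_{a_0, \dots, a_k} \sum_{t=k+1}^{n} \Bigl( z_t - a_0 - \sum_{i=1}^{k} a_i z_{t-i} \Bigr)^2 ,$$

respectively, where

$$z_t = y_t - \frac{\beta_0^*}{1 + \beta_1^* \exp(-\beta_2^* x_t)}$$

for $t = 1, 2, \dots, n$.

Definition 1 Let $(x_{t1}, x_{t2}, \dots, x_{tp}, y_t)$ be the observed data at times $t$ for $t = 1, 2, \dots, n$, respectively, and let the fitted regression model with autoregressive time series errors be

$$\begin{cases} Y_t = f(X_{t1}, X_{t2}, \dots, X_{tp} \mid \beta^*) + Z_t \\ Z_t = a_0^* + \sum_{i=1}^{k} a_i^* Z_{t-i} . \end{cases}$$

Then, for each index $t$ ($t = k+1, k+2, \dots, n$), the term

$$\hat\varepsilon_t = z_t - a_0^* - \sum_{i=1}^{k} a_i^* z_{t-i}, \qquad z_t = y_t - f(x_{t1}, x_{t2}, \dots, x_{tp} \mid \beta^*),$$

is called the $t$-th residual.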
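The two-stage estimation and the residuals of Definition 1 can be sketched in a few lines for the linear case. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function name `fit_linear_with_ar_errors` and the toy data are hypothetical, and NumPy's `np.linalg.lstsq` is used as the least squares solver.

```python
import numpy as np

def fit_linear_with_ar_errors(X, y, k):
    """Two-stage least squares: fit the regression, then an AR(k) model on its errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Stage 1: beta* minimizes sum_t (y_t - beta_0 - sum_j beta_j x_tj)^2.
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    z = y - A @ beta  # errors z_1, ..., z_n
    # Stage 2: a* minimizes sum_{t>k} (z_t - a_0 - sum_i a_i z_{t-i})^2.
    lags = np.column_stack([z[k - i:n - i] for i in range(1, k + 1)])
    B = np.column_stack([np.ones(n - k), lags])
    a, *_ = np.linalg.lstsq(B, z[k:], rcond=None)
    resid = z[k:] - B @ a  # residuals for t = k+1, ..., n
    return beta, a, z, resid

# Illustrative data only (not the data analyzed in this paper).
t = np.arange(1.0, 21.0)
y = 5.0 + 2.0 * t + np.sin(t)
beta_star, a_star, z, eps_hat = fit_linear_with_ar_errors(t, y, k=2)
```

For a nonlinear $f$ such as the logistic function, the first stage has no closed-form solution and would require an iterative nonlinear least squares solver; the second stage is unchanged.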
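The residuals $\hat\varepsilon_{k+1}, \hat\varepsilon_{k+2}, \dots, \hat\varepsilon_n$ will be regarded as the samples of the uncertain disturbance terms $\varepsilon_t$ in the uncertain regression model with autoregressive time series errors. Thus, the expected value of $\varepsilon_t$ can be estimated as the average of the residuals, i.e.,

$$\hat e = \frac{1}{n-k} \sum_{t=k+1}^{n} \hat\varepsilon_t ,$$

and the variance can be estimated as

$$\hat\sigma^2 = \frac{1}{n-k} \sum_{t=k+1}^{n} (\hat\varepsilon_t - \hat e)^2 .$$

Therefore, we may assume the estimated disturbance terms $\hat\varepsilon_t$ follow the normal uncertainty distribution $N(\hat e, \hat\sigma)$.

Example 3 Let $(x_{t1}, x_{t2}, \dots, x_{tp}, y_t)$ be the observed data at times $t$ for $t = 1, 2, \dots, n$, respectively, and let the fitted linear regression model with autoregressive time series errors be

$$\begin{cases} Y_t = \beta_0^* + \sum_{j=1}^{p} \beta_j^* X_{tj} + Z_t \\ Z_t = a_0^* + \sum_{i=1}^{k} a_i^* Z_{t-i} . \end{cases}$$

The expected value of the estimated disturbance terms $\hat\varepsilon_t$ is

$$\hat e = \frac{1}{n-k} \sum_{t=k+1}^{n} \Bigl( z_t - a_0^* - \sum_{i=1}^{k} a_i^* z_{t-i} \Bigr)$$

and the variance is

$$\hat\sigma^2 = \frac{1}{n-k} \sum_{t=k+1}^{n} \Bigl( z_t - a_0^* - \sum_{i=1}^{k} a_i^* z_{t-i} - \hat e \Bigr)^2 ,$$

where $z_t = y_t - \beta_0^* - \sum_{j=1}^{p} \beta_j^* x_{tj}$.

Example 4 Let $(x_t, y_t)$ be the observed data at times $t$ for $t = 1, 2, \dots, n$, respectively, and let the fitted logistic growth model with autoregressive time series errors be

$$\begin{cases} Y_t = \dfrac{\beta_0^*}{1 + \beta_1^* \exp(-\beta_2^* X_t)} + Z_t \\ Z_t = a_0^* + \sum_{i=1}^{k} a_i^* Z_{t-i} . \end{cases}$$

The expected value of the estimated disturbance terms $\hat\varepsilon_t$ is

$$\hat e = \frac{1}{n-k} \sum_{t=k+1}^{n} \Bigl( z_t - a_0^* - \sum_{i=1}^{k} a_i^* z_{t-i} \Bigr)$$

and the variance is

$$\hat\sigma^2 = \frac{1}{n-k} \sum_{t=k+1}^{n} \Bigl( z_t - a_0^* - \sum_{i=1}^{k} a_i^* z_{t-i} - \hat e \Bigr)^2 ,$$

where $z_t = y_t - \beta_0^* / (1 + \beta_1^* \exp(-\beta_2^* x_t))$.

Remark 3 After that, we apply the uncertain hypothesis test (Ye and Liu 2021) to evaluate the appropriateness of the fitted regression model with autoregressive time series errors and the estimated disturbance terms.

Let $(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p})$ be an explanatory vector at time $n+1$. Assume (i) the fitted regression model with autoregressive time series errors is

$$\begin{cases} Y_t = f(X_{t1}, X_{t2}, \dots, X_{tp} \mid \beta^*) + Z_t \\ Z_t = a_0^* + \sum_{i=1}^{k} a_i^* Z_{t-i} \end{cases}$$

and (ii) the estimated disturbance terms $\hat\varepsilon_t$ follow the normal uncertainty distribution with expected value $\hat e$ determined by (21) and variance $\hat\sigma^2$ determined by (22). The forecast uncertain variable of $Y_{n+1}$ with respect to $(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p})$ is determined by

$$\hat Y_{n+1} = f(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p} \mid \beta^*) + a_0^* + \sum_{i=1}^{k} a_i^* z_{n+1-i} + \hat\varepsilon_{n+1}, \qquad \hat\varepsilon_{n+1} \sim N(\hat e, \hat\sigma),$$

and the forecast value is defined as the expected value of the forecast uncertain variable $\hat Y_{n+1}$, i.e.,

$$\hat y_{n+1} = f(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p} \mid \beta^*) + a_0^* + \sum_{i=1}^{k} a_i^* z_{n+1-i} + \hat e .$$

Example 5 Let $(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p})$ be an explanatory vector at time $n+1$.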
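Assume (i) the fitted linear regression model with autoregressive time series errors is

$$\begin{cases} Y_t = \beta_0^* + \sum_{j=1}^{p} \beta_j^* X_{tj} + Z_t \\ Z_t = a_0^* + \sum_{i=1}^{k} a_i^* Z_{t-i} \end{cases}$$

and (ii) the estimated disturbance terms $\hat\varepsilon_t$ follow the normal uncertainty distribution with expected value $\hat e$ and variance $\hat\sigma^2$. The forecast uncertain variable of $Y_{n+1}$ with respect to $(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p})$ is

$$\hat Y_{n+1} = \beta_0^* + \sum_{j=1}^{p} \beta_j^* x_{n+1,j} + a_0^* + \sum_{i=1}^{k} a_i^* z_{n+1-i} + \hat\varepsilon_{n+1}, \qquad \hat\varepsilon_{n+1} \sim N(\hat e, \hat\sigma),$$

and the forecast value of $Y_{n+1}$ is

$$\hat y_{n+1} = \beta_0^* + \sum_{j=1}^{p} \beta_j^* x_{n+1,j} + a_0^* + \sum_{i=1}^{k} a_i^* z_{n+1-i} + \hat e .$$

Example 6 Let $x_{n+1}$ be an explanatory variable at time $n+1$. Assume (i) the fitted logistic growth model with autoregressive time series errors is

$$\begin{cases} Y_t = \dfrac{\beta_0^*}{1 + \beta_1^* \exp(-\beta_2^* X_t)} + Z_t \\ Z_t = a_0^* + \sum_{i=1}^{k} a_i^* Z_{t-i} \end{cases}$$

and (ii) the estimated disturbance terms $\hat\varepsilon_t$ follow the normal uncertainty distribution with expected value $\hat e$ and variance $\hat\sigma^2$. The forecast uncertain variable of $Y_{n+1}$ with respect to $x_{n+1}$ is

$$\hat Y_{n+1} = \frac{\beta_0^*}{1 + \beta_1^* \exp(-\beta_2^* x_{n+1})} + a_0^* + \sum_{i=1}^{k} a_i^* z_{n+1-i} + \hat\varepsilon_{n+1}, \qquad \hat\varepsilon_{n+1} \sim N(\hat e, \hat\sigma),$$

and the forecast value of $Y_{n+1}$ is

$$\hat y_{n+1} = \frac{\beta_0^*}{1 + \beta_1^* \exp(-\beta_2^* x_{n+1})} + a_0^* + \sum_{i=1}^{k} a_i^* z_{n+1-i} + \hat e .$$

Let $(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p})$ be an explanatory vector at time $n+1$. Assume the forecast uncertain variable of $Y_{n+1}$ with respect to $(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p})$ is

$$\hat Y_{n+1} = f(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p} \mid \beta^*) + a_0^* + \sum_{i=1}^{k} a_i^* z_{n+1-i} + \hat\varepsilon_{n+1}, \qquad \hat\varepsilon_{n+1} \sim N(\hat e, \hat\sigma).$$

Then, the forecast value of $Y_{n+1}$ is

$$\hat y_{n+1} = f(x_{n+1,1}, x_{n+1,2}, \dots, x_{n+1,p} \mid \beta^*) + a_0^* + \sum_{i=1}^{k} a_i^* z_{n+1-i} + \hat e .$$

It follows from the operational law that $\hat Y_{n+1}$ has a normal uncertainty distribution $N(\hat y_{n+1}, \hat\sigma)$, i.e.,

$$\hat\Psi(z) = \left( 1 + \exp\left( \frac{\pi (\hat y_{n+1} - z)}{\sqrt{3}\,\hat\sigma} \right) \right)^{-1} .$$

Taking $\alpha$ as the confidence level, it is easy to verify that

$$b = \frac{\sqrt{3}\,\hat\sigma}{\pi} \ln \frac{1+\alpha}{1-\alpha}$$

is the minimum value $b$ such that

$$\hat\Psi(\hat y_{n+1} + b) - \hat\Psi(\hat y_{n+1} - b) \ge \alpha .$$

Since the uncertain measure of the event $\hat y_{n+1} - b \le \hat Y_{n+1} \le \hat y_{n+1} + b$ is at least $\hat\Psi(\hat y_{n+1} + b) - \hat\Psi(\hat y_{n+1} - b) \ge \alpha$, the $\alpha$ confidence interval of $Y_{n+1}$ is

$$\hat y_{n+1} \pm \frac{\sqrt{3}\,\hat\sigma}{\pi} \ln \frac{1+\alpha}{1-\alpha} .$$

In this section, the uncertain regression model with autoregressive time series errors is applied to analyzing the cumulative number of confirmed COVID-19 cases by local transmission in China. For comparison with the uncertain regression model and the difference plus uncertain autoregressive model (Ye and Yang 2021), we use the same data, that is, the cumulative numbers of confirmed COVID-19 cases (see Table 1). The data are plotted in Fig. 1. Let $1, 2, \dots, 40$ represent the dates ($t$) from February 13 to March 23. For example, $t = 1$ and $t = 40$ represent February 13 and March 23, respectively. In order to determine the functional relationship between $t$ (the date) and $Y_t$ (the cumulative number of confirmed COVID-19 cases on date $t$), we may use the observed data

$$(t, y_t), \quad t = 1, 2, \dots, 40 \tag{44}$$

where $y_t$ are the cumulative numbers shown in Table 1 on days $t = 1, 2, \dots, 40$, respectively.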
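Before the model is fitted to these data, the forecast value and confidence interval formulas of Sect. 2 can be checked numerically. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the AR coefficients in the small forecast example are made up for illustration, and the only figures taken from this paper are the rounded forecast value 80755 and the estimated standard deviation 53.4133 reported for day 41.

```python
import math

def normal_uncertainty_inverse(alpha, e, sigma):
    """Inverse of the normal uncertainty distribution N(e, sigma) at level alpha."""
    return e + sigma * math.sqrt(3) / math.pi * math.log(alpha / (1 - alpha))

def forecast_value(f_next, a, z, e_hat):
    """Forecast value: f(x_{n+1}|beta*) + a_0 + sum_i a_i * z_{n+1-i} + e_hat."""
    k = len(a) - 1
    # z[-i] is z_{n+1-i}, i.e. the i-th most recent observed error.
    ar_part = a[0] + sum(a[i] * z[-i] for i in range(1, k + 1))
    return f_next + ar_part + e_hat

def confidence_interval(y_hat, sigma_hat, alpha):
    """alpha confidence interval: y_hat +/- (sqrt(3) sigma / pi) ln((1+alpha)/(1-alpha))."""
    b = sigma_hat * math.sqrt(3) / math.pi * math.log((1 + alpha) / (1 - alpha))
    return y_hat - b, y_hat + b

# Made-up AR(1) coefficients and errors, for illustration only.
y_next = forecast_value(10.0, [1.0, 0.5], [2.0, 4.0], 0.0)

# Rounded forecast value and estimated standard deviation reported for day 41.
low, high = confidence_interval(80755, 53.4133, 0.95)
```

With these inputs the half-width is about 107.9, which is in line with the interval [80647, 80862] reported for March 24, 2020; the small difference at the upper end comes from using the rounded forecast value.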
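In order to fit the above observed data, we employ the uncertain logistic growth model with autoregressive time series errors,

$$\begin{cases} Y_t = \dfrac{\beta_0}{1 + \beta_1 \exp(-\beta_2 t)} + Z_t \\ Z_t = a_0 + \sum_{i=1}^{k} a_i Z_{t-i} + \varepsilon_t \end{cases}$$

where $Z_t$ is an error series for $t = 1, 2, \dots, 40$, and $\varepsilon_t$ are uncertain disturbances for $t = k+1, k+2, \dots, 40$. Based on the observed data $(t, y_t)$, $t = 1, 2, \dots, 40$, Liu (2021) obtained the least squares estimates $(\beta_0^*, \beta_1^*, \beta_2^*)$ and hence the fitted logistic growth component. Then, the observed data of the error series $Z_t$ are

$$z_t = y_t - \frac{\beta_0^*}{1 + \beta_1^* \exp(-\beta_2^* t)}$$

for $t = 1, 2, \dots, 40$ (see Table 2).

Next, we apply the UAR($k$) model to modeling $z_1, z_2, \dots, z_{40}$. First, we determine the value of the order $k$ by rolling origin cross-validation. Assume that $T = 37$. The average testing error $ATE(k)$ is computed from the one-step-ahead predictions of $z_{38+m}$, where $(a_0^*, a_1^*, \dots, a_k^*)_m$ are the least squares estimates obtained from the training sets $\{z_1, z_2, \dots, z_{37+m}\}$ for $m = 0, 1, 2$, respectively. Table 3 summarizes the values of $ATE(k)$ for $k \in \{1, 2, 3, 4, 5\}$. The minimum value of $ATE(k)$ is attained at $k = 4$. Thus the UAR component is

$$Z_t = a_0 + a_1 Z_{t-1} + a_2 Z_{t-2} + a_3 Z_{t-3} + a_4 Z_{t-4} + \varepsilon_t .$$

Using the error data $z_1, z_2, \dots, z_{40}$ and solving the minimization problem

$$\min_{a_0, \dots, a_4} \sum_{t=5}^{40} \Bigl( z_t - a_0 - \sum_{i=1}^{4} a_i z_{t-i} \Bigr)^2 ,$$

we obtain a fitted UAR(4) component. From

$$\hat\varepsilon_t = z_t - a_0^* - \sum_{i=1}^{4} a_i^* z_{t-i}$$

we obtain 36 residuals $\hat\varepsilon_5, \hat\varepsilon_6, \dots, \hat\varepsilon_{40}$ shown in Fig. 2. Thus, the expected value of the estimated disturbance terms is

$$\hat e = \frac{1}{36} \sum_{t=5}^{40} \hat\varepsilon_t$$

and the variance is

$$\hat\sigma^2 = \frac{1}{36} \sum_{t=5}^{40} (\hat\varepsilon_t - \hat e)^2 .$$

Assume the estimated disturbance terms follow the normal uncertainty distribution $N(0.0000, 96.0254)$. In order to test whether it is appropriate, given a significance level $\alpha = 0.01$, the uncertain hypothesis test is applied to the hypotheses $H_0$: $e = 0.0000$ and $\sigma = 96.0254$ versus $H_1$: $e \ne 0.0000$ or $\sigma \ne 96.0254$, and the error data identified as outliers are replaced with revised values (see Table 4).

Using the revised data $z_1, z_2, \dots, z_{40}$, we obtain a new fitted UAR(4) component. From the corresponding residual formula we obtain 36 residuals $\hat\varepsilon_5, \hat\varepsilon_6, \dots, \hat\varepsilon_{40}$ shown in Fig. 3. Thus, the expected value and the variance of the estimated disturbance terms are re-estimated in the same way. Assume the estimated disturbance terms follow the normal uncertainty distribution $N(0.0002, 78.4578)$.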
In order to test whether it is appropriate, given a significance level $\alpha = 0.01$, the uncertain hypothesis test is applied to the hypotheses $H_0$: $e = 0.0002$ and $\sigma = 78.4578$ versus $H_1$: $e \ne 0.0002$ or $\sigma \ne 78.4578$, with rejection region $W$. Since $(\hat\varepsilon_5, \hat\varepsilon_6, \dots, \hat\varepsilon_{40}) \in W$, we reject $H_0$. It follows from a residual equal to $-198.7665$ (68) that $z_5$ is an outlier, and it is replaced.

After repeating the iterative procedure 5 times, we obtain a new fitted UAR(4) component. From the corresponding residual formula we obtain 36 residuals $\hat\varepsilon_5, \hat\varepsilon_6, \dots, \hat\varepsilon_{40}$ shown in Fig. 4. Thus, the expected value of the estimated disturbance terms is

$$\hat e = \frac{1}{36} \sum_{t=5}^{40} \hat\varepsilon_t$$

and the variance is

$$\hat\sigma^2 = \frac{1}{36} \sum_{t=5}^{40} (\hat\varepsilon_t - \hat e)^2 .$$

Assume the estimated disturbance terms follow the normal uncertainty distribution $N(0.0000, 53.4133)$. In order to test whether it is appropriate, given a significance level $\alpha = 0.01$, the rejection region of the uncertain hypothesis test for the hypotheses $H_0$: $e = 0.0000$ and $\sigma = 53.4133$ versus $H_1$: $e \ne 0.0000$ or $\sigma \ne 53.4133$ is the set $W$ of vectors $(w_5, w_6, \dots, w_{40})$ for which $-135.3184 \le w_t \le 135.3184$ fails for at least one index $t$ with $5 \le t \le 40$. Since $(\hat\varepsilon_5, \hat\varepsilon_6, \dots, \hat\varepsilon_{40}) \notin W$, we accept $H_0$. That is, the normal uncertainty distribution $N(0.0000, 53.4133)$ is appropriate.

Using the fitted logistic growth model with autoregressive time series errors and the estimated disturbance terms, we obtain the forecast uncertain variable $\hat Y_{41}$ of $Y_{41}$ on day 41, with forecast value $\hat y_{41} \approx 80755$, and the 95% confidence interval is

$$\hat y_{41} \pm \frac{\sqrt{3} \times 53.4133}{\pi} \ln \frac{1 + 0.95}{1 - 0.95} .$$

That is, we predict that the cumulative number on March 24, 2020 will be 80755, and we are 95% sure that the number falls into [80647, 80862].

In this section, we compare the uncertain regression model with autoregressive time series errors with the uncertain regression model and the difference plus UAR model (Ye and Yang 2021). The estimated standard deviations of the models are listed in Table 5. According to Table 5, the uncertain logistic growth model with autoregressive time series errors predicts better. This model has less information loss, but requires more computation. Ye and Yang (2021) modeled the second difference of the cumulative cases series using the UAR(5) model. However, so far, there are no definitions of stationarity and differencing in uncertain time series analysis.
In Ye and Yang (2021), the differencing operation was based on stochastic time series analysis, and this methodology (i.e., the difference plus UAR model) for analyzing time series data was flawed. In this paper, by contrast, uncertainty theory supplies the theoretical justification for the uncertain regression model with autoregressive time series errors, and this method can be used easily.

Aslam and Albassam (2019) used the neutrosophic regression model to study the relationship between prostate cancer and dietary fat level. The essential difference between neutrosophic regression and uncertain regression lies in the statistical techniques. The former uses neutrosophic statistics, which was introduced based on the idea of neutrosophic logic, while the latter uses uncertain statistics, which was introduced based on uncertainty theory. In other words, the former deals with data having neutrosophy, inexact values, unclear observations, and interval values, while the latter deals with data containing precise and exact observations. The former provides the parameters, confidence intervals, and p-values in the indeterminacy interval range, while the latter provides determined values of all parameters.

The difference between the traditional regression model with autoregressive time series errors and the uncertain regression model with autoregressive time series errors lies in how the disturbance terms are assumed. The former assumes the disturbance term is a stochastic variable, while the latter assumes the disturbance term is an uncertain variable. Since random variables and uncertain variables obey different operational laws, wrong assumptions may mislead the decision-maker. In the example of COVID-19, we use the Lilliefors test to test the normality of the residuals (see Fig. 4). The test results show that the null hypothesis is rejected.
That is, the disturbance term cannot be characterized as a normal random variable. Therefore, the stochastic regression model with autoregressive time series errors is not suitable for modeling the cumulative number.

This paper first proposed a new model, i.e., the uncertain regression model with autoregressive time series errors. Then, the principle of least squares was used to estimate the unknown parameters. Finally, we made a comparative analysis. The conclusion was that the uncertain regression model with autoregressive time series errors can improve the accuracy of predictions compared with the uncertain regression model. In future research, we will investigate how to deal with imprecise observations using neutrosophic statistics or uncertain statistics. In addition, techniques such as the goodness of fit test (Aslam 2021a), analysis of means (Aslam 2021c), and skewness and kurtosis estimators (Aslam 2021b) can be introduced into uncertain statistics as a future research endeavor.
A new goodness of fit test in the presence of uncertain parameters
A study on skewness and kurtosis estimators of wind speed distribution under indeterminacy
Analyzing wind power data using analysis of means under neutrosophic statistics
Application of neutrosophic logic to evaluate correlation between prostate cancer mortality and dietary fat assumption
Tukey's biweight estimation for uncertain regression model with imprecise observations
Maximum likelihood estimation for uncertain autoregressive model with application to carbon dioxide emissions
Ridge estimation for uncertain autoregressive model with imprecise observations
Numerical solution and parameter estimation for uncertain SIR model with application to COVID-19
Application of least squares regression to relationships containing autocorrelated error terms
Estimation of parameters in time-series regression models
Uncertain SEIAR model for COVID-19 cases in China
Residual and confidence interval for uncertain regression model with imprecise observations
Uncertain maximum likelihood estimation with application to uncertain regression analysis
Initial value estimation of uncertain differential equations and zero-day of COVID-19 spread in China
Leave-p-out cross-validation test for uncertain Verhulst-Pearl model with imprecise observations
Uncertain growth model for the cumulative number of COVID-19 infections in China
Generalized moment estimation for uncertain differential equations
Cross-validation for the uncertain Chapman-Richards growth model with imprecise observations
Cross validation for uncertain autoregressive model
Least absolute deviations estimation for uncertain regression with imprecise observations
Estimating unknown parameters in uncertain differential equation by maximum likelihood estimation
Uncertain multivariable regression model
Uncertain vector autoregressive model with imprecise observations
Uncertain time series analysis with imprecise observations
Least-squares estimation for uncertain moving average model
Parameter estimation of uncertain differential equation with application to financial market
Least absolute deviations estimation for uncertain autoregressive model
Uncertain regression analysis: an approach for imprecise observations
Parameter estimation in uncertain differential equations
Multivariate uncertain regression model with imprecise observations
Uncertain hypothesis test with application to uncertain regression analysis. Fuzzy Optim Decis Mak
Analysis and prediction of confirmed COVID-19 cases in China with uncertain time series
Least absolute deviations for uncertain multivariate regression model

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements This work was supported by the National Natural Science Foundation of China (Grants No. 61873329 and 61873084).

The authors declare that they have no conflict of interest.

Data availability All data generated or analyzed during this study are included in Tables 1 and 2.

Code availability All the codes implemented during this study are available from the corresponding author on reasonable request.

Ethical approval This paper does not contain any studies with human participants or animals performed by any of the authors.