key: cord-0991350-334q747d
authors: Knopov, P. S.; Korkhin, A. S.
title: Statistical Analysis of the Dynamics of Coronavirus Cases using Stepwise Switching Regression
date: 2020-11-25
journal: Cybern Syst Anal
DOI: 10.1007/s10559-020-00314-w
sha: d1d51b75ef8856745a996b6612a0cb59d9e91794
doc_id: 991350
cord_uid: 334q747d

It is proposed to model the dynamics of coronavirus cases using a switching regression whose switching points are unknown. The stepwise process of constructing the regression in time is described. The dynamics of the number of coronavirus cases in Ukraine is analyzed.

A switching regression is a set of regression models arranged sequentially in time; the models may or may not be related to each other. Consecutive regressions are separated by switching points, which are often unknown. The case of unknown switching points is the subject of this study. Noteworthy are the studies by P. Perron and co-authors (see, for example, [1-3]). They propose to use the Bellman and Roth algorithm [4] to estimate switching points by dynamic programming. The developments of Perron and his co-authors were used to solve economic problems. The authors of the present paper have also obtained a number of results on constructing switching regressions. The studies [5, 6] describe methods of estimating switching points from a given sample that make it possible to take into account constraints imposed on the switching points and regression parameters; such constraints follow from a priori information about the process being modeled. They cannot be taken into account when a dynamic programming scheme is used. In some applications (for example, economics and public health), the concept of a fixed observation interval, which is used in [1-3, 5, 6], is not always acceptable because the data are continuously updated. As an example, we will consider the coronavirus infection process, which is of current concern.
In the paper, we propose to analyze the infection dynamics on the basis of a switching regression model and to construct the model step by step in time. The observation interval, whose length is fixed or increases in time, is divided into a sequence of rather short overlapping intervals $I_j$, $j = 1, 2, \ldots$, each of which a priori contains a small number of switching points, for example, no more than two. This considerably simplifies the problem of their estimation. Figure 1 (whose data are taken from [7]) shows the time series of the daily number of coronavirus cases (NCC) in Ukraine since April 12, 2020, when the NCC level showed an evident growth tendency. As is seen from the figure, the rate of NCC variation is not constant: it varies in time not only in value but also in sign. Therefore, it is expedient to describe NCC variation by a linear switching regression. An analysis of the NCC dynamics has shown that it exhibits regular oscillations with a period of one week. This fact should be taken into account when constructing a switching regression. The NCC time series consists of the trend $TR$, regular oscillations $K$, and a random component $E$. The trend is a sequence of straight-line segments separated by switching points; the segments need not join at these points. Regular oscillations are apparently related to labor activity. Random oscillations are caused by a set of minor factors such as coronavirus testing errors and the occurrence of random centers of infection. It is natural to assume an additive structure of the NCC time series, defined by the model
$$y = TR + K + E, \quad (1)$$
where $y$ is the number of cases. Let us determine the values on the right-hand side of (1). First, let us estimate the trend in $y$ by a moving average with the averaging interval equal to seven (the number of days in a week):
$$\bar{y}_t = \frac{1}{7} \sum_{i=t-3}^{t+3} y_i, \quad t = 4, 5, \ldots, \quad (2)$$
where $y_i$ is the NCC on the $i$th day ($t = 1$ corresponds to 04.12.2020).
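The seven-day moving average described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the window is taken to be centred on day $t$ (three days on each side, which is consistent with the series starting at $t = 4$), and the function names `centered_moving_average` and `detrend` are hypothetical, not taken from the paper.

```python
def centered_moving_average(y, window=7):
    """Return {t: average of the `window` observations centred on day t}.

    `y[0]` is the observation for day t = 1 (days are 1-based, as in the text).
    Assumption: the window is centred, so the first and last window // 2 days
    have no value.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centred")
    half = window // 2
    n = len(y)
    return {
        t: sum(y[t - 1 - half : t + half]) / window
        for t in range(half + 1, n - half + 1)
    }


def detrend(y, window=7):
    """Return {t: y_t minus its centred moving average} for days with a full window."""
    trend = centered_moving_average(y, window)
    return {t: y[t - 1] - level for t, level in trend.items()}


# Synthetic example (not the Ukrainian data): a flat level with a weekly pattern.
series = [10, 20, 30, 40, 50, 60, 70] * 2
residuals = detrend(series)
```

With a 14-day series, values are available only for days 4 to 11, since three days are lost at each end.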
According to (1), the difference $u_t = y_t - \bar{y}_t$, $t = 4, 5, \ldots$, is the sum $K + E$ of the regular oscillations and the random component. The value of $u_t$ depends on the day of the week to which day $t$ corresponds. It can be represented as the sum $u_t = K_i + E_t$, $t \in W_i$, $i = 1, \ldots, 7$, and the regular oscillation for day $i$ of the week is estimated by the average
$$\hat{K}_i = \frac{1}{|W_i|} \sum_{t \in W_i} u_t, \quad i = 1, \ldots, 7, \quad (3)$$
where $W_i$ is the set of indices of days in the NCC time series to which day $i$ of the week corresponds and $|W_i|$ is the number of elements in $W_i$. According to (2), six observations are lost when $\hat{K}_i$ is calculated using (3). From (3), we get $\hat{K}_i$ [...]. Due to the symmetry of the regular oscillations with respect to the trend, which can be assumed according to Fig. 1, $\sum_{i=1}^{7} K_i = 0$. Estimates of the regular oscillations obtained from a finite number of observations by (3) will not satisfy this condition. Therefore, it is necessary to modify them to obtain [...]. Let us describe the resulting sum of the trend and the random component on each time interval $I_j$, $j = 1, 2, \ldots$, on which the estimation is carried out, by a switching regression consisting of $k + 1$ linear regressions, the $i$th of which has the form $a^0_{0i} + a^0_{1i} t + \varepsilon_{ti}$ on its own time interval, $i = 1, \ldots, k + 1$ (5). Here, $k = k(j)$ is the number of switching points; $a^0_{0i}$ and $a^0_{1i}$, $i = 1, \ldots, k + 1$, are unknown regression parameters; $t^0_i$, $i = 1, \ldots, k$, are unknown switching points; $t_{0j}$ and $t_{1j}$ are, respectively, the beginning and the end of the interval $I_j$; and $\varepsilon_{ti}$ are random components of the regressions that characterize the influence of minor factors on the NCC. The superscript 0 specifies that the value of the regression parameter or of the switching point is the true one. Hereinafter, to simplify the notation, we omit the subscript $j$ of the interval $I_j$ for $k$, the parameters, and the switching points of the regression. Since the random components in (5) are subsequences of the sequence $E_t$, $t = 1, 2, \ldots$, they satisfy, in accordance with its properties presented above, the standard assumptions of regression analysis: they are uncorrelated with each other, have zero expectations, and have identical variances $\sigma^2$. Let us state the aforesaid as assumptions.
Assumption 1.
Random components of the regressions (5) have the following characteristics (6)-(9): $E\{\varepsilon_{ti}\} = 0$ and $E\{\varepsilon_{ti}^2\} = \sigma^2$ for $t \in \theta_i^0$, $i = 1, \ldots, k + 1$; random components of one regression are uncorrelated: $E\{\varepsilon_{ti} \varepsilon_{si}\} = 0$, $t \neq s$; random components of different regressions are also uncorrelated: $E\{\varepsilon_{ti} \varepsilon_{sl}\} = 0$, $i \neq l$. Here, $\theta_i^0$ is the time interval on which the regression parameters $a^0_{0i}$ and $a^0_{1i}$ are constant.
Assumption 2. Random components of the regressions (5) are normally distributed.
As is seen from Fig. 1, a substantial difference between the NCC on two adjacent days is possible. Therefore, the segments of the straight lines $a^0_{0i} + a^0_{1i} t$ need not form a continuous piecewise linear function of $t$, i.e., this function can be discontinuous at switching points. Properties (6)-(9) of the random components of the regression lead to the following problem of estimating the switching points and the parameters of the switching regression on each estimation interval $I_j$, $j = 1, 2, \ldots$: minimize the sum of squared deviations of the observations from the regression lines (10) subject to the constraint (11) on the switching points. Minimization in the problem (10), (11) is carried out with respect to the parameters $a_{0i}$ and $a_{1i}$, $i = 1, \ldots, k + 1$, which are continuous quantities, and the integer switching points $t_i$, $i = 1, \ldots, k$. Since these quantities vary, their superscript 0 is omitted. To the variable switching points in (10), (11) there correspond intervals $\theta_i$ of constancy of the parameters with variable ends. The constraint (11) establishes the minimum number of observations, namely two, needed to estimate the two parameters of each of the $k + 1$ straight lines on the intervals $\theta_i$, $i = 1, \ldots, k + 1$, of constancy of the switching regression parameters. The problem (10), (11) can be considered as a generalization of the problem of parameter estimation of a constrained nonlinear regression [9, Ch. 1; 10-13]. Since it contains unknown integers, it was solved by a special method [5]. Below, we present the process and results of the solution for overlapping estimation intervals (intervals with a nonempty intersection). The values of $\hat{K}_i$, $i = 1, \ldots, 7$, were found from the NCC observations available up to the end of the respective interval.
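The day-of-week estimates $\hat{K}_i$ mentioned above can be illustrated with a short Python sketch. It averages the detrended values over all days that fall on the same position in the week. Two points are assumptions rather than statements from the paper: the moving-average window is taken to be centred, and the final step, which subtracts the mean of the seven raw estimates so that they sum to zero, is one common way to enforce the zero-sum condition and may differ from the modification the authors used. The function name `weekly_effects` is hypothetical.

```python
def weekly_effects(y, period=7):
    """Estimate day-of-week effects from a daily series.

    `y[0]` is the observation for day t = 1. Returns (raw, centred):
    raw[i] is the mean detrended value over days whose position in the week is i
    (position 0 is the weekday of day 1); centred[i] is raw[i] shifted so that
    the estimates sum to zero (ASSUMPTION: simple mean-centring).
    """
    if period % 2 == 0:
        raise ValueError("period must be odd so that the window can be centred")
    half = period // 2
    n = len(y)
    sums = [0.0] * period
    counts = [0] * period
    for t in range(half + 1, n - half + 1):      # days that have a full window
        trend = sum(y[t - 1 - half : t + half]) / period
        position = (t - 1) % period
        sums[position] += y[t - 1] - trend       # detrended value u_t
        counts[position] += 1
    if 0 in counts:
        raise ValueError("series too short: some weekday has no detrended value")
    raw = [s / c for s, c in zip(sums, counts)]
    shift = sum(raw) / period
    centred = [r - shift for r in raw]
    return raw, centred


# Synthetic example (not the Ukrainian data): linear trend plus a weekly pattern.
pattern = [-3, -2, -1, 0, 1, 2, 3]
series = [5 * t + pattern[(t - 1) % 7] for t in range(1, 29)]
raw, centred = weekly_effects(series)
```

Recomputing the estimates each time the interval grows mirrors the step-by-step inflow of data described in the text.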
For example, when constructing a switching regression on the interval $I_1$, these values were calculated from the observations over this interval, from 04.12.20 to 05.12.20. When constructing the regression on the interval $I_2$, we used observations from 04.12.20 to the end of this interval, 06.21.20. This approach has made it possible, in particular, to simulate the step-by-step inflow of NCC data that actually took place. Note that other approaches, based on discrete optimization methods, can also be proposed for solving the considered problem [14, 15]. According to Fig. 3, on the interval $I_1$ = [04.12.20, 05.12.20], which contains 31 observations, there is no more than one switching point at which the NCC can change (decrease or increase). Adding one more switching point, at which the rate of change probably varies, we obtain $k = 2$ in the estimation problem (10), (11). Its solution gives the estimates of the switching points $\hat{t}_1 = 4$ and $\hat{t}_2 = 19$. The estimated length of the first interval $\theta_1^0$ of constancy of the regression parameters equals four. It is too small, which allows us to advance the null hypothesis $H_0$: $a^0_{01} = a^0_{02}$, $a^0_{11} = a^0_{12}$ (the first two regressions in (5) coincide). Assuming that the second switching point is fixed, we test the formulated hypothesis on the basis of Assumptions 1 and 2 using the criterion from [16]. According to it, we find $S$, the sum of squared deviations of the linear regression with $n$ parameters fitted to the observations on the time interval of length $T$; $S_1$ and $S_2$ are the sums of squared deviations of two other linear regressions with $n$ parameters each, fitted to the same observations on the time intervals of length $T_1$ and $T_2$, respectively, with $T = T_1 + T_2$. The test statistic is
$$F^* = \frac{(S - S_1 - S_2)/n}{(S_1 + S_2)/(T - 2n)}, \quad (13)$$
which under $H_0$ has the $F$ distribution with $(n, T - 2n)$ degrees of freedom. [...] As a result, we obtained a switching regression whose regression line is a linear spline (see Fig. 3). The sum of squared deviations (10) for this case, $S = 42559.85$, increased only slightly after the constraint was added to the estimation problem.
This indicates that the above-mentioned gap between the two straight lines at $\hat{t}_1$ was due to random factors. The approximate accuracy analysis of the parameter estimates of the obtained spline was based on the assumption that the obtained switching point coincides with the true one. Such an assumption is not a severe constraint for the problem under study, since the independent variable is time rather than some complex function. Then, according to [8, Ch. 15], we consider the two obtained regressions as one regression whose variable parameters are linked by Eq. (14), where $t_1 = \hat{t}_1$ is fixed. The covariance matrix of the regression parameter estimates is then defined by the expression [...] (15), where the prime denotes transposition, [...]. In the case under study, $X = \mathrm{diag}(X_1, X_2)$ [...], where $O_{1m}$ is the row with $m$ columns, [...]. For $m = 0$, $O_{10}$ means that the row is absent. If the straight lines do not join at some switching points, then the corresponding rows of the matrix $G$ are deleted. In the case of a zero matrix $G$, which corresponds to the absence of constraints on the regression parameters, (15) yields [...] (16). The significance of the estimates of the regression parameters for fixed switching points was determined on the basis of (15) and Assumptions 1 and 2 by well-known methods. The absence of correlation between the random components of the regression was checked by the Durbin-Watson criterion, and their normality by the criterion based on the skewness and kurtosis coefficients [17, Sec. 3.2]; for its application see [18, Sec. 9.3]. For the case $k = 1$, the estimate $\hat{V}$ of the covariance matrix was found from formula (15) by replacing $\sigma^2$ with its estimate $s^2 = 1464$. The estimates of the spline parameters turned out to be significant at the 5% level, except for the estimate $\hat{a}_{12} = -1.374$. Therefore, the estimation problem (10), (11) with the added constraints (14) and $a_{12} = 0$ was solved.
As a result, we obtained a spline that describes a continuous transition from NCC growth to a plateau, i.e., a horizontal section (see Fig. 3), with highly significant nonzero estimates of its parameters: $\hat{a}_{01} = 340.69$, $\hat{a}_{11} = 11.56$, and $\hat{a}_{02} = 467.81$. The estimate of the covariance matrix $V$ was calculated from the initial data according to (16), where the matrix $X_2$ is a column of ones. The beginning of the interval $I_2$ = [04.23.20, 06.21.20] coincides with the beginning of the plateau, i.e., with the switching point found on $I_1$ plus 1. On this interval, the NCC may leave the plateau in either direction (increase or decrease), and a second change in the direction of the NCC dynamics is also possible. Therefore (as in Sec. 2), we suppose in the estimation problem that $k = 2$. Let us set $t_{02} = 1$, $t_{12} = 60$. Thus, the time reference in $I_2$ begins with one. Solving the estimation problem for $I_2$, we obtain estimates of two switching points and of the regression parameters [...] (18) (figures in brackets give the significance of the parameter, which was determined in the same way as in Sec. 2). According to (18), the plateau ended at the point $t = 37$. It began at $t = 1$ and has a small downward slope, with a significant slope estimate of -3.335. According to Sec. 2, this slope is insignificant and its estimate is -1.374; the discrepancy can be explained by the large amount of plateau data on the interval $I_2$. According to (18), the straight lines of the second and third regressions have slopes of identical sign, and the slope of the second straight line is insignificant at the 1% level. Therefore, the null hypothesis about the equality of the parameters of these two regressions was advanced and tested with the help of the statistic (13), where $T = 23$ and $n = 2$. It was obtained that $F^* = 4.97$. Since $F_{0.05}(2, 19) = 3.52$, the null hypothesis was rejected.
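The hypothesis test applied above can be sketched as follows. The sketch assumes that the statistic (13) has the standard form of the test in [16], $F^* = \frac{(S - S_1 - S_2)/n}{(S_1 + S_2)/(T - 2n)}$, which matches the degrees of freedom $(2, 19)$ quoted for $T = 23$ and $n = 2$. The function names `line_ssr` and `chow_f` are hypothetical, and the example data are synthetic.

```python
def line_ssr(ts, zs):
    """Sum of squared residuals of the least-squares line fitted to (ts, zs)."""
    n = len(ts)
    t_mean = sum(ts) / n
    z_mean = sum(zs) / n
    stt = sum((t - t_mean) ** 2 for t in ts)
    stz = sum((t - t_mean) * (z - z_mean) for t, z in zip(ts, zs))
    slope = stz / stt
    intercept = z_mean - slope * t_mean
    return sum((z - intercept - slope * t) ** 2 for t, z in zip(ts, zs))


def chow_f(ts, zs, first_len, n_params=2):
    """F statistic for equality of two straight-line regressions.

    The first `first_len` observations form the first regression, the rest the
    second. Returns (F, (df1, df2)). ASSUMPTION: standard form of the test in [16].
    """
    total = len(ts)
    if first_len < 2 or total - first_len < 2:
        raise ValueError("each regression needs at least two observations")
    df2 = total - 2 * n_params
    if df2 <= 0:
        raise ValueError("not enough observations for the test")
    s_pooled = line_ssr(ts, zs)
    s_split = line_ssr(ts[:first_len], zs[:first_len]) + line_ssr(
        ts[first_len:], zs[first_len:]
    )
    if s_split == 0:
        raise ValueError("separate fits are exact; the statistic is undefined")
    f_value = ((s_pooled - s_split) / n_params) / (s_split / df2)
    return f_value, (n_params, df2)


# Synthetic example: a jump in level between the two halves gives a large F.
days = [1, 2, 3, 4, 5, 6]
f_break, df = chow_f(days, [0, 1, 0, 10, 11, 10], first_len=3)
f_same, _ = chow_f(days, [0, 1, 0, 0, 1, 0], first_len=3)
```

The computed value is compared with a tabulated critical value for the returned degrees of freedom, as in the text, where $F^* = 4.97$ exceeds $F_{0.05}(2, 19) = 3.52$.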
Let us now predict the NCC on the basis of the switching regression obtained on the considered interval, which means extrapolating the straight line of the third regression. From here, we get the pointwise prediction of the NCC [...], where the prime denotes transposition. Due to Assumptions 1 and 2, we get the interval prediction [...], where $u_q(p)$ is the $100p\%$ point of the Student distribution with $q = T - (k + 1)n$ degrees of freedom and $\hat{\sigma}_f(t)$ is the estimate of the standard deviation of the prediction. We will find it based on (1); as a result, we [...], $t > T$. This equality and (19) yield the prediction error: [...]. The average prediction error on the time interval [61, 66], where only one NCC value fell outside the confidence interval, was determined as the ratio of the estimate of the root-mean-square deviation of the prediction to the average true value of the NCC on the interval [61, 66]. It was 9.9%. As calculations have shown, the prediction error can be reduced to 7.4% if the estimate $\hat{a}_{03}$ is made significant, i.e., if its accuracy is increased. This can be attained by joining the second and third regressions at the point $t = 48$. For comparison, the average relative error of the eight-day prediction for the model obtained on the interval $I_1$ was 19%. Such a large error can be explained by the low accuracy of estimating the regular oscillations of the NCC from as few as 31 observations. [...] where the notation in brackets has the same meaning as in (18). According to (22), all the slopes except for $\hat{a}_{13}$ are highly significant; a sharp decrease began at the point $\hat{t}_1$ and was followed by a slow decrease from $\hat{t}_2$. In view of the proximity of these points, the hypothesis about the equality of the parameters of the second and third regressions was tested, assuming that $\hat{t}_1 = 63$ is approximately known, since the sharp decrease of the NCC began in its neighborhood.
According to the criterion (13), this hypothesis was rejected at the 5% level: $F^* = 5.36$, $F_{0.05}(2, 9) = 4.26$. The sharp decrease in the NCC may be due to the elimination of one or several centers of infection. The insignificance of $\hat{a}_{13}$ can be explained by the fact that the infection dynamics probably reached the next plateau for a short while.
In the paper, we have considered a model of the dynamics of coronavirus cases in the form of a switching regression whose switching points are unknown. We have proposed a stepwise solution of the problem of constructing it. First, at each step, we solved the estimation problem for two switching points based on observations on some small time interval $I_j$, $j = 1, 2, 3$. Then we carried out a statistical analysis of the constructed part of the regression. Working on small intervals considerably simplified the estimation problem. As a result, the required regression was determined on three sequential time intervals [...] and $I_3$. Overall, the regression line was obtained as a piecewise linear function of time, since the endpoints of these intervals coincide with the last switching point of the previous interval plus one or with the ends of the observation interval (see Fig. 1). The use of prediction makes it possible to obtain a satisfactory approximation to the NCC if the process is between two switching points and to determine whether it has entered a domain where a new switching point may occur. The latter is important in making a decision to tighten or ease the quarantine and related measures. As is shown in the paper, without a priori information about the number of switching points, five such points are found, and the pattern of NCC variation (growth, decay, stabilization) to the right of these points is determined from the small number of observations available in this domain. Thus, switching regression makes it possible not only to obtain a short-term prediction of the dynamics of the epidemic but also to promptly determine the trend of its development.
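The per-interval estimation step summarized above can be illustrated by an exhaustive least-squares search over candidate switching points. This is a minimal sketch and not the special method of [5] used in the paper: it assumes the weekly component has already been removed from the series, fits each straight line independently (so the lines may be discontinuous at the switching points), and requires at least two observations per line. The function names `fit_line` and `best_switching_points` are hypothetical.

```python
from itertools import combinations


def fit_line(ts, zs):
    """Least-squares line through (ts, zs): returns (intercept, slope, ssr)."""
    n = len(ts)
    t_mean = sum(ts) / n
    z_mean = sum(zs) / n
    stt = sum((t - t_mean) ** 2 for t in ts)
    stz = sum((t - t_mean) * (z - z_mean) for t, z in zip(ts, zs))
    slope = stz / stt
    intercept = z_mean - slope * t_mean
    ssr = sum((z - intercept - slope * t) ** 2 for t, z in zip(ts, zs))
    return intercept, slope, ssr


def best_switching_points(z, k, min_len=2):
    """Brute-force search for k switching points in the series z (z[0] is t = 1).

    A switching point p means one line covers days up to and including p and
    the next line starts at p + 1. Returns (total_ssr, points, lines), where
    lines is a list of (intercept, slope) pairs, or None if no split is feasible.
    """
    if min_len < 2:
        raise ValueError("each line needs at least two observations")
    n = len(z)
    ts = list(range(1, n + 1))
    best = None
    for points in combinations(range(1, n), k):
        bounds = (0,) + points + (n,)
        if any(b - a < min_len for a, b in zip(bounds, bounds[1:])):
            continue                      # violates the minimum segment length
        fits = [fit_line(ts[a:b], z[a:b]) for a, b in zip(bounds, bounds[1:])]
        total = sum(fit[2] for fit in fits)
        if best is None or total < best[0]:
            best = (total, points, [(a0, a1) for a0, a1, _ in fits])
    return best


# Synthetic example: growth, then a jump to a slowly decreasing level.
series = [2 * t for t in range(1, 11)] + [50 - t for t in range(11, 21)]
total, points, lines = best_switching_points(series, k=1)
```

The number of candidate splits grows roughly as the number of observations raised to the power $k$, which is one reason why keeping $k$ small on short intervals is convenient.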
Note that even with only 87 observations available at the very beginning of the research, finding all the switching points simultaneously when their number is unknown would have been a challenge. The idea of stepwise estimation has simplified the solution.

Estimating and testing linear models with multiple structural changes
Computation and analysis of multiple structural change models
Structural Breaks in Time Series
Curve fitting by segmented straight lines
Constructing a switching regression with unknown switching points
Continuous-time switching regression method with unknown switching points
Coronavirus Infection (COVID-19)
Estimation of regression model parameters with specific constraints
Estimation of reliability parameters under incomplete primary information
Mathematical models and methods of risk assessment in ecologically hazardous industries
Method of empirical means in stochastic programming problems
Nonparametric estimate of almost periodic signals
Problems of discrete optimization: Challenges and main approaches to solve them
Estimating the size of correcting codes using extremal graph problems
Tests of equality between sets of coefficients in two linear regressions
Theoretical Statistics