key: cord-0059987-652hfypl authors: Klyushin, D. A. title: Nonparametric Analysis of Tracking Data in the Context of COVID-19 Pandemic date: 2020-07-29 journal: Big Data Analytics and Artificial Intelligence Against COVID-19: Innovation Vision and Approach DOI: 10.1007/978-3-030-55258-9_3 sha: 1734cf1f9fb344eaf4ac50ce005f2c726f939470 doc_id: 59987 cord_uid: 652hfypl Methods of statistical pattern recognition are powerful tools for the analysis of statistical data on the COVID-19 pandemic. In this chapter, we propose a new, effective online algorithm for detecting change-points in tracking data (movement data, health-rate data, etc.). We develop a nonparametric test of the statistical hypothesis that data in two adjacent time intervals have the same distribution. In the context of mobile-phone tracking, this means that the coordinates of the tracked object do not deviate significantly from the base point; for health-rate estimation, it means the absence of significant deviations from the norm. The significance level of the test is less than 0.05, and the test permits ties in samples. We also present the results of a comparison of the test with the well-known Kolmogorov–Smirnov test; these results show that the proposed test is more robust, sensitive, and accurate than the alternatives. In addition, the new method does not require high computational capacity. In the midst of the COVID-19 pandemic, an avalanche of information is crashing down on scientists and decision makers. Among this huge volume of data, some of the most important are the geolocation data of millions of mobile-phone users in different regions and the health data obtained from wearable devices. These data allow tracking the movements of isolated patients and preventing their uncontrolled walks. They also allow detecting significant changes in a user's health indicators (temperature, pulse, etc.).
Since these data have a random nature, a high arrival rate, and high volatility, it is impossible to process and analyze them correctly by hand, and therefore to make the right decisions based on them, without automatic analysis. In this regard, the development of artificial-intelligence methods for the analysis of sequential data on isolated patients' movements and of health data from wearable devices becomes an urgent task. In particular, highly effective methods have been developed for decision making during the COVID-19 pandemic [1], for diagnosis and prediction of COVID-19 patients' response to treatment [2], for forecasting the coronavirus outbreak [3, 4], and for the analysis of computed-tomography images of COVID-19 patients [5]. However, a search of current publications in the preprint repositories arXiv and medRxiv shows that there are as yet no papers on the analysis of data obtained from wearable gadgets. Storing a huge amount of information is one of the main problems of big data, and offline algorithms that require the complete time series become impractical in such cases. Therefore, we must use online algorithms of time-series analysis for the detection of change points. In the context of tracking mobile phones and wearable devices, a change in the time series means a deviation of the data from a reference point. It is convenient to use a sliding window to analyze such data. However, this approach has two drawbacks. First, the sliding-window size should be small enough to avoid data-storage problems. Second, because of the small window size, the analysis algorithms must be quite sensitive and able to recognize changes from small samples. In this work, we develop such an algorithm and show its advantages over an alternative (the Kolmogorov–Smirnov statistic). In this chapter we propose a novel nonparametric test for checking the statistical hypothesis that two given samples belong to the same distribution, i.e. that they are homogeneous.
This test is based on a measure of sample homogeneity, which is used to reject the hypothesis that the samples are identically distributed at a given significance level. Change-point detection is equivalent to detecting a change in the distribution of data in adjacent intervals; therefore, we reformulate the problem as a two-sample homogeneity problem and solve it using the sliding-window approach. The motivation of this chapter is to describe a new approach to change-point detection in time series and to demonstrate its application in the context of the COVID-19 pandemic (analysis of geolocation tracking data, detection of changes in patients' health data, etc.). Currently, one of the most popular online statistical methods for solving such problems is the Kolmogorov–Smirnov test [6–10]. This nonparametric test is simple and effective when the samples do not overlap or overlap only slightly, but it is very sensitive to outliers and may produce a high rate of false-positive responses when the samples overlap strongly. The novelty of the proposed approach is that it is based on the Klyushin–Petunin nonparametric test for sample homogeneity, which allows very effective comparison not only of non-overlapping samples but also of strongly overlapping ones. This is the first application of the Klyushin–Petunin test to change-point detection in time series, and, as will be shown, it provides a more effective estimate of the homogeneity of overlapping samples than the Kolmogorov–Smirnov test. The contribution of this chapter is a fast, accurate, and robust online algorithm for detecting change-points in time series, which may be an effective tool for detecting violations of quarantine regimes, monitoring social distancing, and remotely monitoring the health data of patients suffering from COVID-19. The chapter is organized as follows. Section 2 describes the homogeneity measure for samples without ties.
Section 3 describes the homogeneity measure for samples with ties. Section 4 contains the results of experiments with samples from different distributions (normal, lognormal, uniform, and gamma) with different degrees of overlap (from completely overlapping to disjoint samples), with and without ties. Section 5 summarizes the chapter. Consider samples u = (u_1, u_2, ..., u_n) ∈ G_1 and v = (v_1, v_2, ..., v_n) ∈ G_2 from populations G_1 and G_2 with absolutely continuous distribution functions F_1 and F_2. The null hypothesis states that F_1 = F_2, and the alternative hypothesis states that F_1 ≠ F_2. There are several categories of criteria for testing such hypotheses: permutation criteria, rank criteria, randomization criteria, and distance criteria. In addition, these tests are divided into universal tests that are valid against any pair of alternatives (for example, the Kolmogorov–Smirnov criterion [11, 12]) and criteria that are valid against pairs of alternatives of a particular class (Dickson [13], Wald and Wolfowitz [14], Mathisen [15], Wilcoxon [16], Mann–Whitney [17], Wilks [18], etc.). They can also be divided into two large groups: nonparametric and conditionally nonparametric. Nonparametric criteria test the hypothesis of the homogeneity of general populations without any distributional assumptions [11–18]. Conditionally nonparametric criteria (Pitman [19], Lehmann [20], Rosenblatt [21], Dwass [22], Fisz [23], Barnard [24], Birnbaum [25], Jockel [26], Allen [27], Efron and Tibshirani [28], Dufour and Farhat [29]) use some assumptions on the distributions. According to Hill's assumption A(n) [30], if random values u_1, u_2, ...
, u_n ∈ G are exchangeable and drawn from an absolutely continuous distribution, then

p(u ∈ (u_(i), u_(j))) = (j − i)/(n + 1),

where j > i, u ∈ G is a sample value, and (u_(i), u_(j)) is the interval formed by the i-th and j-th order statistics. This assumption was proved in papers of Yu. I. Petunin et al. for independent identically distributed random values [31] and for exchangeable identically distributed random values [32]. On this basis, a nonparametric test for the homogeneity of samples without ties was developed [33]; later, this test was extended to samples with ties [34]. These tests estimate the homogeneity of the samples u_1, u_2, ..., u_n and v_1, v_2, ..., v_n under a strong random experiment and do not depend on their distribution. Suppose that F_1 = F_2 and construct the variational series u_(1), u_(2), ..., u_(n). Denote by A_ij^(k) the event v_k ∈ (u_(i), u_(j)). According to Hill's assumption, for j > i

p(A_ij^(k)) = (j − i)/(n + 1),

and let h_ij^(n,k) denote the relative frequency of the occurrence of the event A_ij^(k) in n trials. Then construct the confidence interval I_ij^(n) = (p_ij^(1), p_ij^(2)) for this probability, with a significance level defined by the parameter g. Note that if g = 3, then the significance level of I_ij^(n) is less than 0.05 [33]. Also, the value of the p-statistics depends only very weakly on the choice of the confidence interval for a binomial proportion [35]. Let B_ij be the event (j − i)/(n + 1) ∈ I_ij^(n). Put N = n(n − 1)/2 and let L be the number of pairs (i, j), i < j, for which B_ij occurs. Then h = L/N is a homogeneity measure of the samples u and v, which we shall call the p-statistics. Treating h as a relative frequency in N trials and taking g = 3, construct the Wilson (for definiteness) confidence interval I_n = (p_1, p_2) for p(B). The confidence intervals I_ij^(n) = (p_ij^(1), p_ij^(2)) and I_n = (p_1, p_2) are called intervals based on the 3s-rule. The scheme of trials in which the events A_ij^(k) arise when the hypothesis of identical distributions holds is called a generalized Bernoulli scheme [36–38]. If the hypothesis does not hold, this scheme is called a modified Bernoulli scheme.
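Hill's assumption A(n) can be illustrated by a small Monte Carlo simulation. The sketch below (function name and parameters are our own illustrative choices) draws an i.i.d. sample of size n, draws one extra value u*, and estimates the probability that u* falls between the i-th and j-th order statistics; since the probability (j − i)/(n + 1) is distribution-free, sampling from the uniform distribution suffices.

```python
import random

def hill_coverage(i, j, n, trials=20000, seed=1):
    """Monte Carlo estimate of P{u* in (u_(i), u_(j))} for a fresh
    draw u* and an i.i.d. sample of size n from U(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        u = sorted(rng.random() for _ in range(n))  # order statistics
        u_star = rng.random()                       # independent new value
        if u[i - 1] < u_star < u[j - 1]:
            hits += 1
    return hits / trials

# Hill's A(n) predicts the probability (j - i)/(n + 1).
est = hill_coverage(i=3, j=8, n=10)
exact = (8 - 3) / (10 + 1)
```

With 20,000 trials the estimate agrees with the exact value (5/11 ≈ 0.4545) to within sampling noise.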
In the general case, when the null hypothesis can be either true or false, the trial scheme is called an MP-scheme (Matveychuk–Petunin scheme). If F_1 = F_2, lim_{n→∞} (j − i)/(n + 1) ∈ (0, 1), and lim_{n→∞} i/(n + 1) ∈ (0, 1), then the asymptotic significance level β of a sequence of confidence intervals I_ij^(n) based on the 3s-rule is less than 0.05 [33]. Let B_1, B_2, ... be a sequence of events that may arise in a random experiment E, let h_{n_1}(B_1), h_{n_2}(B_2), ... be the corresponding relative frequencies, and let k/n_k → 0 as k → ∞. We call the experiment E a strong random experiment if h_{n_k}(B_k) → p* as k → ∞. In a strong random experiment, the asymptotic significance level of the Wilson confidence interval I_n with g = 3 is less than 0.05. The test of the null hypothesis F_1 = F_2, with a significance level less than 0.05, is as follows: if I_n contains 0.95, the null hypothesis is accepted; otherwise it is rejected. Unfortunately, the main assumption of the test based on the p-statistics is that the distributions F_1 and F_2 are absolutely continuous, i.e. the samples must not contain repetitions. Therefore, it is necessary to extend the above ideas to samples with repetitions, i.e. ties. Consider a sample u = (u_1, u_2, ..., u_n) from a distribution F obtained in a strong random experiment. A tie is a sample value u_k that has duplicates; its multiplicity t(u_k) is the number of duplicates of u_k in u. If the distribution function is absolutely continuous and the measurements are exact, the probability of a tie in the sample is zero. In practice, however, sample values are measurements of a quantitative feature taken with restricted accuracy, so a sample may contain ties. We propose a modification of the p-statistics that can estimate the homogeneity of samples containing ties. Consider a sample u = (u_1, u_2, ...
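The construction above admits a compact implementation. The following sketch (function names are ours, not the authors' reference code) computes the p-statistics of two tie-free samples of equal size and applies the decision rule: reject homogeneity when the Wilson interval I_n with g = 3 does not contain 0.95.

```python
import math
import random

def wilson_interval(h, n, g=3.0):
    """Wilson confidence interval for a binomial proportion;
    g = 3 corresponds to the 3s-rule (significance level < 0.05)."""
    denom = 1.0 + g * g / n
    center = h + g * g / (2.0 * n)
    half = g * math.sqrt(h * (1.0 - h) / n + g * g / (4.0 * n * n))
    return (center - half) / denom, (center + half) / denom

def p_statistics(u, v, g=3.0):
    """Klyushin-Petunin homogeneity measure for tie-free samples:
    the fraction of pairs (i, j), i < j, whose Wilson interval for the
    frequency of v_k in (u_(i), u_(j)) covers (j - i)/(n + 1)."""
    n = len(u)
    us = sorted(u)
    L, N = 0, 0
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            N += 1
            h_ij = sum(us[i - 1] < vk < us[j - 1] for vk in v) / n
            p1, p2 = wilson_interval(h_ij, n, g)
            if p1 <= (j - i) / (n + 1) <= p2:
                L += 1
    return L / N

def kp_test(u, v, g=3.0):
    """Reject the null hypothesis (return True) if the Wilson interval
    for the p-statistics over N = n(n-1)/2 pairs does not contain 0.95."""
    h = p_statistics(u, v, g)
    n_pairs = len(u) * (len(u) - 1) // 2
    p1, p2 = wilson_interval(h, n_pairs, g)
    return not (p1 <= 0.95 <= p2)

# Demo: homogeneous vs clearly separated samples (illustrative seed).
rng = random.Random(0)
u = [rng.random() for _ in range(40)]
v = [rng.random() for _ in range(40)]
far = [x + 10.0 for x in u]
h_same = p_statistics(u, v)
h_far = p_statistics(u, far)
```

For two samples of size 40 from the same distribution, h lies close to 1, whereas for disjoint samples it drops sharply and the test rejects.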
, u_n) drawn from a population G with an absolutely continuous distribution function. If the sample values u_k are measured absolutely precisely, we call the sample u = (u_1, u_2, ..., u_n) hypothetical. If the sample values ũ_k ∈ ũ are approximations of the elements of the hypothetical sample u, we call the sample ũ = (ũ_1, ũ_2, ..., ũ_n) empirical, and we call the population G̃ the empirical population corresponding to G. As a rule, real empirical samples contain ties. Denote by P_α the flooring of a value to α decimal digits: P_α(x) = ⌊10^α x⌋ / 10^α. Let u_(1) < u_(2) < ... < u_(n) and ũ_(1) < ũ_(2) < ... < ũ_(n) be the variational series of the hypothetical and empirical samples. For a sample value u* drawn from G independently of u, the following equation holds [30]:

p(u* ∈ (u_(k), u_(k+1))) = 1/(n + 1),

where k = 0, 1, ..., n, u_(0) = −∞, and u_(n+1) = ∞. This formula may be extended to empirical samples [34].

Theorem. If the distribution F is differentiable and Lipschitz continuous, i.e. |F(x) − F(y)| ≤ K|x − y|, the sample value u* is independent of u, and the order statistic ũ_k = P_α(u_(i)), k ≤ i, of the empirical sample ũ = P_α(u) = (ũ_1, ũ_2, ..., ũ_n) is a tie with multiplicity t(ũ_k), then the interval probabilities acquire correction terms that depend on the multiplicities (the explicit formulas are given in [34]).

Corollary. Up to the rounding error, if i < j and 1 ≤ i, j ≤ n, then p(ũ* ∈ (ũ_(i), ũ_(j))) equals (j − i)/(n + 1) corrected by terms γ_l, i ≤ l ≤ j − 1, determined by the tie multiplicities.

Remark. If ũ_(l), i ≤ l ≤ j − 1, is not a tie, then γ_l = 0 and the corrected formula reduces to the tie-free one.

Denote by H the null hypothesis that the absolutely continuous distribution functions F_1 and F_2 of the populations G_1 and G_2 are identical. Consider ũ = (ũ_1, ..., ũ_n) ∈ G̃_1 and ṽ = (ṽ_1, ..., ṽ_n) ∈ G̃_2, and let ũ_(1) ≤ ... ≤ ũ_(n) and ṽ_(1) ≤ ... ≤ ṽ_(n) be their variational series, where G̃_1 and G̃_2 are the empirical populations corresponding to G_1 and G_2. Suppose that F_1 = F_2 and denote by A_ij^(k), k = 1, 2, ..., n, the random event ṽ_k ∈ (ũ_(i), ũ_(j)). Computing the tie-corrected relative frequencies h_ij^(n) and putting n = N and g = 3, we obtain the confidence interval I^(n) = (p^(1), p^(2)) for the probability p(B) = 1 − β.
The statistic h^(n) is called the empirical p-statistics; it estimates the homogeneity of the samples ũ and ṽ. Let us divide the time series into portions of size n and consider two windows of length n. In this experiment we use samples generated from a given distribution. If the samples x_1, x_2, ..., x_n and x_(n+1), x_(n+2), ..., x_(2n) are heterogeneous, i.e. they have different distributions, then the point x_(n+1) is a change point, that is, the tracked data deviate from the reference point significantly (Fig. 1). The windows must have equal sizes but may have different starting points. Initially we select two windows with the same left end; one window is fixed and the second moves with step l (for example, l = 1), overlapping with the first window. Alternatively, we can consider adjacent non-overlapping windows. Hereinafter, N(m, v) is a normal distribution with expected value m and variance v; LN(m, v) is a lognormal distribution with expected value m and variance v; U(a, b) is the uniform distribution on the interval (a, b); and G(a, b) is a two-parameter gamma distribution. For comparison we selected the p-statistics and the widely used Kolmogorov–Smirnov statistic. The width of the sliding window is 40 (due to the requirement that the significance level of the Klyushin–Petunin test be less than 0.05), and the sliding step is 1. The results were averaged over 10 experiments and demonstrate the high effectiveness of the proposed method (see Figs. 2, 3, 4 and 5). Here we selected the distribution parameters so as to demonstrate the deviation of the expected values of the samples from the reference point, but the same experiments may be conducted with equal means and various standard deviations; as was shown in [33], the p-statistics has a great advantage in these cases.
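The sliding-window scheme just described can be sketched as follows. For brevity, this sketch scores each pair of adjacent windows with the two-sample Kolmogorov–Smirnov statistic (the baseline used for comparison in this section); the p-statistics of Sect. 2 could be substituted as the score. The threshold 0.8 and the upward shift of 2.0 in the demo are our own illustrative choices, not values from the chapter.

```python
import random

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: sup_t |F_x(t) - F_y(t)|."""
    grid = sorted(set(x) | set(y))
    return max(abs(sum(v <= t for v in x) / len(x)
                   - sum(v <= t for v in y) / len(y)) for t in grid)

def detect_change(series, width, threshold):
    """Slide two adjacent windows of equal width over the series and
    return the first index where the score exceeds the threshold."""
    for t in range(width, len(series) - width + 1):
        left, right = series[t - width:t], series[t:t + width]
        if ks_statistic(left, right) > threshold:
            return t  # index of the suspected change point
    return None

# Demo: 50 values from U(0, 1), then 50 values shifted upward by 2.0;
# the true change point is at index 50.
rng = random.Random(42)
series = ([rng.random() for _ in range(50)]
          + [rng.random() + 2.0 for _ in range(50)])
t_hat = detect_change(series, width=10, threshold=0.8)
```

Because the two regimes are disjoint, the KS statistic between the windows straddling index 50 reaches its maximum value of 1.0, so detection is guaranteed no later than that index.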
As we see, the curves of the p-statistics and the Kolmogorov–Smirnov statistic decrease from the upper left corner (identical samples, implying a high homogeneity measure) toward the lower right corner (very different samples, implying a low homogeneity measure). Sensitivity is estimated by the slope of these curves (a steeper curve means a more sensitive homogeneity measure); monotonicity is estimated visually. The results show that the p-statistics is a monotonic, sensitive, and robust homogeneity measure. Since the sliding-window step is 1, at every time step the first sample becomes more "contaminated" by values of the second sample. Thus, we cover variants with different degrees of overlap between the samples: from complete coincidence to disjointness. In many variants of distributions the Kolmogorov–Smirnov statistic, as opposed to the p-statistics, demonstrates non-monotonic and non-robust behavior. There is one more way to estimate the sensitivity of the test statistics. Recall that in the Kolmogorov–Smirnov test the null hypothesis of identical distribution functions is rejected if the p-value is less than 0.05; in the Klyushin–Petunin test, the null hypothesis is rejected when the confidence interval for the p-statistics does not contain 0.95. As we see in Figs. 2 and 4 (shifted normal and lognormal distributions), the p-statistics rejects the null hypothesis starting from the halfway point, while the Kolmogorov–Smirnov statistic rejects it starting from the one-quarter point. Therefore, in the cases of shifted normal and lognormal distributions the Kolmogorov–Smirnov test is more sensitive. On the other hand, Figs. 3 and 5 demonstrate the advantage of the p-statistics when the distributions are different but overlap significantly.
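The asymmetry noted above (the Kolmogorov–Smirnov statistic reacting strongly to shifts but more weakly to overlapping same-mean alternatives) can be illustrated numerically. In the sketch below the distribution parameters are our own illustrative choices, not those of Figs. 2–5: the population KS distance between N(0, 1) and N(0.7, 1) is about 0.27, while between N(0, 1) and N(0, 4), which share the same mean, it is only about 0.16.

```python
import random

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: sup_t |F_x(t) - F_y(t)|."""
    grid = sorted(set(x) | set(y))
    return max(abs(sum(v <= t for v in x) / len(x)
                   - sum(v <= t for v in y) / len(y)) for t in grid)

rng = random.Random(7)
base = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(0.7, 1.0) for _ in range(1000)]  # different mean
scaled = [rng.gauss(0.0, 2.0) for _ in range(1000)]   # same mean, sd = 2

d_shift = ks_statistic(base, shifted)
d_scale = ks_statistic(base, scaled)
# d_shift comes out noticeably larger than d_scale, even though both
# alternatives differ from the base distribution.
```

This is consistent with the observation that the KS test favors location alternatives, which is where permutation-based measures such as the p-statistics can offer an advantage.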
As before, the p-statistics rejects the null hypothesis starting from the halfway point, but in the cases of overlapping uniform and gamma distributions the Kolmogorov–Smirnov test cannot distinguish them. We may conclude that the Kolmogorov–Smirnov statistic is preferable when the distributions are shifted (i.e. have different expected values), while the p-statistics is useful in the same cases but outperforms the Kolmogorov–Smirnov test when the distributions overlap significantly (for example, normal distributions with the same expected value and different variances, or uniform distributions on overlapping intervals). This may be explained by the different nature of the two tests. The Kolmogorov–Smirnov test detects the difference between empirical distribution functions and is more sensitive to shifts of distributions and to outliers. The p-statistics, instead, uses information on permutations and can detect both shifts and changes in the shapes of distributions while remaining robust to outliers. Next, let us consider two quasi-real experiments imitating the deviation of a tracked object from the reference point. A real experiment would require real data, which are private and inaccessible, so we demonstrate the capabilities of the proposed test on artificial data. Case 1 is illustrated by Fig. 1. We simulate a time series consisting of two slightly overlapping samples: the first sample consists of 50 random values from the uniform distribution U(0, 1), and the second consists of 50 random values from U(0, 1) plus 0.7. We compare two consecutive sliding windows of width 10 with sliding step 10. The largest difference is expected between the 5th and 6th segments (5th row of Table 1).
Indeed, in this case we obtain the minimal value of the p-statistics (0.8444), and according to the Klyushin–Petunin test [33] the null hypothesis of homogeneity of the samples is rejected, because the confidence interval for the p-statistics does not contain 0.95. The Kolmogorov–Smirnov test demonstrates over-conservativeness: according to its p-value, the null hypothesis is rejected in the first, second, fourth, sixth, and eighth cases, even though by assumption the first five segments and the last five segments were each drawn from a single distribution. To demonstrate the difference between the empirical p-statistics and the original p-statistics, let us consider the case of samples rounded to two decimal digits. As we see, the original p-statistics overestimates the homogeneity of the rounded samples (see Table 2), while the Kolmogorov–Smirnov statistic demonstrates the same behavior as in Case 1. These results may be explained by the small sample size, so we must consider a case with larger samples. In Case 2 we simulate a time series consisting of ten alternating samples: the odd samples consist of 50 random values from the uniform distribution U(0, 1) and the even samples consist of 50 random values from U(0, 1) plus 0.7. Now the width of the sliding window is 50 and the sliding step is 50. As we see in Table 3, all statistics demonstrate stable results: all pairs of segments are recognized as heterogeneous. Therefore, as a practical recommendation, we may state that the sample size should exceed 40. The new online algorithm implementing the Klyushin–Petunin test on streaming geolocation data is effective, sensitive, and robust. It does not depend on assumptions about the distributions of the samples and is equally sensitive to differences between expected values (location hypothesis) and shapes of the distributions (scale hypothesis). Its significance level is less than 0.05, and it does not require special conditions for storing the data.
The algorithm effectively solves the change-point problem and is more sensitive than the alternative Kolmogorov–Smirnov test when the sample size is small (less than 40). The experiments with samples from normal, lognormal, uniform, and gamma distributions, with parameters describing different degrees of overlap, show that the p-statistics is more stable, sensitive, and monotonic than the Kolmogorov–Smirnov statistic. The Kolmogorov–Smirnov statistic demonstrates non-monotonic behavior due to its sensitivity to outliers. It does allow detecting the change-point in a data stream, but the p-statistics has better accuracy and robustness, because it is effective both for shifted samples drawn from distributions with different means and for samples with large overlap. The main features of the p-statistics are its monotonic dependence on the degree of overlap between the samples (the greater the overlap, the larger the p-statistics, and vice versa) and a sharp jump at the change-point. The future scope of this work is to apply the test to real data obtained during the COVID-19 pandemic, to decrease its computational complexity, and to extend it to multivariate time series.
1. Composite Monte Carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction
2. Diagnosis and prediction model for COVID-19 patients' response to treatment based on convolutional neural networks and whale optimization algorithm using CT images
3. Finding an accurate early forecasting model from small dataset: a case of 2019-nCoV novel coronavirus outbreak
4. Day level forecasting for coronavirus disease (COVID-19) spread: analysis, modeling and recommendations
5. Harmony-search and Otsu based system for coronavirus disease (COVID-19) detection using lung CT scan images
6. Testing for change points in time series
7. Change detection in streaming data in the era of big data: models and issues
8. Pruning and nonparametric multiple change point detection
9. Sequential nonparametric tests for a change in distribution: an application to detecting radiological anomalies
10. Optimal nonparametric change point detection and localization
11. Estimate of difference between empirical distribution curves in two independent samples
12. On the deviations of an empirical distribution curve
13. A criterion for testing the hypothesis that two samples are from the same population
14. On a test whether two samples are from the same population
15. A method of testing the hypothesis that two samples are from the same population
16. Individual comparisons by ranking methods
17. On a test of whether one of two random variables is stochastically larger than the other
18. A combinatorial test for the problem of two samples from continuous distributions
19. Significance tests which may be applied to samples from any populations
20. Consistency and unbiasedness of certain nonparametric tests
21. Limit theorems associated with variants of the von Mises statistic
22. Modified randomization tests for nonparametric hypotheses
23. On a result by M. Rosenblatt concerning the Mises–Smirnov test
24. Comment on "The spectral analysis of point processes"
25. Computers and unconventional test-statistics
26. Finite sample properties and asymptotic efficiency of Monte Carlo tests
27. Hypothesis testing using L1-distance bootstrap
28. An Introduction to the Bootstrap (Monographs on Statistics and Applied Probability)
29. Exact nonparametric two-sample homogeneity tests for possibly discrete distributions
30. Posterior distribution of percentiles: Bayes' theorem for sampling from a population
31. Characterization of a uniform distribution using order statistics
32. Construction of the bulk of a general population in the case of exchangeable sample values
33. A nonparametric test for the equivalence of populations based on a measure of proximity of samples
34. Proximity measure between samples with repetition factor greater than one
35. A(n) assumption in machine learning
36. Generalization of Bernoulli schemes that arise in order statistics I
37. Generalization of Bernoulli schemes that arise in order statistics II
38. Some generalizations of Bernoulli and Polya–Eggenberger contagion models