key: cord-0044847-duwcljks authors: Hu, Chenyi; Hu, Zhihui H. title: On Statistics, Probability, and Entropy of Interval-Valued Datasets date: 2020-05-16 journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems DOI: 10.1007/978-3-030-50153-2_31 sha: 53256dd909bbd09762fb670697ac8280bc6aa286 doc_id: 44847 cord_uid: duwcljks Applying interval-valued data and methods, researchers have made solid accomplishments in information processing and uncertainty management. Although interval-valued statistics and probability are available for interval-valued data, current inferential decision-making schemes mostly rely on point-valued statistical and probabilistic measures. To enable direct applications of these point-valued schemes to interval-valued datasets, we present point-valued variational statistics, probability, and entropy for interval-valued datasets. Related algorithms are reported with illustrative examples. Statistical and probabilistic measures play a very important role in processing data and managing uncertainty. In the literature, these measures are mostly point-valued and applied to point-valued datasets. While a point-valued datum is intended, in theory, to record a snapshot of an event instantaneously, it is often imprecise in the real world due to systematic and random errors. Applying interval-valued data to encapsulate variations and uncertainty, researchers have developed interval methods for knowledge processing. With data aggregation strategies [1, 5, 21], and others, we are able to reduce large point-valued datasets into smaller interval-valued ones for efficient data management and processing. By doing so, researchers are able to focus more on qualitative properties and ignore insignificant quantitative differences. Studying interval-valued data, Gioia and Lauro developed interval-valued statistics [4] in 2005.
Lodwick and Jamison discussed interval-valued probability in the analysis of problems containing a mixture of possibilistic, probabilistic, and interval uncertainty [17] in 2008. Billard and Diday reported regression analysis of interval-valued data in [2]. Huynh et al. established a justification for decision making under interval uncertainty [13]. Works on applications of interval-valued data in knowledge processing include [3, 8, 16, 19, 20, 22], and many more. Applying interval-valued data to stock market forecasting, Hu and He initially reported astonishing quality improvements in [9]. Specifically, compared against the commonly used point-valued confidence interval predictions, the interval approaches increased the average accuracy ratio of annual stock market forecasts from 12.6% to 64.19% and reduced the absolute mean error from 72.35% to 5.17% [9]. Additional results on stock market forecasts reported in [6, 7, 10], and others, have verified the advantages of using interval-valued data. The paper [12], published in the same volume as this one, further validates these advantages from the perspective of information theory. Using interval-valued data can significantly improve efficiency and effectiveness in information processing and uncertainty management. Therefore, we need to study interval-valued datasets. However, the powerful inferential decision-making schemes in the current literature mostly use point-valued statistical and probabilistic measures, not the interval-valued ones of [4] and [17]. To enable direct applications of these schemes and theory in analyzing interval-valued datasets, we need to supply point-valued statistics and probability for interval-valued datasets. Therefore, the primary objective of this work is to establish and calculate such point-valued measures for interval-valued datasets. To make this paper easy to read, it includes brief introductions to the necessary background information.
It also provides easy-to-follow illustrative examples for the novel concepts and algorithms, in addition to pseudo-code. Numerical results of these examples were obtained with a recent version of Python 3; however, readers may use any preferred general-purpose programming language to verify the results. Prior to our discussion, let us first clarify some basic concepts and notations related to intervals in this paper. An interval is a connected subset of $\mathbb{R}$. We denote an interval-valued object with a boldfaced letter to distinguish it from a point-valued one. We further specify the greatest lower bound and least upper bound of an interval object with an underline and an overline of the same letter, not boldfaced, respectively. For example, while $a$ is a real, the boldfaced letter $\mathbf{a}$ denotes an interval with greatest lower bound $\underline{a}$ and least upper bound $\overline{a}$. That is, $\mathbf{a} = \{x : \underline{a} \le x \le \overline{a},\ x \in \mathbb{R}\} = [\underline{a}, \overline{a}]$. The absolute value of $\mathbf{a}$, defined as $|\mathbf{a}| = \overline{a} - \underline{a}$, is also called the length (or norm) of $\mathbf{a}$; it is the greatest distance between any two numbers in $\mathbf{a}$. The midpoint and radius of an interval $\mathbf{a}$ are defined as $\mathrm{mid}(\mathbf{a}) = \frac{\underline{a} + \overline{a}}{2}$ and $\mathrm{rad}(\mathbf{a}) = \frac{\overline{a} - \underline{a}}{2}$, respectively. Because the midpoint and radius of an interval $\mathbf{a}$ are point-valued, we simply denote them as $\mathrm{mid}(\mathbf{a})$ and $\mathrm{rad}(\mathbf{a})$ without boldfacing the letter. We call $[\underline{a}, \overline{a}]$ the endpoint (or min-max) representation of $\mathbf{a}$. We can also specify an interval $\mathbf{a}$ with $\mathrm{mid}(\mathbf{a})$ and $\mathrm{rad}(\mathbf{a})$, because $\underline{a} = \mathrm{mid}(\mathbf{a}) - \mathrm{rad}(\mathbf{a})$ and $\overline{a} = \mathrm{mid}(\mathbf{a}) + \mathrm{rad}(\mathbf{a})$. In the rest of this paper, we use both the min-max and mid-rad representations for an interval-valued object. While we use a boldfaced lowercase letter to indicate an interval, we denote an interval-valued dataset, i.e., a collection of real intervals, with a boldfaced uppercase letter. For instance, $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ is an interval-valued dataset. The sets $\underline{X} = \{\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_n\}$ and $\overline{X} = \{\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_n\}$ are the left- and right-end sets of $\mathbf{X}$, respectively.
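As a minimal illustration of the min-max and mid-rad representations above (the `Interval` class name and its methods are our own sketch, not notation from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float  # greatest lower bound (underline a)
    hi: float  # least upper bound (overline a)

    def norm(self) -> float:
        """Length |a| = hi - lo of the interval."""
        return self.hi - self.lo

    def mid(self) -> float:
        """Midpoint (lo + hi) / 2."""
        return (self.lo + self.hi) / 2

    def rad(self) -> float:
        """Radius (hi - lo) / 2."""
        return (self.hi - self.lo) / 2

def from_mid_rad(m: float, r: float) -> Interval:
    """Recover the min-max representation from the mid-rad one."""
    return Interval(m - r, m + r)

a = Interval(1.0, 5.0)
assert (a.mid(), a.rad(), a.norm()) == (3.0, 2.0, 4.0)
assert from_mid_rad(a.mid(), a.rad()) == a
```

The round trip in the last line shows that the two representations carry exactly the same information.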
Although items in a set are not ordered, $\underline{x}_i \in \underline{X}$ and $\overline{x}_i \in \overline{X}$ are related to the same interval $\mathbf{x}_i \in \mathbf{X}$. For convenience, we denote both $\underline{X}$ and $\overline{X}$ as ordered tuples; they are the left- and right-endpoints of $\mathbf{X}$. That is, $\underline{X} = (\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_n)$ and $\overline{X} = (\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_n)$. Similarly, the midpoint and radius of $\mathbf{X}$ are point-valued tuples: $\mathrm{mid}(\mathbf{X}) = (\mathrm{mid}(\mathbf{x}_1), \ldots, \mathrm{mid}(\mathbf{x}_n))$ and $\mathrm{rad}(\mathbf{X}) = (\mathrm{rad}(\mathbf{x}_1), \ldots, \mathrm{rad}(\mathbf{x}_n))$, respectively. Example 1. Let $\mathbf{X}_0 = \{[1, 5], [1.5, 3.5], [2, 3], [2.5, 7], [4, 6]\}$. Then its left-endpoint is $\underline{X}_0 = (1, 1.5, 2, 2.5, 4)$ and its right-endpoint is $\overline{X}_0 = (5, 3.5, 3, 7, 6)$. The midpoint of $\mathbf{X}_0$ is $\mathrm{mid}(\mathbf{X}_0) = \frac{\underline{X}_0 + \overline{X}_0}{2} = (3, 2.5, 2.5, 4.75, 5)$, and the radius is $\mathrm{rad}(\mathbf{X}_0) = \frac{\overline{X}_0 - \underline{X}_0}{2} = (2, 1, 0.5, 2.25, 1)$. We use this sample dataset $\mathbf{X}_0$ in the rest of this paper, for its simplicity, to illustrate concepts and algorithms. In the rest of this paper, we discuss statistics of an interval-valued dataset in Sect. 2; define point-valued probability distributions for an interval-valued dataset in Sect. 3; introduce point-valued information entropy in Sect. 4; and summarize the main results and future work in Sect. 5. We introduce positional statistics for an interval-valued dataset first, and then discuss its point-valued variance and standard deviation. The left- and right-endpoints, midpoint, and radius $\underline{X}$, $\overline{X}$, $\mathrm{mid}(\mathbf{X})$, and $\mathrm{rad}(\mathbf{X})$, as presented in Example 1, are among the positional statistics of an interval-valued dataset $\mathbf{X}$. The mean of $\mathbf{X}$, denoted $\mu_\mathbf{X}$, is the arithmetic average of $\mathbf{X}$. Because interval addition and division by a positive scalar are well defined, $\mu_\mathbf{X} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$ is itself an interval, with $\mathrm{mid}(\mu_\mathbf{X}) = \mu_{\mathrm{mid}(\mathbf{X})}$ and $\mathrm{rad}(\mu_\mathbf{X}) = \mu_{\mathrm{rad}(\mathbf{X})}$. We now define a few more observational statistics for $\mathbf{X}$. 1. The envelope of $\mathbf{X}$ is the interval $\mathrm{env}(\mathbf{X}) = [\min(\underline{X}), \max(\overline{X})]$; 2. The core of $\mathbf{X}$ is $\mathrm{core}(\mathbf{X}) = \bigcap_{i=1}^{n}\mathbf{x}_i$, which equals $[\max(\underline{X}), \min(\overline{X})]$ when $\max(\underline{X}) \le \min(\overline{X})$ and is empty otherwise; 3. The mode of $\mathbf{X}$ is $\mathrm{mode}(\mathbf{X}) = \big(\bigcap_{s \in S_j}\mathbf{x}_s,\ |S_j|\big)$. In other words, $\forall \mathbf{x}_i \in \mathbf{X}$, $\mathbf{x}_i$ is a subset of $\mathrm{env}(\mathbf{X})$, and $\mathrm{core}(\mathbf{X})$ is a subset of $\mathbf{x}_i$. Furthermore, $\mathrm{mode}(\mathbf{X})$ is an ordered tuple in which $\bigcap_{s \in S_j}\mathbf{x}_s$ is the non-empty intersection of the $\mathbf{x}_s$ for all $s \in S_j \subseteq \{1, 2, \ldots, n\}$, such that the cardinality of $S_j$ is the greatest. For a given $\mathbf{X}$, its mode may not be unique.
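The endpoint, midpoint, radius, and mean tuples of Example 1 can be reproduced with a short sketch (representing each interval as a `(lo, hi)` pair is our own convention, not the paper's):

```python
# Sample dataset X0 from Example 1, as (lo, hi) pairs.
X0 = [(1, 5), (1.5, 3.5), (2, 3), (2.5, 7), (4, 6)]

left  = tuple(lo for lo, hi in X0)            # left-endpoint tuple
right = tuple(hi for lo, hi in X0)            # right-endpoint tuple
mids  = tuple((lo + hi) / 2 for lo, hi in X0) # mid(X0)
rads  = tuple((hi - lo) / 2 for lo, hi in X0) # rad(X0)

# The mean of X is the interval averaging the left and right endpoints.
n = len(X0)
mean = (sum(left) / n, sum(right) / n)

assert left == (1, 1.5, 2, 2.5, 4)
assert right == (5, 3.5, 3, 7, 6)
assert mids == (3.0, 2.5, 2.5, 4.75, 5.0)
assert rads == (2.0, 1.0, 0.5, 2.25, 1.0)
assert mean == (2.2, 4.9)
```

The last assertion reproduces the interval mean $\mu_{\mathbf{X}_0} = [2.2, 4.9]$ used later in the paper.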
This is because there may be multiple cardinality-$k$ subsets of $\{1, 2, \ldots, n\}$ satisfying the non-empty intersection requirement $\bigcap_{s \in S_j}\mathbf{x}_s \neq \emptyset$. Corollary 1 is straightforward. Instead of providing a proof, we provide the mean, envelope, core, and mode of the sample dataset $\mathbf{X}_0 = \{[1, 5], [1.5, 3.5], [2, 3], [2.5, 7], [4, 6]\}$. In addition to its endpoints, midpoint, and radius presented in Example 1, we have its mean $\mu_{\mathbf{X}_0} = [2.2, 4.9]$; $\mathrm{env}(\mathbf{X}_0) = [1, 7]$; $\mathrm{core}(\mathbf{X}_0) = \emptyset$, because $\max(\underline{X}_0) = 4$ is greater than $\min(\overline{X}_0) = 3$; and $\mathrm{mode}(\mathbf{X}_0) = ([2.5, 3], 4)$. Figure 1 illustrates the sample dataset $\mathbf{X}_0$. From it, one may visualize $\mathrm{env}(\mathbf{X}_0)$ and $\mathrm{mode}(\mathbf{X}_0)$ by imagining a vertical line, like the y-axis, continuously moving from left to right. The first and last points at which the line touches any $\mathbf{x}_i \in \mathbf{X}_0$ determine the envelope $\mathrm{env}(\mathbf{X}_0) = [1, 7]$. The line touches at most four of the intervals $\mathbf{x}_i \in \mathbf{X}_0$, namely over $[2.5, 3]$. Hence, the mode is $\mathrm{mode}(\mathbf{X}_0) = ([2.5, 3], 4)$. While finding the envelope, core, and mean of $\mathbf{X}$ is straightforward, determining the mode of $\mathbf{X}$ involves the $2n$ numbers in $\underline{X}$ and $\overline{X}$, which in general divide $\mathrm{env}(\mathbf{X})$ into $2n - 1$ sub-intervals (though some of them may be degenerate points). Each of these $2n - 1$ sub-intervals is a candidate for the non-empty intersection part of the mode. Any $\mathbf{x}_i \in \mathbf{X}$ may cover some of these candidates consecutively. For each candidate, we accumulate its occurrences over all $\mathbf{x}_i \in \mathbf{X}$. The mode(s) of $\mathbf{X}$ is (are) the candidate(s) with the (same) highest occurrence. As a special case, if $\mathrm{core}(\mathbf{X})$ is not empty, then $\mathrm{mode}(\mathbf{X}) = (\mathrm{core}(\mathbf{X}), n)$. We summarize the above as an algorithm.
Input: X, an n-element interval dataset.
Output: mode(X).
# Find the candidates:
Concatenate the endpoints of X as a list c; sort c
The consecutive pairs in c form the 2n - 1 candidate sub-intervals
# Accumulate occurrences:
For each x_i in X:
    Increment the count of every candidate contained in x_i
End for
# Report:
Return the candidate(s) with the greatest count, together with that count
The complexity of the algorithm is O(n^2), because each interval x_i may update the count in each of the 2n - 1 candidates.
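The mode procedure above can be sketched as follows, again with intervals as `(lo, hi)` pairs (our convention). For simplicity the sketch returns all best candidate sub-intervals rather than merging adjacent ones:

```python
def mode(X):
    """Mode of an interval-valued dataset, per the candidate-counting scheme."""
    # The 2n endpoints, sorted and de-duplicated, cut env(X) into candidates.
    cuts = sorted({e for lo, hi in X for e in (lo, hi)})
    candidates = list(zip(cuts, cuts[1:]))  # at most 2n - 1 sub-intervals
    # Count, for each candidate, how many intervals of X cover it entirely.
    counts = [sum(1 for lo, hi in X if lo <= a and b <= hi)
              for a, b in candidates]
    k = max(counts)
    return [c for c, cnt in zip(candidates, counts) if cnt == k], k

X0 = [(1, 5), (1.5, 3.5), (2, 3), (2.5, 7), (4, 6)]
best, k = mode(X0)
assert best == [(2.5, 3)] and k == 4   # mode(X0) = ([2.5, 3], 4)
```

The double loop over n intervals and up to 2n - 1 candidates makes the quadratic complexity visible directly in the code.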
In the literature, the variance of a point-valued dataset X is defined as

$Var(X) = \frac{1}{n}\sum_{i=1}^{n}|x_i - \mu|^2, \qquad (2)$

in which the term $|x_i - \mu|$ is the distance between $x_i \in X$ and the mean $\mu$ of $X$. To use (2) to define a variance for an interval-valued $\mathbf{X}$, we need a notion of point-valued distance between the intervals $\mathbf{x}_i \in \mathbf{X}$ and $\mu_\mathbf{X}$. May we simply use $|\mathbf{a} - \mathbf{b}|$, the absolute value of the difference between two intervals $\mathbf{a}$ and $\mathbf{b}$, as their distance? Unfortunately, it does not work. In interval arithmetic [18], the difference between two intervals $\mathbf{a}$ and $\mathbf{b}$ is defined as follows:

$\mathbf{a} - \mathbf{b} = [\underline{a} - \overline{b},\ \overline{a} - \underline{b}]. \qquad (3)$

Equation (3) ensures $\forall a \in \mathbf{a}, \forall b \in \mathbf{b}, a - b \in \mathbf{a} - \mathbf{b}$. However, it also implies $|\mathbf{a} - \mathbf{b}| = \max\{|a - b| : a \in \mathbf{a}, b \in \mathbf{b}\}$, which is the maximum distance between $a \in \mathbf{a}$ and $b \in \mathbf{b}$. Mathematically, a distance between two nonempty sets $A$ and $B$ is usually defined as the minimum distance between $a \in A$ and $b \in B$, not the maximum. Hence, we need to define a notion of distance between two intervals.

Definition 2. The distance between two intervals $\mathbf{a}$ and $\mathbf{b}$ is

$dist(\mathbf{a}, \mathbf{b}) = |\mathrm{mid}(\mathbf{a}) - \mathrm{mid}(\mathbf{b})| + |\mathrm{rad}(\mathbf{a}) - \mathrm{rad}(\mathbf{b})|. \qquad (4)$

Definition 2 is in fact an extension of the distance between two reals, because the radius of a real is zero and the midpoint of a real is always itself. Replacing the distance in (2) with (4), we have the point-valued variance of $\mathbf{X}$ as follows:

$Var(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n}\big(|\mathrm{mid}(\mathbf{x}_i) - \mathrm{mid}(\mu_\mathbf{X})| + |\mathrm{rad}(\mathbf{x}_i) - \mathrm{rad}(\mu_\mathbf{X})|\big)^2.$

Expanding the square, the expression above has three terms, all of which involve $\mathrm{mid}(\mu_\mathbf{X})$ and $\mathrm{rad}(\mu_\mathbf{X})$. Note that $\mathrm{mid}(\mu_\mathbf{X}) = \mu_{\mathrm{mid}(\mathbf{X})}$ and $\mathrm{rad}(\mu_\mathbf{X}) = \mu_{\mathrm{rad}(\mathbf{X})}$. Therefore, the first term, $\frac{1}{n}\sum_{i=1}^{n}(\mathrm{mid}(\mathbf{x}_i) - \mu_{\mathrm{mid}(\mathbf{X})})^2$, is $Var(\mathrm{mid}(\mathbf{X}))$ according to (2). Similarly, the second term, $\frac{1}{n}\sum_{i=1}^{n}(\mathrm{rad}(\mathbf{x}_i) - \mu_{\mathrm{rad}(\mathbf{X})})^2$, is $Var(\mathrm{rad}(\mathbf{X}))$. The third term is related to the absolute covariance between $\mathrm{mid}(\mathbf{X})$ and $\mathrm{rad}(\mathbf{X})$. Summarizing the discussion above, we have the point-valued variance for an interval-valued dataset $\mathbf{X}$:

$Var(\mathbf{X}) = Var(\mathrm{mid}(\mathbf{X})) + Var(\mathrm{rad}(\mathbf{X})) + \frac{2}{n}\sum_{i=1}^{n}|\mathrm{mid}(\mathbf{x}_i) - \mu_{\mathrm{mid}(\mathbf{X})}|\,|\mathrm{rad}(\mathbf{x}_i) - \mu_{\mathrm{rad}(\mathbf{X})}|. \qquad (5)$

Because midpoints and radii of interval-valued objects are point-valued, the variance defined in (5) is also point-valued. Hence, we have the point-valued standard deviation of $\mathbf{X}$ as usual:

$\sigma_\mathbf{X} = \sqrt{Var(\mathbf{X})}. \qquad (6)$

In evaluating (5) and (6), one does not need interval computing at all.
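Equations (5) and (6) use only ordinary point-valued arithmetic, so they can be evaluated directly; a minimal sketch, assuming `(lo, hi)` pairs and the population ($1/n$) variances of (2):

```python
from math import sqrt

def interval_variance(X):
    """Var(X) = Var(mid X) + Var(rad X) + (2/n) * sum |d_mid| * |d_rad|."""
    n = len(X)
    mids = [(lo + hi) / 2 for lo, hi in X]
    rads = [(hi - lo) / 2 for lo, hi in X]
    mu_m, mu_r = sum(mids) / n, sum(rads) / n
    var_m = sum((m - mu_m) ** 2 for m in mids) / n
    var_r = sum((r - mu_r) ** 2 for r in rads) / n
    # Absolute-covariance term coupling midpoints and radii.
    abs_cov = sum(abs(m - mu_m) * abs(r - mu_r)
                  for m, r in zip(mids, rads)) / n
    return var_m + var_r + 2 * abs_cov

def interval_std(X):
    """Point-valued standard deviation (6)."""
    return sqrt(interval_variance(X))

X0 = [(1, 5), (1.5, 3.5), (2, 3), (2.5, 7), (4, 6)]
assert round(interval_variance(X0), 3) == 2.932
```

For the sample dataset this gives Var(mid) = 1.21, Var(rad) = 0.44, and an absolute-covariance term of 1.282, consistent with the worked example that follows.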
For the sample dataset $\mathbf{X}_0$, we have its point-valued variance $Var(\mathbf{X}_0) = Var(\mathrm{mid}(\mathbf{X}_0)) + Var(\mathrm{rad}(\mathbf{X}_0)) + \frac{2}{5}\sum_{i=1}^{5}|\mathrm{mid}(\mathbf{x}_i) - 3.55|\,|\mathrm{rad}(\mathbf{x}_i) - 1.35| = 1.21 + 0.44 + 1.282 = 2.932$. It is worthwhile to note that Eq. (5) is an extension of (2) and is applicable to point-valued datasets too. This is because, for every $x_i$ in a point-valued $X$, $\mathrm{rad}(x_i) = 0$ and $\mathrm{mid}(x_i) = x_i$ always. Hence, $Var(X) = Var(\mathrm{mid}(X))$ for a point-valued $X$. An interval-valued dataset $\mathbf{X}$ can be viewed as a sample of an interval-valued population. In this section, we study practical ways to find probability distributions for an interval-valued dataset $\mathbf{X}$. Our discussion addresses two different cases: one assumes distribution information for all $\mathbf{x}_i \in \mathbf{X}$; the other does not. Our discussion involves the concept of a probability distribution over an interval, so let us very briefly review the literature first. A function $f(x)$ is a probability density function (pdf) of a random variable $x$ if $f(x) \ge 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Commonly used families include the normal distribution, with mean $\mu$ and standard deviation $\sigma$, and the beta distribution, in which both parameters $\alpha$ and $\beta$ are positive and the normalizing constant involves the gamma function $\Gamma(t)$. There are software tools available to fit point-valued sample data, i.e., to computationally determine the parameter values of a chosen type of distribution. For instance, the Python scipy.stats module can find the optimal $\mu$ and $\sigma$ to fit a point-valued dataset with a normal distribution, and/or $\alpha$ and $\beta$ for a beta distribution. It is safe to assume the availability of a pdf for each $\mathbf{x}_i \in \mathbf{X}$, both theoretically and computationally. In practice, an interval $\mathbf{x}_i \in \mathbf{X}$ is often obtained by aggregating observed points. For instance, in [9] and [11], min-max and confidence intervals are applied to aggregate points into intervals, respectively. If an interval is provided directly, one can always pick points from the interval and fit them with a selected probability distribution computationally. Hereafter, we denote the pdf of $\mathbf{x}_i \in \mathbf{X}$ as $pdf_i(x)$. We now define a notion of pdf for an interval-valued dataset $\mathbf{X}$.
The theorem below provides a practical way to calculate a pdf for $\mathbf{X}$.

Theorem 1. Let $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ be an interval-valued dataset and $pdf_i$ be the pdf of $\mathbf{x}_i \in \mathbf{X}$. Then the function

$f(x) = \frac{1}{n}\sum_{i=1}^{n} pdf_i(x) \qquad (8)$

is a pdf of $\mathbf{X}$.

Equation (8) actually provides a practical way of calculating the pdf of $\mathbf{X}$. Provided $pdf_i(x)$ for each $\mathbf{x}_i \in \mathbf{X}$, we have the algorithm in pseudo-code below:

Input: an n-item interval-valued dataset X; pdf_i(x) for every x_i in X
Output: pdf(X)
# Initialization:
Concatenate the left and right endpoints of X as a list c
Sort c
For i from 1 to 2n - 1:
    segment_i = (c_i, c_{i+1}, 0)
End for
# Accumulating the pdf on each segment:
For each x_i in X:
    Find j and k such that c_j = lo(x_i) and c_k = hi(x_i)
    For l from j to k - 1:
        segment_l.pdf += pdf_i
    End for
End for
# Normalizing the pdf:
For i from 1 to 2n - 1:
    segment_i.pdf /= n
End for
Return segment_i for all i in {1, 2, ..., 2n - 1}

Example 2. Find a pdf for the sample dataset $\mathbf{X}_0 = \{[1, 5], [1.5, 3.5], [2, 3], [2.5, 7], [4, 6]\}$. For simplicity, we assume a uniform distribution for each $pdf_i$, i.e., $pdf_i(x) = 1/|\mathbf{x}_i|$ for $x \in \mathbf{x}_i$ and $0$ otherwise. Applying Algorithm 2, we obtain a pdf that is a stair (step) function; this is because of the uniform distribution assumption on each $\mathbf{x}_i \in \mathbf{X}$. Here are a few additional notes on finding a pdf for $\mathbf{X}$ with Algorithm 2. If assuming a uniform distribution, how do we handle the case where $\exists i$ such that $\underline{x}_i = \overline{x}_i$? First of all, an interval element $\mathbf{x}_i$ is usually not degenerated to a constant. Even if there is an $i$ such that $\underline{x}_i = \overline{x}_i$, we can always assign an arbitrary non-negative pdf value at that point; this does not impact the calculation of probability when integrating the pdf. Algorithm 2 assumes $pdf_i(x) = 0$ for all $x \notin \mathbf{x}_i$. If that is not the case, the $2n$ numbers in $\underline{X}$ and $\overline{X}$ divide $\mathbb{R}$ into $2n + 1$ sub-intervals: $(-\infty, \min(\underline{X}))$ and $(\max(\overline{X}), \infty)$, together with the $2n - 1$ sub-intervals in $\mathrm{env}(\mathbf{X})$. Therefore, the accumulation loop in Algorithm 2 should run through all $2n + 1$ sub-intervals before normalizing by $n$. Another implicit assumption of Theorem 1 is that all $\mathbf{x}_i \in \mathbf{X}$ are equally weighted. However, that is not necessary.
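A sketch of Algorithm 2 under the uniform-distribution assumption of Example 2 (each $pdf_i$ is $1/|\mathbf{x}_i|$ inside $\mathbf{x}_i$ and $0$ outside); intervals are `(lo, hi)` pairs by our convention:

```python
def piecewise_pdf(X):
    """Return the stair-function pdf of X as (start, end, density) segments."""
    n = len(X)
    # The sorted endpoints cut env(X) into at most 2n - 1 segments.
    cuts = sorted({e for lo, hi in X for e in (lo, hi)})
    segments = []
    for a, b in zip(cuts, cuts[1:]):
        # Accumulate the uniform densities of the intervals covering [a, b],
        # then normalize by n as in (8).
        density = sum(1 / (hi - lo) for lo, hi in X
                      if lo <= a and b <= hi) / n
        segments.append((a, b, density))
    return segments

X0 = [(1, 5), (1.5, 3.5), (2, 3), (2.5, 7), (4, 6)]
pdf = piecewise_pdf(X0)

# The stair function integrates to 1 over env(X0):
assert abs(sum((b - a) * d for a, b, d in pdf) - 1.0) < 1e-9
# e.g. on [2, 2.5), three intervals overlap: (1/4 + 1/2 + 1) / 5 = 0.35
assert abs(pdf[2][2] - 0.35) < 1e-9
```

The unit-integral check is a convenient way to validate any implementation of Algorithm 2.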
If needed, one may place a positive weight $w_i$ on each $pdf_i$, as stated in Corollary 2.

Corollary 2. Let $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ be an interval-valued dataset and $pdf_i$ be the pdf of $\mathbf{x}_i \in \mathbf{X}$. Then, for any weights $w_i > 0$ with $\sum_{i=1}^{n} w_i = 1$, the function

$f(x) = \sum_{i=1}^{n} w_i\, pdf_i(x)$

is a pdf of $\mathbf{X}$.

A proof of Corollary 2 is straightforward too. We have successfully applied the corollary in a computational study of the stock market [12]. It is not necessary to assume a probability distribution for every $\mathbf{x}_i \in \mathbf{X}$ in order to find a pdf of $\mathbf{X}$. An interval $\mathbf{x}$ is determined by its midpoint and radius. Let $u = \mathrm{mid}(\mathbf{x})$ and $v = \mathrm{rad}(\mathbf{x})$ be two point-valued random variables. Then the pdf of $\mathbf{x}$ is a joint density $f(u, v)$. If we assume a normal distribution for every linear combination $au + bv$, then $f(u, v)$ is a bivariate normal distribution [25]. The pdf of a bivariate normal distribution is:

$f(u,v) = \frac{1}{2\pi\sigma_u\sigma_v\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(u-\mu_u)^2}{\sigma_u^2} - \frac{2\rho(u-\mu_u)(v-\mu_v)}{\sigma_u\sigma_v} + \frac{(v-\mu_v)^2}{\sigma_v^2}\right]\right), \qquad (11)$

where $\rho$ is the normalized correlation between $u$ and $v$, i.e., the ratio of their covariance to the product of $\sigma_u$ and $\sigma_v$. Applying this pdf, we are able to estimate the probability over a region $[u_1, u_2] \times [v_1, v_2]$ by integration. To calculate the probability of an interval $\mathbf{x}$ whose midpoint and radius are $u_0$ and $v_0$, we need a marginal pdf for either $u$ or $v$. If we fix $u = u_0$, then the marginal pdf of $v$ follows a single-variable normal distribution:

$pdf_v(v) = \frac{1}{\sigma_v\sqrt{2\pi}} \exp\left(-\frac{(v-\mu_v)^2}{2\sigma_v^2}\right), \qquad (13)$

and the probability of $\mathbf{x}$ is then estimated by integrating (13) around $v_0$ (14). An interval-valued dataset $\mathbf{X}$ provides us its $\mathrm{mid}(\mathbf{X})$ and $\mathrm{rad}(\mathbf{X})$; they are point-valued sample sets of $u$ and $v$, respectively. All of $\mu_{\mathrm{mid}(\mathbf{X})}$, $\mu_{\mathrm{rad}(\mathbf{X})}$, $\sigma_{\mathrm{mid}(\mathbf{X})}$, and $\sigma_{\mathrm{rad}(\mathbf{X})}$ can be calculated as usual to estimate $\mu_u$, $\mu_v$, $\sigma_u$, and $\sigma_v$ in (11). For instance, from the sample $\mathbf{X}_0$, we have $\mu_{\mathrm{mid}(\mathbf{X}_0)} = 3.55$, $\mu_{\mathrm{rad}(\mathbf{X}_0)} = 1.35$, $\sigma_{\mathrm{mid}(\mathbf{X}_0)} = 1.1$, $\sigma_{\mathrm{rad}(\mathbf{X}_0)} = 0.66$, and $\rho = 0.404$. Furthermore, using $\mu_{\mathrm{rad}(\mathbf{X}_0)} = 1.35$ and $\sigma_{\mathrm{rad}(\mathbf{X}_0)} = 0.66$ in (13), we can estimate the probability of an arbitrary interval $\mathbf{x}$ with (14). So far, we have established practical ways to calculate point-valued variance, standard deviation, and probability distributions for an interval-valued dataset $\mathbf{X}$.
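The parameter estimates for (11) quoted above can be reproduced from mid(X0) and rad(X0); this sketch uses population (1/n) moments, which is our assumption:

```python
from math import sqrt

def bivariate_params(X):
    """Estimate mu_u, mu_v, sigma_u, sigma_v, and rho from an interval dataset."""
    n = len(X)
    u = [(lo + hi) / 2 for lo, hi in X]  # midpoints: samples of u
    v = [(hi - lo) / 2 for lo, hi in X]  # radii: samples of v
    mu_u, mu_v = sum(u) / n, sum(v) / n
    s_u = sqrt(sum((x - mu_u) ** 2 for x in u) / n)
    s_v = sqrt(sum((x - mu_v) ** 2 for x in v) / n)
    # rho is the covariance normalized by the product of the deviations.
    cov = sum((x - mu_u) * (y - mu_v) for x, y in zip(u, v)) / n
    return mu_u, mu_v, s_u, s_v, cov / (s_u * s_v)

X0 = [(1, 5), (1.5, 3.5), (2, 3), (2.5, 7), (4, 6)]
mu_u, mu_v, s_u, s_v, rho = bivariate_params(X0)
assert (round(mu_u, 2), round(mu_v, 2)) == (3.55, 1.35)
assert round(s_u, 2) == 1.1 and round(s_v, 2) == 0.66
assert round(rho, 3) == 0.404
```

These are exactly the values the paper reports for the sample dataset, so the mid-rad moments suffice to parameterize the bivariate normal model.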
With them, we are able to directly apply commonly available inferential decision-making schemes to interval-valued datasets. While it is out of the scope of this paper to discuss specific applications of inferential statistics on an interval-valued dataset, we are interested in measuring the amount of information in an interval-valued dataset. Information entropy is the average rate at which information is produced by a stochastic source of data [24]. Shannon introduced the concept of entropy in his seminal paper "A Mathematical Theory of Communication" [23]. The information entropy associated with the possible data values is:

$Entropy(X) = -\sum_{i} p(x_i) \log p(x_i), \qquad (15)$

where $p(x_i)$ is the probability of $x_i \in X$. An interval-valued dataset $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ divides the real axis into $2n + 1$ sub-intervals. Using $\mathbf{P}$ to denote the partition and $\mathbf{x}^{(j)}$ to specify its $j$-th element, we have $\mathbf{P} = (\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(2n+1)})$. As illustrated in Example 2, we can apply Algorithm 2 to find the $pdf_j$ for each $\mathbf{x}^{(j)} \in \mathbf{P}$. Then the probability $p_j = \int_{\mathbf{x}^{(j)}} pdf_j(t)\,dt$ is available. Hence, we can apply (15) to calculate the entropy of an interval-valued dataset $\mathbf{X}$. For the reader's convenience, we summarize the steps of finding the entropy of $\mathbf{X}$ in the algorithm below.

Input: an n-item interval dataset X; pdf_i for all x_i in X
Output: Entropy(X)
# Find the partition of the real axis:
Concatenate the left and right endpoints of X as a list c
Sort c
The list c forms a (2n + 1)-element partition P of (-inf, inf)
# Find the probability of each x^(j) in P:
For j from 1 to 2n + 1:
    Find pdf_j on x^(j) with Algorithm 2
    p_j = integral of pdf_j over x^(j)
End for
# Calculate the entropy:
Entropy(X) = 0
For j from 1 to 2n + 1:
    Entropy(X) -= p_j log p_j
End for
Return Entropy(X)

The example below finds the entropy of the sample dataset $\mathbf{X}_0$ under the same uniform-distribution assumption as in Example 2. The entropy of $\mathbf{X}_0$ is $Entropy(\mathbf{X}_0) = -\sum_{i} p_i \log p_i = 2.019$ (using the natural logarithm). Algorithm 3 provides a much-needed tool for studying the point-valued information entropy of an interval-valued dataset.
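Algorithm 3 can be sketched as follows under the uniform-distribution assumption of Example 2, using the natural logarithm (an assumption consistent with the reported value 2.019):

```python
from math import log

def entropy(X):
    """Entropy of an interval-valued dataset, per Algorithm 3's steps."""
    n = len(X)
    # The sorted endpoints partition the axis; the two infinite tails have
    # zero probability under the uniform assumption, so we skip them.
    cuts = sorted({e for lo, hi in X for e in (lo, hi)})
    H = 0.0
    for a, b in zip(cuts, cuts[1:]):
        # Stair-function density on [a, b], as in Algorithm 2.
        density = sum(1 / (hi - lo) for lo, hi in X
                      if lo <= a and b <= hi) / n
        p = density * (b - a)  # probability of this sub-interval
        if p > 0:
            H -= p * log(p)
    return H

X0 = [(1, 5), (1.5, 3.5), (2, 3), (2.5, 7), (4, 6)]
assert round(entropy(X0), 3) == 2.019
```

Because each segment's density is constant, the integral p_j reduces to density times segment length, so no numerical integration is needed.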
Applying it, we have investigated entropies of the real-world financial dataset used in the studies of stock market forecasts [6, 7] and [9], from the perspective of information theory. The results are reported in [12]. They not only reveal the deep reason for the significant quality improvements reported before, but also validate the concepts and algorithms presented here as a successful application. Recent advances have shown that using interval-valued data can significantly improve the quality and efficiency of information processing and uncertainty management. For interval-valued datasets, this work establishes the much-needed concepts of point-valued variational statistics, probability, and entropy. Furthermore, this paper contains practical algorithms to find these point-valued measures. It provides additional theoretical foundations for applying point-valued methods in analyzing interval-valued datasets. These point-valued measures enable us to directly apply currently available, powerful point-valued statistical, probabilistic, and information-theoretic results to interval-valued datasets. Applying these measures in various applications is definitely a high priority of our future work. In fact, using this work as the theoretical foundation, we have successfully analyzed the entropies of the real-world financial dataset related to the stock market forecasting mentioned in the introduction of this paper. The obtained results are reported in [12], published in the same volume as this one. On the theoretical side, future work includes extending the concepts in this paper from single-dimensional to multi-dimensional interval-valued datasets.
- New types of aggregation functions for interval-valued fuzzy setting and preservation of pos-B and nec-B-transitivity in decision making problems
- Regression analysis for interval-valued data
- Uncertainty measurement for interval-valued information systems
- Basic statistical methods for interval data
- Midpoint method and accuracy of variability forecasting
- Impacts of interval computing on stock market forecasting
- Knowledge Processing with Interval and Soft Computing
- An application of interval methods to stock market forecasting
- Using interval function approximation to estimate uncertainty
- A note on probabilistic confidence of the stock market ILS interval forecasts
- A computational study on the entropy of interval-valued datasets from the stock market
- On decision making under interval uncertainty: a new justification of Hurwicz optimism-pessimism approach and its use in group decision making
- Generating and applying rules for interval valued fuzzy observations
- Interval-valued probability in the analysis of problems containing a mixture of possibilistic, probabilistic, and interval uncertainty
- Methods and Applications of Interval Analysis
- Bandwidth variability prediction with rolling interval least squares (RILS)
- Interval-valued centroids in K-means algorithms
- Uncertainty Data in Interval-Valued Fuzzy Set Theory: Properties, Algorithms and Applications, 1st edn
- An interval-radial algorithm for hierarchical clustering analysis
- A mathematical theory of communication
- Binary normal distribution