Distributions of a General Reduced-Order Dependence Measure and Conditional Independence Testing
Mariusz Kubkowski, Małgorzata Łazęcka, Jan Mielniczuk
Computational Science – ICCS 2020, 25 May 2020. DOI: 10.1007/978-3-030-50436-6_51

Abstract. We study distributions of a general reduced-order dependence measure and apply the results to conditional independence testing and feature selection. Experiments with Bayesian networks indicate that using the introduced test in the Grow and Shrink algorithm instead of Conditional Mutual Information yields promising results for Markov Blanket discovery in terms of the F measure.

The asymptotic distributions of such reduced-order criteria remain largely unknown, which hinders the study of the associated selection methods. Here we show that Ĵ_{β,γ}(X, Y|X_S) exhibits dichotomous behaviour: its distribution is either normal or coincides with the distribution of a certain quadratic form in normal variables. The second case is studied in detail for binary Y. In particular, for the two popular criteria CIFE and JMI, conditions under which their distributions converge to distributions of a quadratic form are made explicit. As the two cases of the dichotomy differ in the behaviour of the variance of Ĵ_{β,γ}, the order of convergence of the variance is used to detect which case actually holds. A parametric permutation test (i.e. a test in which permutations are used to estimate the parameters of the chosen benchmark distribution) is then used to check whether a candidate variable X is independent of Y given X_S.

We denote by p(x) := P(X = x), x ∈ 𝒳, the probability mass function corresponding to X, where 𝒳 is the domain of X and |𝒳| is its cardinality. The joint probability is denoted by p(x, y) = P(X = x, Y = y). The entropy of a discrete random variable X is defined as

H(X) = −∑_{x∈𝒳} p(x) log p(x).  (1)

Entropy quantifies the uncertainty of observing random values of X. For discrete X, H(X) is non-negative and equals 0 when the probability mass is concentrated at one point. The definition naturally extends to random vectors (i.e. X can be a multivariate random variable) by using multivariate instead of univariate probabilities. In the following we will frequently consider subvectors of X = (X_1, ..., X_p), the vector of all potential predictors of the class index Y. The conditional entropy of X given Y is written as

H(X|Y) = ∑_{y∈𝒴} p(y) H(X|Y = y)  (2)

and the mutual information (MI) between X and Y is

I(X, Y) = H(X) − H(X|Y).  (3)

It can be interpreted as the amount of uncertainty in X which is removed when Y is known, which is consistent with the intuitive meaning of mutual information as the amount of information that one variable provides about another. MI equals zero if and only if X and Y are independent, and thus it is able to discover nonlinear relationships. It is easily seen that I(X, Y) = ∑_{x,y} p(x, y) log[p(x, y)/(p(x)p(y))], so in particular MI is symmetric in X and Y. A natural extension of MI is conditional mutual information (CMI), defined as

I(X, Y|Z) = ∑_{z∈𝒵} p(z) I(X, Y|Z = z),  (4)

which measures the conditional dependence between X and Y given Z. An important property is the chain rule for MI, which connects I((X_1, X_2), Y) to I(X_1, Y):

I((X_1, X_2), Y) = I(X_1, Y) + I(X_2, Y|X_1).  (5)

For more properties of the basic measures described above we refer to [3].

A quantity used in the next sections is interaction information (II) [9]. The 3-way interaction information is defined as

II(X_1, X_2, Y) = I((X_1, X_2), Y) − I(X_1, Y) − I(X_2, Y) = I(X_1, Y|X_2) − I(X_1, Y) = I(X_1, X_2|Y) − I(X_1, X_2),  (6)

which is consistent with the intuitive meaning of interaction as a situation in which the effect of one variable on the class variable depends on the value of another variable.
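Throughout, these quantities are estimated by plugging in observed fractions in place of the probabilities. A minimal plug-in sketch (Python; the function names and the use of natural logarithms are our own choices, and the conditioning variable is assumed univariate):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Plug-in entropy H(X) as in (1), from a sequence of discrete observations."""
    n = len(values)
    p = np.array(list(Counter(values).values()), dtype=float) / n
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y):
    """Plug-in MI, using I(X, Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(list(x)) + entropy(list(y)) - entropy(list(zip(x, y)))

def conditional_mutual_information(x, y, z):
    """Plug-in CMI as in (4): sum over strata of Z of p(z) * I(X, Y | Z = z)."""
    x, y, z = np.asarray(x), np.asarray(y), np.asarray(z)
    cmi = 0.0
    for zv in np.unique(z):
        mask = (z == zv)
        cmi += mask.mean() * mutual_information(x[mask], y[mask])
    return cmi

# Toy check: X and Y are associated only through Z, so MI > 0 while CMI is near 0.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=2000)
x = (z + (rng.random(2000) < 0.2)) % 2
y = (z + (rng.random(2000) < 0.2)) % 2
print(mutual_information(x, y), conditional_mutual_information(x, y, z))
```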
We consider a discrete class variable Y and p discrete features X_1, ..., X_p. Let X_S denote the subvector of features indexed by a subset S ⊆ {1, ..., p}. We employ here a greedy search for active features based on forward selection. Assume that S is the set of already chosen features, S^c its complement and j ∈ S^c a candidate feature. In each step we add the feature whose inclusion gives the most significant improvement of the mutual information, i.e. we find

argmax_{j∈S^c} [I(X_{S∪{j}}, Y) − I(X_S, Y)] = argmax_{j∈S^c} I(X_j, Y|X_S).  (7)

The equality in (7) follows from (5). Observe that (7) indicates that we select the feature that achieves the maximum association with the class given the already chosen features. For example, the first-order approximation of the score in (7) yields I(X_j, Y), which corresponds to the simple univariate filter MIM, frequently used as a pre-processing step in high-dimensional data analysis. However, this method suffers from serious drawbacks, as it takes into account neither possible interactions between features nor the redundancy of some features. When the second-order approximation is used, the dependence score for the candidate feature is

J_CIFE(X_j) = I(X_j, Y) − ∑_{s∈S} I(X_j, X_s) + ∑_{s∈S} I(X_j, X_s|Y) = I(X_j, Y) + ∑_{s∈S} II(X_j, X_s, Y).  (8)

The second equality uses (6). In the literature (8) is known as the CIFE (Conditional Infomax Feature Extraction) criterion [7]. Observe that in (8) we take into account not only the relevance of the candidate feature but also its possible interactions with the already selected features. However, it is frequently useful to scale down the corresponding terms [2]. Among such modifications the most popular is JMI,

J_JMI(X_j) = I(X_j, Y) − (1/|S|) ∑_{s∈S} [I(X_j, X_s) − I(X_j, X_s|Y)] = (1/|S|) ∑_{s∈S} I(X_j, Y|X_s),

where the second equality follows from (5) and (6). JMI was also proved to be an approximation of CMI under certain dependence assumptions [13]. A data-adaptive version of JMI will be considered in Sect. 4.

In [2] it is proposed to consider a general information-theoretic dependence measure

J_{β,γ}(X_j) = I(X_j, Y) − β ∑_{s∈S} I(X_j, X_s) + γ ∑_{s∈S} I(X_j, X_s|Y),  (9)

where β, γ are some non-negative constants, usually depending in a decreasing manner on the size |S| = k of the set S. Several frequently used selection criteria are special cases of (9). The mRMR criterion [11] corresponds to (β, γ) = (|S|^{−1}, 0), whereas the more general MIFS (Mutual Information Feature Selection) criterion [1] corresponds to a pair (β, 0). Obviously, the simplest criterion MIM corresponds to the pair (0, 0). CIFE defined above in (8) is obtained for the pair (1, 1), whereas (β, γ) = (1/|S|, 1/|S|) leads to JMI. In the following we consider asymptotic distributions of the sample version of J_{β,γ}(X_j), namely

Ĵ_{β,γ}(X_j) = Î(X_j, Y) − β ∑_{s∈S} Î(X_j, X_s) + γ ∑_{s∈S} Î(X_j, X_s|Y),  (10)

and show how the distribution depends on the underlying parameters. In this way we gain a clearer idea of the influence of β and γ on the behaviour of Ĵ_{β,γ}. The sample version in (10) is obtained by plugging fractions of observations instead of probabilities into (3) and (4). In the following we state our theoretical results, which concern asymptotic distributions of Ĵ_{β,γ}(X, Y|Z), where Z = (Z_1, ..., Z_{|S|}) is a possibly multivariate discrete vector; we then apply them to the previously introduced framework by putting X := X_j and Z := (X_1, ..., X_{|S|}). We will show that the distribution of Ĵ_{β,γ} is either approximately normal or, if the asymptotic variance vanishes, approximately equal to the distribution of a quadratic form in normal variables. Let p = (p(x, y, z))_{x,y,z} be the vector of probabilities for (X, Y, Z); we assume henceforth that p(x, y, z) > 0 for every triple (x, y, z) of values in the range of (X, Y, Z). Moreover, f(p) equals J_{β,γ}(X, Y|Z) treated as a function of p, Df denotes the derivative of f, and →_d denotes convergence in distribution. The special case of the result below for the CIFE criterion has been proved in [6].

Theorem 1. Let p̂ be the vector of observed fractions corresponding to p, V the asymptotic covariance matrix of √n(p̂ − p), H the second derivative of f at p, and σ²_J = Df(p)^T V Df(p). If σ²_J > 0, then √n(Ĵ_{β,γ}(X, Y|Z) − J_{β,γ}(X, Y|Z)) →_d N(0, σ²_J); if σ²_J = 0, then n(Ĵ_{β,γ}(X, Y|Z) − J_{β,γ}(X, Y|Z)) converges in distribution to a quadratic form in normal variables determined by V and H.

After some calculations one obtains an explicit formula for the derivative ∂f(p)/∂p(x, y, z); the remaining part of the proof relies on Taylor's formula for f(p̂) − f(p). Details are given in the supplemental material [5].
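As an illustration of (7) and (10), a plug-in computation of the criterion Ĵ_{β,γ} and of one forward-selection step might look as follows (Python sketch; `j_score` and `forward_step` are hypothetical names, and this is not the implementation released with the paper [5]):

```python
import numpy as np
from collections import Counter

def _H(*cols):
    """Plug-in joint entropy of the given discrete columns."""
    n = len(cols[0])
    p = np.array(list(Counter(zip(*cols)).values()), dtype=float) / n
    return float(-(p * np.log(p)).sum())

def _mi(x, y):
    return _H(x) + _H(y) - _H(x, y)

def _cmi(x, y, z):
    # I(X, Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)
    return _H(x, z) + _H(y, z) - _H(x, y, z) - _H(z)

def j_score(xj, y, selected, beta, gamma):
    """Plug-in J_{beta,gamma}(X_j) as in (10): relevance, minus weighted
    redundancy, plus weighted class-conditional redundancy."""
    score = _mi(xj, y)
    for xs in selected:
        score += -beta * _mi(xj, xs) + gamma * _cmi(xj, xs, y)
    return score

def forward_step(features, y, S, criterion="JMI"):
    """One forward-selection step as in (7): return the index of the candidate
    maximising the score. CIFE uses (beta, gamma) = (1, 1), JMI (1/|S|, 1/|S|)."""
    k = max(len(S), 1)
    beta, gamma = (1.0, 1.0) if criterion == "CIFE" else (1.0 / k, 1.0 / k)
    selected = [features[s] for s in S]
    candidates = [j for j in range(len(features)) if j not in S]
    return max(candidates, key=lambda j: j_score(features[j], y, selected, beta, gamma))
```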
We characterize the case σ²_J = 0 in more detail for binary Y and β = γ ≠ 0, which encompasses the CIFE and JMI criteria. Note that the binary Y case covers the important situation of distinguishing between cases (Y = 1) and controls (Y = 0). We define two scenarios:

- Scenario 1 (S1): X ⊥ Y|Z_s for any s ∈ S and X ⊥ Y (here X ⊥ Y|Z denotes conditional independence of X and Y given Z).

We study in detail the case when σ²_J = 0 and either β = γ ≠ 0 or at least one of the parameters β, γ equals 0. We note that all the information-based criteria used in practice fall into one of these categories [2]. We have

Theorem 2. Assume that σ²_J = 0 and β = γ ≠ 0. Then one of the above scenarios holds, with W defined in (14).

The analogous result can be stated for the case when at least one of the parameters β or γ equals 0 (details are given in the supplement [5]). We state below a corollary for the criterion JMI; note that in view of Theorem 2, Scenario 2 holds for JMI. The corollary describes the limiting behaviour of JMI explicitly in terms of the quantities V and H defined in Theorem 1. Note that σ²_JMI = 0 implies JMI = 0, as in this case Scenario 1 holds. The result for CIFE is analogous (see the supplemental material [5]).

In both cases we can infer the type of the limiting distribution whenever the corresponding theoretical value of the statistic is nonzero. Namely, if JMI ≠ 0 (CIFE ≠ 0), then σ²_JMI ≠ 0 (respectively, σ²_CIFE ≠ 0) and the limiting distribution is normal. Checking that JMI ≠ 0 is simpler than checking that CIFE ≠ 0, as it is already implied by the lack of conditional independence of X and Y given Z_s for at least one s ∈ S. Actually, JMI = 0 is equivalent to conditional independence of X and Y given Z_s for every s ∈ S, which in turn is equivalent to σ²_JMI = 0. In the next section we use the behaviour of the variance to decide which distribution to use as a benchmark for testing conditional independence. In a nutshell, the corresponding switch is constructed in a data-adaptive way and is based on the different orders of convergence of the variance to 0 in the two cases. This is exemplified in Fig. 1, which shows boxplots of the empirical variance of JMI multiplied by the sample size in two cases: when the theoretical variance is 0 (model M2 discussed below) and when it is not (model M1). The figure clearly indicates that the switch can be based on the behaviour of the variance.

In the following we use Ĵ = JMI as a test statistic for testing the conditional independence hypothesis

H_0: X ⊥ Y|X_S,  (16)

where X_S denotes the set of X_i with i ∈ S. A standard way of testing (16) is to use Conditional Mutual Information (CMI) as a test statistic together with its asymptotic distribution to construct the rejection region. However, it is widely known that such a test loses power when the size of the conditioning set grows, due to inadequate estimation of p(x, y|X_S = x_S) for all strata {X_S = x_S}. Here we use the test statistic Ĵ, which does not suffer from this drawback, as it involves conditional probabilities given only univariate strata {X_s = x_s} for s ∈ S. As the behaviour of Ĵ under (16) is dichotomous, we consider a data-dependent way of determining which of the two distributions, normal or the distribution of a quadratic form (abbreviated to d.q.f. further on), is closer to the distribution of Ĵ. We propose a switch based on the connection between the distribution of the statistic and its variance (see Theorem 1), and we consider the test based on JMI, as in this case σ²_JMI = 0 is equivalent to JMI = 0. Namely, it is seen from Theorem 1 that in the normal case the estimated asymptotic variance computed for samples of size n and n/2 should be approximately the same, whereas otherwise the estimate for the larger sample is strictly smaller. For each stratum {X_S = x_S} we permute the corresponding n_{x_S} values of X B times, and for each permutation we compute the value of JMI as well as an estimator v_n of its asymptotic variance. The permutation scheme is repeated for randomly chosen subsamples of the original sample of size n/2, and B values of v_{n/2} are calculated. We then compare the mean of v_n with the mean of v_{n/2} using a t-test. If the equality of the means is not rejected, we bet on normality of the asymptotic distribution; in the opposite case d.q.f. is chosen.
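A minimal sketch of this switch, assuming the B variance estimates for sample sizes n and n/2 have already been obtained from permuted samples (Python; `choose_benchmark` and `alpha_switch` are our own illustrative names):

```python
import numpy as np
from scipy import stats

def choose_benchmark(v_n, v_half, alpha_switch=0.05):
    """Decide between the normal and quadratic-form (d.q.f.) benchmarks.

    v_n, v_half: arrays of B variance estimates of JMI computed on permuted
    samples of size n and n/2.  Under asymptotic normality the two means
    agree; in the d.q.f. regime the variance decays faster with n, so the
    full-sample estimates are systematically smaller.
    """
    _, p_value = stats.ttest_ind(v_n, v_half, equal_var=False)
    return "normal" if p_value > alpha_switch else "dqf"

# Synthetic illustration: equal means -> the normal benchmark is chosen.
rng = np.random.default_rng(1)
print(choose_benchmark(rng.normal(1.0, 0.1, 50), rng.normal(1.0, 0.1, 50)))
```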
Note that by permuting the sample values of X for a given value X_S = x_S we generate data (X_perm, Y, X_S) which follow the null hypothesis (16) while keeping the distribution P_{X|X_S} = P_{X_perm|X_S} unchanged. In Fig. 2 we show that when the conditional independence hypothesis is satisfied, the distribution of the estimated variance σ̂²_JMI based on permuted samples closely follows the distribution of σ̂²_JMI based on independent samples. Thus, using the permutation scheme described above, we can indeed approximate the distribution of the variance of JMI under H_0 for a fixed conditional distribution P_{X|X_S}. We now approximate the sample distribution of JMI by N(μ̂, σ̂²) when the normal distribution has been picked; when d.q.f. has been picked, the approximation is χ²_μ̂ (with μ̂ being the empirical mean of JMI) or a scaled chi square α̂χ²_d̂ + β̂ whose parameters are matched to the first three empirical moments of the permuted samples [15]. The observed value of JMI is then compared with the corresponding quantile of the benchmark distribution, and conditional independence is rejected when this quantile is exceeded. Note that since a parametric permutation test is employed, we need a much smaller B than for a non-parametric permutation test, and we use B = 50. The algorithm will be denoted by JMI(norm/chi) or JMI(norm/chi scale), depending on whether the chi square or the scaled chi square is used in the switch. The pseudocode is given below as Algorithm 1 and the code itself is available in [5]. For comparison we consider two tests: the asymptotic test for CMI (called CMI) and the semi-parametric permutation test (called CMI(sp)) proposed in [12]. In CMI(sp) the permutation test is used to estimate the number of degrees of freedom of the reference chi square distribution.

Algorithm 1. JMI-based conditional independence test.
Input: training data D_0 = (X, Y, Z) of size n (Z with p columns), number of permutations B.
Let T(·, ·) denote the two-sample t-test statistic, p_T(·, ·) its p-value and F_{N(μ,σ)}(s) the theoretical distribution function of N(μ, σ).
1. Compute CRIT(X, Y, Z) on D_0.
2. For i = 1, ..., B: randomly permute X (on each stratum of Z) to obtain a permuted sample D_i; compute CRIT_i(X, Y, Z) and the variance estimate v_{n,i}.
3. For i = 1, ..., B: randomly permute X (on each stratum of Z) and randomly choose [n/2] observations to obtain a permuted sample D'_i; compute the variance estimate v_{n/2,i}.
4. If p_T((v_{n,i})_i, (v_{n/2,i})_i) does not indicate a difference of means, take F_{N(μ̂,σ̂)} fitted to (CRIT_i)_i as the benchmark distribution; otherwise fit the (scaled) chi square benchmark.
Output: p-value p of CRIT(X, Y, Z) under the benchmark distribution.
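The benchmark fitting and the resulting p-value could be sketched as follows (Python; the three-moment matching for the scaled chi square α̂χ²_d̂ + β̂ uses the standard mean, variance and third-central-moment equations, and the function names are our own):

```python
import numpy as np
from scipy import stats

def fit_scaled_chi2(samples):
    """Match alpha * chi2_d + beta to the first three empirical moments.

    For alpha * chi2_d + beta: mean = alpha * d + beta, var = 2 * alpha**2 * d,
    third central moment = 8 * alpha**3 * d.  Assumes positive skewness.
    """
    m, v = np.mean(samples), np.var(samples)
    mu3 = np.mean((samples - m) ** 3)
    alpha = mu3 / (4.0 * v)
    d = v / (2.0 * alpha ** 2)
    beta = m - alpha * d
    return alpha, d, beta

def parametric_permutation_p_value(observed, permuted_stats, benchmark="normal"):
    """p-value of the observed JMI under the benchmark fitted to permuted values."""
    permuted_stats = np.asarray(permuted_stats, dtype=float)
    if benchmark == "normal":
        mu, sigma = permuted_stats.mean(), permuted_stats.std()
        return 1.0 - stats.norm.cdf(observed, loc=mu, scale=sigma)
    if benchmark == "chi":
        # chi square with df equal to the empirical mean of the permuted JMIs
        return 1.0 - stats.chi2.cdf(observed, df=permuted_stats.mean())
    alpha, d, beta = fit_scaled_chi2(permuted_stats)   # scaled chi square
    return 1.0 - stats.chi2.cdf((observed - beta) / alpha, df=d)
```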
We investigate the behaviour of the proposed test in two generative tree models, shown in the left and right panels of Fig. 3, which will be called M1 and M2. Note that in model M1 the condition X_1^{(1)} ⊥ Y|(X_1, ..., X_k) holds, whereas in M2 the stronger condition X_1^{(1)} ⊥ (Y, X_1, ..., X_k) holds. We consider the performance of the JMI-based test for testing the hypothesis H_01: X_1^{(1)} ⊥ Y|X_1, ..., X_k when the sample size and the parameters of the model vary. As H_01 is satisfied in both models, this contributes to the analysis of the size of the test.

Observations in M1 are generated as follows: first, Y is drawn from the Bernoulli distribution with success probability P(Y = 1) = 0.5. Then (Z_1, ..., Z_k) is generated from N(0, Σ) given Y = 0 and from N(γ, Σ) given Y = 1, where the elements of Σ equal σ_ij = ρ^{|i−j|} and γ = (1, γ, ..., γ^{k−1})^T, with 0 ≤ ρ < 1 and 0 < γ ≤ 1 some chosen values. The Z values are then discretised to two values (0 and 1) to obtain X_1, ..., X_k. In the next step Z_1^{(1)} is generated from the conditional distribution N(X_1, 1) given X_1, and Z_1^{(1)} is then discretised to X_1^{(1)}. We note that this method of generation implies that Z_1^{(1)} and Y are conditionally independent given X_1, and the same is true for X_1^{(1)}. Observations in M2 are generated similarly, the only difference being that Z_1^{(1)} is now generated independently of (Y, X_1, ..., X_k).
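A sketch of the generating mechanism for M1 (Python; the text does not specify how the Gaussian variables are discretised, so the thresholds used below are our own assumptions):

```python
import numpy as np

def generate_m1(n, k, gamma=1.0, rho=0.0, rng=None):
    """Generate n observations from model M1 (sketch).

    Y ~ Bernoulli(0.5); (Z_1, ..., Z_k) ~ N(0, Sigma) given Y = 0 and
    N(gamma_vec, Sigma) given Y = 1, with sigma_ij = rho**|i-j| and
    gamma_vec = (1, gamma, ..., gamma**(k-1)); Z is discretised to X.
    Z1_extra ~ N(X_1, 1) is discretised to X1_extra, which is therefore
    conditionally independent of Y given X_1.
    """
    rng = rng or np.random.default_rng()
    y = rng.binomial(1, 0.5, size=n)
    gamma_vec = gamma ** np.arange(k)
    sigma = rho ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    z = rng.multivariate_normal(np.zeros(k), sigma, size=n) + np.outer(y, gamma_vec)
    x = (z > 0).astype(int)                  # discretisation threshold at 0 (our choice)
    z1_extra = rng.normal(loc=x[:, 0], scale=1.0)
    x1_extra = (z1_extra > 0.5).astype(int)  # threshold halfway between conditional means (our choice)
    return x1_extra, y, x

# Model M2 differs only in that z1_extra is drawn independently of (y, x):
# z1_extra = rng.normal(0.0, 1.0, size=n).
```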
We will also check the power of the tests in M1 for testing the hypotheses H_02: X_1^{(1)} ⊥ Y|X_2, ..., X_k and H_03: X_1 ⊥ Y|X_2, ..., X_k, as neither of them is satisfied in M1. Note, however, that since I(X_1^{(1)}, Y|X_2, ..., X_k) and I(X_1, Y|X_2, ..., X_k) decrease when k (or γ, ρ) increases, the task becomes more challenging for larger k (or γ, ρ, respectively), which results in a loss of power for large k when the sample size is fixed. Estimated test sizes and powers are based on N = 200 simulation repetitions.

We first check how the switch behaves for the JMI test while testing H_01 (see Fig. 4). In M1 for k = 1, as X_1^{(1)} ⊥ Y given X_1 and JMI = I(X_1^{(1)}, Y|X_1) = 0, the asymptotic distribution is a d.q.f. and we expect the switch to choose d.q.f., which indeed happens in almost 100% of cases. For k ≥ 2, JMI ≠ 0 and the asymptotic distribution is normal, which is reflected by the fact that the normal distribution is chosen with large probability. Note that this probability increases with n, as the summands Î(X_1^{(1)}, Y|X_i) of JMI for i ≥ 2 converge to normal distributions due to the Central Limit Theorem. The situation is even more clear-cut for M2, where JMI = 0 for all k and the switch should choose d.q.f.

Figure 5 shows the empirical sizes of the tests when the nominal size has been fixed at α = 0.05 and ρ = 0, γ = 1. We see that the empirical size is controlled fairly well by CMI(sp) and by the proposed methods, with the switch (norm/chi scale) working better than the switch (norm/chi). The superiority of the former is even more pronounced for 0 < γ < 1 and when X_1, ..., X_k are dependent (not shown). Note the erratic behaviour of the size of CMI, which significantly exceeds 0.1 for certain k and then drops to 0.

Figures 6 and 7 show the power of the considered methods for the hypotheses H_02 and H_03. It is seen that for γ = 1, ρ = 0 the expected decrease of power with respect to k is much more moderate for the proposed methods than for CMI and CMI(sp). JMI(norm/chi scale) works in most cases slightly better than JMI(norm/chi). For H_03 the power of CMI(sp) is similar to that of CMI, exceeding it for large k; however, it is significantly smaller than the power of both proposed methods. For H_03 the superiority of the JMI-based tests is visible only for large k when n is moderate (n = 500, 1000), whereas for H_02 it is also evident for small k. With varying ρ and γ the superiority of the proposed methods remains evident (see Fig. 7). Note that for fixed γ the power of all methods decreases when ρ increases.

Finally, we illustrate how the proposed test can be applied to Markov Blanket (MB, see e.g. [10]) discovery in Bayesian Networks (BN). The MB of a target Y is defined as the minimal set of predictors given which Y and the remaining predictors are conditionally independent [2]. We have used the JMI test (with the normal/scaled chi square switch) in the Grow and Shrink (GS, see e.g. [8]) algorithm for MB discovery and compared it with GS using CMI and CMI(sp). The GS algorithm finds a large set of potentially active features in the Grow phase and then whittles it down in the Shrink phase. In the real-data experiments we used another estimator of σ², equal to the empirical variance of the JMI values calculated for permuted samples, which behaved more robustly. The results were evaluated by the F measure (the harmonic mean of precision and recall). We have considered several benchmark BNs from the BN repository https://www.bnlearn.com/bnrepository (asia, cancer, child, earthquake, sachs, survey). For each of them Y has been chosen as the variable having the largest MB. The results are given in Table 1. It is seen that with respect to F, in the majority of cases the GS-JMI method either is the winner or ties with one of the other methods; a more detailed analysis indicates that this is due to its largest recall in comparison with GS-CMI and GS-CMI(sp) (see the supplement [5]). This agrees with our initial motivation for considering such a method, namely the lack of power (i.e. missing important variables) of CMI-based tests.

We have proposed a new test of conditional independence based on the approximation JMI of the conditional mutual information CMI and on its asymptotic distributions. We have shown using synthetic data that the introduced test is more powerful than tests based on the asymptotic or permutation distributions of CMI when the conditioning set is large. In our analysis of real data sets we have indicated that the proposed test used in the GS algorithm yields promising results for the MB discovery problem. A drawback of such a test is that it disregards interactions between the predictors and the target variable of order higher than three. Further research topics include a systematic study of Ĵ_{β,γ}, and especially of how its parameters influence the power of the associated tests and feature selection procedures. Moreover, studying tests based on an extended JMI including higher-order terms is worthwhile.

References
[1] Using mutual information for selecting features in supervised neural-net learning.
[2] Conditional likelihood maximisation: a unifying framework for information theoretic feature selection.
[3] Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing.
[4] An introduction to feature extraction.
[5] Distributions of a general reduced-order dependence measure and conditional independence testing: supplemental material.
[6] How to gain on power: novel conditional independence tests based on short expansion of conditional mutual information.
[7] Conditional infomax learning: an integrated framework for feature extraction and fusion.
[8] Bayesian network induction via local neighborhoods.
[9] Multivariate information transmission.
[10] Towards scalable and data efficient learning of Markov boundaries.
[11] Feature selection based on mutual information criteria of max-dependency, max-relevance and min-redundancy.
[12] Permutation testing improves Bayesian network learning.
[13] A review of feature selection methods based on mutual information.
[14] Data visualization and feature selection: new algorithms for nongaussian data.
[15] Approximate and asymptotic distributions of chi-squared type mixtures with applications.