authors: Wu, Genqiang; Xia, Xianyao; He, Yeping title: Information Theory of Data Privacy date: 2017-03-22

By combining Shannon's cryptography model with an assumption on the lower bound of an adversary's uncertainty about the queried dataset, we develop a secure Bayesian inference-based privacy model and thereby, to some extent, answer Dwork et al.'s question [1]: "why Bayesian risk factors are the right measure for privacy loss". The model ensures that an adversary can obtain only a little information about each individual from the model's output, provided the adversary's uncertainty about the queried dataset exceeds the lower bound. Importantly, the assumption on the lower bound almost always holds, especially for big datasets. Furthermore, the model is flexible enough to balance privacy and utility: four parameters are used to characterize the assumption, which yields many ways to trade off privacy against utility and to discuss the group privacy and composition privacy properties of the model.

Data privacy protection [2, 3, 4] studies how to query a dataset while preserving the privacy of the individuals whose sensitive information it contains. The crux of this field is to find a privacy protection model that provides good tradeoffs between privacy protection and data utility. The differential privacy model [5, 6] is currently the most important and popular privacy protection model. Dwork [2] illustrated differential privacy as follows: "differential privacy will ensure that the ability of an adversary to inflict harm (or good, for that matter) of any sort, to any set of people, should be essentially the same, independent of whether any individual opts in to, or opts out of, the dataset." In other words, differential privacy minimizes the increased risk to an individual's privacy incurred by that individual joining (or leaving) the dataset. This implies that differential privacy says little about the increased risk to an individual's privacy incurred by other individuals joining (or leaving) the dataset, which is problematic since other individuals' data may also be related to the individual's privacy. For example, in a dataset containing individuals' genomic data [7, 8], the joining of an old man's 100 descendants clearly increases the risk to the man's privacy. In this paper, we formally analyze the influence of other individuals joining the dataset on an individual's privacy and propose a solution. Our tool for analyzing this influence is derived from Shannon's perfect secrecy [9], whose computational-complexity relaxation is the famous semantic security [10], a fundamental concept in cryptography. Specifically, perfect secrecy ensures that the outputs (ciphertexts) of a cryptosystem contain no information about the inputs (plaintexts), i.e., no information about the inputs can be extracted by any adversary, and semantic security implies that whatever information is revealed cannot be extracted by probabilistic polynomial time (PPT) adversaries [10, 11] [12, p. 476]. To discuss the privacy problems more precisely, let us first review Shannon's theory of cryptography. In Shannon's theory [9, 11, 10], a cryptography model/system is defined as a set of (probabilistic) transformations of the plaintext universe into the ciphertext universe.
Each particular transformation of the set corresponds to enciphering with a particular key. The transformations are supposed reversible so that unique deciphering is possible when the key is known [10, 13] . For a plaintext X and a secret key K, let Y be the corresponding ciphertext. Consider X, K, Y as random variables, where the probability distributions of X, K, Y are the adversary's probabilities for the choices in question, and represent his knowledge of the situation. Then the mutual information I(X; Y ) [14] or the max-mutual information I ∞ (X; Y ), defined in Definition 3, will be a measure of information about X which the adversary obtains from Y . The perfect secrecy is defined as I ∞ (X; Y ) = 0 and the semantic security is defined as I ∞ (X; Y ) = O(1/m t ) [10, 11] . We now borrow the above Shannon's cryptography models to construct data privacy models. In a (data) privacy model/system there are n ≥ 1 individuals X 1 , . . . , X n . A dataset x := {x 1 , . . . , x n } is a (multi)set of records, where each x i is an assignment of X i . For a query f : D → R [15] , a privacy model is defined as a set of (probabilistic) transformations of the set D of possible datasets into the set R of possible query outputs Y . Each particular transformation of the set is called a (privacy) mechanism. Note that, being different from the cryptography models, a mechanism does not need to be reversible since there is no deciphering step in the privacy models. 4 This implies that the data consumer and the adversary are indistinguishable in their ways to extract information contained in Y . For the query f and the dataset x, the output Y is a probabilistic approximation of f (x), which implies that we can use the expected distortion between f (x) and Y to measure data utility, whose formal definition is deferred until Section 3.2. Note that a differential privacy mechanism is a special kind of privacy mechanisms defined as above, which is formally defined in Definition 2. Consider X i , Y as random variables, whose probability distributions are the adversary's probabilities for the choices in question, and represent his knowledge of the situation. Then the max-mutual information I ∞ (X i ; Y ) will be the amount of information about the individual X i which the adversary obtains from Y . Following the semantic security and the perfect secrecy, the setting I ∞ (X i ; Y ) ≤ ǫ with ǫ > 0 would be a reasonable choice as a privacy concept. One needs to be mentioned is that the "perfect privacy" [16, 17] , i.e., the setting I ∞ (X i ; Y ) = 0, is not practical since this will result in poor data utility even in the assumption of the PPT adversaries by the results in [6] . Due to technical reasons, the formal definition of the privacy concept is deferred until Section 3. One may find an interesting thing that we seem to pick up the semantic security that Dwork et al. had claimed to be impractical to privacy problems [6, 2] [18, Section 2.2]. We stress that Dwork [6] mainly proves that the "perfect privacy", i.e., the setting I ∞ (X i ; Y ) = 0, is impractical due to poor data utility (even in the assumption of the PPT adversary), but seldom claims that I ∞ (X i ; Y ) ≤ ǫ is impractical. In this paper, we will continue Dwork's work [6] to discuss whether I ∞ (X i ; Y ) ≤ ǫ is suitable to be a privacy concept, and accurately in what extent to be; that is, we will employ Shannon's theory to answer Dwork et al.'s question [1] : "why Bayesian risk factors are the right measure for privacy loss". 
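Before continuing, a small numerical illustration of these two measures may be helpful. The following sketch is purely illustrative (a hypothetical binary secret and a randomized-response channel chosen for exposition, not a construction used later): it computes both I(X; Y) and I_∞(X; Y) and shows that the max-mutual information dominates the mutual information, which is why bounding it is the stronger requirement.

```python
import numpy as np

# Toy illustration of the two information measures used in this paper: the
# mutual information I(X;Y) and the max-mutual information
# I_inf(X;Y) = max_{x,r} log( Pr[X=x | Y=r] / Pr[X=x] ).
# X is a single binary secret, Y a randomized-response report of it.

def mutual_information(joint):
    """I(X;Y) in nats for a joint pmf given as a 2-D array."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def max_mutual_information(joint):
    """I_inf(X;Y): the worst-case log posterior-to-prior ratio."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    posterior = joint / py                       # Pr[X=x | Y=r]
    ratios = posterior / px                      # Pr[X=x | Y=r] / Pr[X=x]
    return float(np.log(ratios[joint > 0].max()))

prior = np.array([0.7, 0.3])                     # adversary's knowledge of X
q = 0.75                                         # report the true bit w.p. q
channel = np.array([[q, 1 - q],
                    [1 - q, q]])                 # Pr[Y=r | X=x]
joint = prior[:, None] * channel                 # Pr[X=x, Y=r]

print("I(X;Y)     =", mutual_information(joint))
print("I_inf(X;Y) =", max_mutual_information(joint))   # always >= I(X;Y)
```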
We will also continue Dwork's work [6] to discuss the tight upper bound of I ∞ (X i ; Y ) for the differential privacy output Y , which is obviously important but is neglected by Dwork [6] and the related works [19, 20] . In fact, we have the following result. Note that (1) is implied when M satisfies ǫ-differential privacy by the group privacy property of differential privacy in Lemma 1. Therefore, Corollary 1 implies that ǫ-differential privacy mechanism M allows its output Y such that I ∞ (X i ; Y ) ≈ nǫ, which will disclose too much information about the individual X i so long as the number n of individuals is large enough, and which is our main motivation. For the ǫ-differential privacy mechanism M, one interesting thing in Corollary 1 is that the nǫ in (1), which is intended to be the maximal amount of disclosed information, or in other words the privacy budget [1] , to the group X 1 , . . . , X n of individuals by the theory of differential privacy, however, becomes the maximal amount of disclosed information to the individual X i . We will show in Proposition 4 that this is due to the other individuals' data also contains information of the individual X i . One needs to be emphasized is that it is reasonable to accept I ∞ (X i ; Y ) ≤ ǫ as one minimal requirement for any secure privacy mechanism. The reason is the same as that I ∞ (X; Y ) ≈ 0 is one mininal requirement for secure cryptography models since large I ∞ (X; Y ) must result in information disclosure of the plaintext X, which has been testified for more than 60 years. Definition 1 (The Knowledge of an Adversary). Let the random vector X := (X 1 , . . . , X n ) denote the uncertainty of an adversary to the queried dataset. Then X or its probability distribution is called the knowledge of the adversary. Note that, before this paper, there have been many Bayesian inference-based privacy models, such as [17, 21, 19, 20, 22] . These models share a common feature: they all restrict adversaries' knowledges. Many results, such as those in [19, 22] and Proposition 4 of this paper, show that this restriction is inevitable for better utility. Traditionally, it is direct to restrict adversaries to be PPT as in cryptography. However, the current studies in data privacy don't suggest this restriction since most current works in data privacy are not based on it [18, 23, 24, 25] . On the other hand, the current works to restrict adversaries' knowledges are almost no discussion on what are reasonable assumptions [17, 21, 19, 20, 22] . Note that the main obstacle to adopt these privacy models is that these models put restrictions to adversaries' knowledges but can't provide the reasonability of these restrictions. In this paper, our restriction to adversaries' knowledges is shown in Assumption 1. Assumption 1. Let b be a positive constant. Then, for any one adversary's knowledge X, there must be H(X) ≥ b, where H(X) is the entropy of X. We have the following evidences to support the reasonability of the restriction. 1. The maximal entropy max X H(X), in general, is huger in privacy models than in cryptography models. For example, to the AES-256 encryption model [13] , the adversary only needs to recover the 256 bits secret key in order to recover the information contained in the output Y and therefore it is reasonable to assume that H(X) can be very small or even zero since H(X) is at most 256 bits. 
However, to the Netflix Prize dataset [26] in data privacy, the adversary, in principle, needs to recover the whole dataset in order to recover the information contained in the output Y 5 and therefore it is reasonable to assume that H(X) is relatively large since the Netflix Prize dataset is large and then max X H(X) is at least larger than 100, 480, 507 bits, which is huge compared to 256 bits. 6 2. The long tail phenomenon 7 implies that there are too much "outlier data" in big dataset, which increases the uncertainty H(X). 3. Someone may doubt of the assumption since there are too much background knowledge in data privacy protection compared to in cryptography. For example, to the Netflix Prize dataset [26] , it is inevitable that there exists open data, such as the IMDb dataset, as the adversary's background knowledge. Our comment is that, when the dataset is large enough, such as the Netflix dataset, the background knowledge, such as the IMDb dataset, in general, can't have large part, such as over 50%, to be overlapped with the secret dataset. In fact, the Netflix Prize dataset has very small part to be overlapped with the IMDb dataset. Therefore, the entropy H(X) is still large for big dataset even though the diversity of background knowledges. 4. Theoretically, a dataset can be completely recovered by querying the dataset too many times as noted in [27, 28] [18, Chapter 8] ; that is, theoretically, the entropy H(X) can be very small or even zero [9, p. 659 ]. However, if we restrict the query times 8 and assume the dataset is big enough, we can ensure H(X) to be not too small. Due to the above evidences, it would be reasonable to adopt Assumption 1 as a reasonable restriction to adversaries' knowledges. Notice that Assumption 1 can achieve the idea of "crowd-blending privacy" (but with a way different from [30, 31] ), where each individual's privacy is related to other individuals' data; that is, if some other individuals' data is kept private, then Assumption 1 holds, which in turn ensure I(X i ; Y ) ≤ ǫ to be holding. This paper aims to provide some "mathematical underpinnings of formal privacy notions" [1] and tries to answer "why Bayesian risk factors are the right measure for privacy loss" [1] by employing Shannon's cryptography model and Assumption 1. Our contributions focus on studying how to control I ∞ (X i ; Y ) and related quantities based on Assumption 1. 1. We introduce Assumption 1 into privacy models. Compared to the restrictions to adversaries' knowledges in [17, 21, 19, 20, 22] , the restriction in Assumption 1 is much more reasonable and universally applicable, especially for big datasets. 2. Four parameters are developed to characterize Assumption 1, which makes it easy to control I ∞ (X i ; Y ) and to discuss utility. This part is our main contribution; many bounds of I ∞ (X i ; Y ) and of utility are obtained. 3. We formalize the group privacy, i.e., the privacy of a group of individuals, and the composition privacy, i.e., the privacy problem when multiple results are output, of the information privacy model. Several results are proved. The following part of this paper is organized as follows. Section 2 presents some preliminaries. Section 3 introduces the information privacy model and compares it with other privacy models. In Section 4 we discuss the tradeoffs between privacy and utility based on Assumption 1. Section 5 discusses how to preserve the privacy of a group of individuals. Section 6 discusses the privacy problem when multiple results are output. 
Section 7 gives other related works. Section 8 concludes the results. The notational conventions of this paper are summarized in Table 1 , of which some are borrowed from information theory [14] . This section provides mathematical settings of our model, where most materials contain many mathematical symbols and seem to be boring. However, we emphasize that these symbols are necessary to make the presentation clear and shorter. Therefore, the readers can skip these settings at a first reading and go back to consult them later where necessary. Let the random variables X 1 , . . . , X n denote n individuals. Let X i denote the record universe of X i . The probability distribution of X i denotes an adversary's knowledge about the individual X i 's record. A dataset is a collection (a multiset) of n records x 1 , . . . , x n , where x i ∈ X i denotes the assignment of X i . We differentiate a record sequence (x 1 , . . . , x n ) from a dataset {x 1 , . . . , x n } the record sequence corresponds to: the former has order among the records but the later does not. The universe of record sequences Z is defined as Z = {(x 1 , . . . , x n ) : x i ∈ X i , i ∈ [n]}. The universe of datasets D is defined as D = {{x 1 , . . . , x n } : x i ∈ X i , i ∈ [n]}. We remark that D is not a multiset, in which the same datasets are merged as one dataset. There may be multiple record sequences which correspond to a same dataset. We call the dataset {x 1 , . . . , x n } as the dataset of the record sequence (x 1 , . . . , x n ). For a dataset y ∈ D, let D y denote the set of all record sequences corresponding to the same dataset y. Set X = (X 1 , . . . , X n ). Set In this manner, X can also be considered as a D-valued random variable with the probability distribution F (y), y ∈ D. Let P denote the universe of probability distributions over Z (or over D). Note that, by letting all adversaries' knowledges be derived from a subset ∆ of P, we achieve a restriction to adversaries' knowledges. If the probability distribution of the random variable X is within ∆, we say that X is in ∆, denoted as X ∈ ∆. For a query function f , let R ⊇ {f (x) : x ∈ D} denote a set including all possible query results. Let P(R) denote the set of all the probability distributions on R. A mechanism M takes a record sequence x ∈ Z as input and outputs a random variable M(x) valued in R. Let Y be the random variable denoting the adversary's observation about the output. In this manner, for x ∈ Z and r ∈ R, we set In this paper, we abuse the notation M(x) as either denoting a probability distribution in P(R) or denoting a random variable following the probability distribution. Furthermore, for any x ∈ D, set M(y) ≡ M(z) for any two y, z ∈ D x . Therefore, for a dataset x ∈ D, we set M(x) := M(z) for z ∈ D x . In this paper, we append an empty record, denoted as ⊥, to each X i . In this setting, if x i = ⊥, it means that the individual X i does not generate record in the dataset For a dataset x ∈ D, we use the histogram representation x ∈N |X | to denote the dataset x, where the ith entry of x represents the number of elements in x of type i ∈ X [32, 18, 33] . Two datasets x, y ∈ D are said to be neighbors (or neighboring datasets) of distance k if x−y 1 = k. If k = 1, x, y are said to be neighbors (or neighboring datasets). Two record sequences x, x ′ ∈ Z are said to be neighbors (or neighboring record sequences) if their corresponding datasets are neighbors. 
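The following minimal sketch illustrates the histogram representation and the neighboring relation; the record universe and the datasets are hypothetical examples chosen only for exposition.

```python
from collections import Counter

# A dataset {x_1, ..., x_n} is encoded as a vector in N^{|X|} counting how
# many records of each type it contains, and two datasets are neighbours of
# distance k when the L1 distance between their histograms is k.

RECORD_UNIVERSE = ["flu", "cold", "healthy"]     # hypothetical record types

def histogram(dataset):
    counts = Counter(dataset)
    return [counts.get(t, 0) for t in RECORD_UNIVERSE]

def hist_distance(x, y):
    return sum(abs(a - b) for a, b in zip(histogram(x), histogram(y)))

x = ["flu", "cold", "healthy", "healthy"]
y = ["flu", "healthy", "healthy"]                # one individual opts out
z = ["flu", "flu", "healthy", "healthy"]         # one record changes type

print(histogram(x))                              # [1, 1, 2]
print(hist_distance(x, y))                       # 1 -> x and y are neighbours
print(hist_distance(x, z))                       # 2 -> neighbours of distance 2
```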
For notational simplicity, in the remainder of this paper we assume that Z and D are both discrete. Differential privacy characterizes how the output changes when one record in the dataset is changed; the latter change is captured by the notion of neighboring datasets. Definition 2 (ε-Differential Privacy [5, 6, 18]). Let the notations be as in Section 2.1. A mechanism M : D → P(R) gives ε-differential privacy if
max_{x,y∈D, r∈R: ||x−y||_1 ≤ 1} Pr[M(x) = r] / Pr[M(y) = r] ≤ exp(ε),
where the probability is taken over the randomness of M. Note that Definition 2 is the same as those in [34, 35], and is also equivalent to the definition of differential privacy in [5, 6, 18]. Differential privacy has the group privacy property, which ensures that the strength of the privacy guarantee drops linearly with the size of the group of individuals. Lemma 1 (Group Privacy [18]). Let M be an ε-differentially private mechanism. Then
max_{x,y∈D, r∈R: ||x−y||_1 ≤ s} Pr[M(x) = r] / Pr[M(y) = r] ≤ exp(sε).
The composition property of differential privacy implies that the strength of the privacy guarantee drops in a controllable way as the number of outputs about a dataset grows. Lemma 2 (Composition Privacy [18]). Let the mechanism M_i satisfy ε_i-differential privacy on R_i for i ∈ {1, . . . , s}. Then the composition mechanism M := (M_1, . . . , M_s) satisfies (Σ_{i=1}^s ε_i)-differential privacy. Definition 3 (Max-Mutual Information [36]). The max-mutual information of the random variables X, Y is defined as
I_∞(X; Y) := max_{x, r: Pr[X=x, Y=r] > 0} log ( Pr[X = x | Y = r] / Pr[X = x] ).
The max-mutual information dominates the mutual information, i.e., I(X; Y) ≤ I_∞(X; Y). Proof. By the definition of I(X; Y) [14], there is I(X; Y) = Σ_{x,r} Pr[X = x, Y = r] log ( Pr[X = x | Y = r] / Pr[X = x] ) ≤ I_∞(X; Y). The claim is proved. ⊓⊔ Now it is time to give the formal definition of our privacy concept. As discussed in Section 1, our privacy concept is to limit the amount of information about each individual X_i that the adversary obtains from the output Y, i.e., to control the value of the max-mutual information I_∞(X_i; Y) or the mutual information I(X; Y). For mathematical convenience, we only consider how to control the quantity I_∞(X_i; Y) in this paper. We formalize the discussion in Section 1 as the following definition. Definition 4 (ε-Information Privacy). Let ∆ ⊆ P. Let M : D → P(R) be a mechanism and let Y be the output random variable. The mechanism M satisfies ε-information privacy with respect to ∆ if for any X ∈ ∆ and i ∈ [n] there is
max_{x_i∈X_i, r∈R} Pr[X_i = x_i | Y = r] / Pr[X_i = x_i] ≤ exp(ε). (8)
Note that the inequality (8) is equivalent to
max_{x_i∈X_i, r∈R} Pr[Y = r | X_i = x_i] / Pr[Y = r] ≤ exp(ε). (10)
The parameter ∆ in the above definition is used to model adversaries' knowledge. In this paper, we mainly set ∆ to be
∆_b := {X ∈ P : H(X) ≥ b}, (11)
which will be discussed in Section 4. In information theory, the relative entropy is used to measure the distance between two probability distributions, and the mutual information is used to measure the amount of information that one random variable contains about another random variable [14]. The relative entropy of (X_i | Y = r) and X_i, denoted D((X_i | Y = r) ‖ X_i), and the mutual information of X_i and Y, i.e., I(X_i; Y), satisfy the following results. Proposition 1. Let the mechanism M satisfy ε-information privacy with respect to X and let Y be its output random variable. We have max_{r∈R} D((X_i | Y = r) ‖ X_i) ≤ ε and I(X_i; Y) ≤ ε. Proof. The proof is direct and is omitted here. Note that, analogously to Definition 4, we can also define ε-relative entropy privacy, i.e., max_r D((X_i | Y = r) ‖ X_i) ≤ ε, and ε-mutual information privacy, i.e., I(X_i; Y) ≤ ε. Furthermore, the paper [37] proposes a privacy concept called ε-inferential privacy, i.e.,
max_{x_i, x'_i ∈ X_i, r∈R} ( Pr[X_i = x_i | Y = r] / Pr[X_i = x'_i | Y = r] ) · ( Pr[X_i = x'_i] / Pr[X_i = x_i] ) ≤ exp(ε). (13)
Note also that the inequalities (1) and (2) in [19] are essentially equivalent to the inequality (13). We now discuss the relations among the above three privacy concepts and ε-information privacy. There are the following results. Proposition 2.
We have the following relation among the privacy concepts: ǫinferential privacy ⇒ a ǫ-information privacy ⇒ b ǫ-relative entropy privacy ⇒ c ǫ-mutual information privacy. Proof. The claim ⇒ a is due to the inequality The claim ⇒ b is due to Proposition 1. The claim ⇒ c is due to the equation The claims are proved. ⊓ ⊔ Proposition 2 shows that the four privacy concepts, ǫ-inferential privacy, ǫinformation privacy, ǫ-relative entropy privacy and ǫ-mutual information privacy, are in decreasing order in terms of their strength to protect privacy. One can choose any one of the four concepts as the privacy concept, of which the choosing criterion depends on the privacy level of demand. Proposition 3 (Data-Processing Inequality/Post-Processing). Assume the mechanism M : D → P(R) satisfies ǫ-information privacy with respect to ∆ and let Y be its output random variable. Let Z = g(Y ) and let R ′ = {g(r) : r ∈ R}. Then the composed mechanism g • M : D → P(R ′ ) satisfies ǫ-mutual information privacy with respect to ∆, where g • M(x) := g(M(x)) for x ∈ D. 9 Proof. Recall that ǫ-mutual information privacy of M is implied by its ǫ-information privacy by Proposition 2. Then the claim is a direct corollary of the dataprocessing inequality in [14, Theorem 2.8.1]. It is direct to define the personalized information privacy as the personalized differential privacy [38] . Definition 5 (ǫ-Personalized Information Privacy). The mechanism M satisfies ǫ-personalized information privacy with respect to ∆ if, for each X ∈ ∆ and each i ∈ [n], there is where ǫ = (ǫ 1 , . . . , ǫ n ). 9 Currently, we can't strengthen the result to be ǫ-information privacy since we can't prove the data-processing inequality to the max-mutual information I∞. In this section we consider how to set the parameter D (or Z) of the information privacy model. The setting of the parameter ∆ is deferred to Section 4. One needs to be emphasized is that the dataset universe D (or the record sequence universe Z) should be set carefully since D itself may leak individuals' privacy and result in tracing attacks [28] . In order to see the above result clearly, we consider the query function f (x) = x, x ∈ D as an example, which can be considered as the abstraction of data publishing function [39, 40, 41] . Note that the codomain of f is R = {f (x) : x ∈ D} = D. Both of the differential privacy model and the information privacy model employ randomized techniques to protect privacy: When the real dataset is x, in order to preserve privacy, a privacy mechanism first samples a dataset y ∈ D (according to a probability distribution) and then outputs f (y) ∈ R as the final query result of f . Or equivalently, the privacy mechanism directly samples a value r from the codomain R of f as the final query result. The major difference of the two models is that the probability distributions used to sample y or r are different. Assume that the individual X i 's record universe X i has no overlapped record with all other individuals' record universes. Then, finding a record r i ∈ X i within an output dataset x would strongly conclude the participation of the individual X i , which obviously is a successful tracing attack. Therefore, we should set appropriate D and therefore appropriate X i for i ∈ [n] such that the set D itself does not leak the participation of an individual. The privacy-oriented (but less utility-oriented) setting is to set X i = X for all i ∈ [n] as in [18, p. 227 ]. For the query f and the dataset universe D, let the set R ⊇ {f (x) : x ∈ D}. 
We equip a metric d over the set R [15] . That is, the parity (R, d) is a metric space. Note that the output M(x) of the mechanism M is a probabilistic approximation of f (x). Therefore, for two datasets x, y, if x−y 1 is large, the distance of the outputs M(x), M(y) being small would result in poor data utility. In most parts of this paper, the above utility measuring method can be used to measure the utility of mechanisms. However, for the completeness of this paper, we will present the formal definition of utility measure in (15) . Note that the two utility measure methods are consistent since the former will result in M(x), x ∈ D be more similar with the uniform probability distribution on R, which obviously raises the distortion of d(f (X), Y ). Let F o (x), x ∈ D denote the occurring probability distribution of the individuals X. Then the utility of the mechanism M is measured by the expected value of the distortion d(f (X), Y ), i.e., We (2), where the former is the factual occurring probabilities of datasets but the later denotes the knowledge of the adversary to datasets. The third quantity to measure the utility is I(X; Y ) or I ∞ (X; Y ), which is used to measure the information of the individuals X contained in the output Y . Note that large I(X; Y ) implies better utility of the mechanism since the output Y contains more information about X that the mining or learning algorithms can mine or learn. One motivation of this paper is to solve the weakness of the differential privacy model [5, 6] as shown in Corollary 1, which implies that the differential privacy model allows I ∞ (X i ; Y ) to be very large. Corollary 2, which is also appeared in [19, 22] , shows that the differential privacy model is equivalent to the information privacy model with respect to P 1 . Note that the setting P 1 is obviously less reasonable than the setting (11) . Therefore, the information privacy model with respect to (11) is more reasonable than the differential privacy model. As noted in Section 1, the models in [17, 21, 19, 20, 22] and the information privacy model all are the Bayesian inference-based models and restrict adversaries' knowledges; that is, they all employ a subset of P, like ∆ in this paper, to model adversaries' knowledges. The advantage of these models and the restrictions is clear: powerful both to model privacy problems and to balance privacy and utility. However, the disadvantage is also large: the restrictions seem to be unreasonable since there are many examples, where making such a restriction may quickly lead to a disastrous breach of privacy. We imagine that the first impressions of most readers to these privacy models in [17, 21, 19, 20, 22] are similar with ours: Compared to conciseness of the differential privacy model, these privacy models set too many kinds of ∆'s but none of these settings seems to be reasonable, which makes it hard to adopt these models. However, the rigorous analysis of the privacy problems by using Shannon's cryptography theory as in Section 1 makes us revisit these models, which results in the introduction of the parameter ∆ into the information privacy model. Of course, we also face the problem of how to find a reasonable ∆. Assumption 1 is our solution and the evidences in Section 1 show that it is reasonable, especially for big datasets. In the following sections of this paper we will present our results based on Assumption 1. 
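As a toy illustration of Assumption 1 and of the set ∆_b in (11), consider two hypothetical adversaries over n binary records: adversary A already knows all records but one, so its prior has low entropy and falls outside ∆_b, while adversary B is independently uncertain about every record and satisfies the assumption. The priors and the bound b below are illustrative choices, not values prescribed by our model.

```python
import itertools
import numpy as np

# Sketch of Assumption 1 and of the set Delta_b = {X : H(X) >= b} from (11).
# Adversary A knows records 2..n and is 50/50 about record 1; adversary B is
# independently uncertain (80/20) about every record.

def entropy_bits(pmf):
    p = np.array([v for v in pmf.values() if v > 0], dtype=float)
    return float(-(p * np.log2(p)).sum())

n = 10
known_tail = (1,) * (n - 1)
adversary_A = {(0,) + known_tail: 0.5, (1,) + known_tail: 0.5}
adversary_B = {bits: float(np.prod([0.8 if b else 0.2 for b in bits]))
               for bits in itertools.product((0, 1), repeat=n)}

b = 4.0                                          # hypothetical lower bound (bits)
for name, prior in [("A", adversary_A), ("B", adversary_B)]:
    h = entropy_bits(prior)
    print("adversary", name, ": H(X) =", round(h, 2), "bits;  X in Delta_b:", h >= b)
```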
Furthermore, the papers [17, 19, 42, 43] discuss the impact of previously released data or query results, called constraints, to the privacy guarantee. The information privacy model treat these constraints by using Assumption 1; this is, these constraints can be summarized as the adversary's knowledge to the queried dataset, and if these constraints can't result in the adversary's knowledge go out of the set ∆ in (11), then we can ensure the adversary can only obtain little information of each individual. Note that the above treatment to the constraints is similarly with the semantic security model in cryptography. The papers [44, 45, 34] employ either I(X; Y ) ≤ ǫ or I ∞ (X; Y ) ≤ ǫ to define privacy concepts. We stress that both of the above two inequalities will result in poor data utility. The reason is that I(X; Y ) or I ∞ (X; Y ) is just the amount of information of X contained in Y that the data consumer needs to mine since the data consumer is also a special kind of adversaries. In contrast, the inequalities I(X i ; Y ) ≤ ǫ, I ∞ (X i ; Y ) ≤ ǫ only restrict the information disclosure of each individual X i , which, in general, allows the quantities I(X; Y ) or I ∞ (X; Y ) to be large enough, so long as the number of individuals n is large enough. In this section, we consider how to set the parameter ∆ in Definition 4 in order to give appropriate privacy-utility tradeoffs, where ∆ denotes adversaries' knowledges. As noted in Section 3, the setting in (11) is a reasonable restriction to adversaries' knowledges. Before discussing the information privacy model based on this setting, we first discuss why we must restrict adversaries' knowledges. The following results show that the setting ∆ = P will result in poor utility. Proposition 4. The following three conditions are equivalent: Proof. The equivalence between the claim 1 and the claim 3 is due to with equality when the record sequence ( is just the record sequence satisfying the above maximality. The equivalence between the claim 2 and the claim 3 is due to with equality when the record sequence x ′ ∈ Z satisfies Pr[X = x ′ ] = 1, where x ′ is just the record sequence satisfying the above maximality. The proof is complete. The claim 2 of Proposition 4 shows that ǫ-information privacy with respect to P will result in poor utility since I ∞ (X; Y ) ≤ ǫ but I ∞ (X; Y ) denotes the information of X contained in Y , which is just the information the utility needs. Note also that the claim 3 of Proposition 4 shows that ǫ-information privacy with respect to P will result in two datasets even with distance n must have similar outputs, which obviously results in poor utility. Therefore, it is needed to restrict adversaries' knowledges for better utility. Now we discuss how to control the quantity I ∞ (X i ; Y ) with respect to (11) . We first formalize the reasons which make Assumption 1 hold. Note that with equality to the first inequality if and only if X 1 , . . . , X n are independent, and with equality to the second inequality if and only if each X i has uniform distribution over X i [14] . Therefore, there are mainly two reasons which make H(X) ≥ b: 1. The random variables X 1 , . . . , X n are not strongly dependent. 2. There exist some X i 's with H(X i ) > 0. Traditionally, we can use the mutual information I(X i ; X (i) ) and the entropy H(X i ) to characterize the above two reasons, respectively. However, for mathematical convenience, we develop four parameters to characterize them: 1. 
Use the parameter k to denote the maximal number of dependent random variables in X. 2. Use the parameter δ to denote the maximal dependent extent among the random variables X. 3. Use the parameter ℓ to denote the maximal number of random variables in X with H(X i ) > 0. 4. Use the parameter τ to characterize the minimal entropies of the above ℓ random variables. Subsequently, also for mathematical convenience, we will approximate the set ∆ b in (11) with a set τ ℓ P δ k , which is parameterized by the four parameters k, δ, ℓ, τ and will be defined later; that is, In the following parts of this section, we will explicitly define k, δ, ℓ, τ and then τ ℓ P δ k and discuss how to control I ∞ (X i ; Y ) based on them. Recall that the parameter k denotes the maximal number of dependent random variables in X, which is mainly motivated by the group privacy method in [46] to deal with the dependent problem and by the need to explain differential privacy using the information privacy model. Let P k be the largest subset of P such that, for any X ∈ P k , the maximal number of dependent random variables within X is at most k, where 1 ≤ k ≤ n. Formally, let where I ∈ [n] with |I| ≤ k, each x I ∈ X I , each x i ∈ X i for i ∈ [n]−I. Note that, in this manner, P equals P n and P 1 denotes the universe of probability distributions of the independent random variables X 1 , . . . , X n . We have the following result. Proof. Let X = (X i ,X (i) ,X (i) ) ∈ P k , whereX (i) ,X (i) denote the random variables in X which are independent to and dependent to X i , respectively. Let x (i) ,x (i) andX (i) ,X (i) denote one assignment and the record universe ofX (i) ,X (i) , respectively. "⇐" Assume the inequality (21) holds. For onex (i) , set where = a is due to the independence betweenX (i) and X i , and ≤ b is due to the inequality (21) and that there are at most k random variables in (X i ,X (i) ). "⇒" Assume M satisfies ǫ-information privacy with respect to P k . Without loss of generality, assume the two datasets (x 1 ,x (1) ,x (1) ), (x 1 ,x (1) ,x (1) ) ∈ D of distance ≤ k andr ∈ R satisfy max x,y∈D,r∈R: x−y 1≤k We construct the following probability distribution in P k . Set Pr[ by the first two lines of the equation (22) . Furthermore, since M satisfies ǫinformation privacy with respect P k , we have which gives (21) by the equation (23). The proof is complete. ⊓ ⊔ Note that P n = P. There are the following corollaries for P 1 and P. and therefore if and only if M satisfies ǫ-differential privacy. Corollary 2, which is also appeared in [19] and is in some extent equivalent to [22, Theorem 4.5, Theorem 4.8] , implies that the differential privacy model effectively controls I ∞ (X i ; Y ) when the adversary's knowledge X are independent random variables. Corollary 3 is equivalent to [19, Theorem 3.1] , which is also appeared in Proposition 4 and implies Corollary 1. It implies that the differential privacy model can't effectively control I ∞ (X i ; Y ) when the adversary's knowledge X are dependent random variables. Notice that there is a drawback when using Theorem 1 to balance privacy and utility: hard to set the value of k. This is because of that small k will result in bad privacy since, in general, this may result in the parameter b in Assumption 1 to be large, and then result in Assumption 1 doesn't hold, and that large k will obviously result in poor utility. 
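The contrast between Corollary 2 and Corollary 3 can be checked numerically on a toy instance; the counting query, the two-sided geometric noise (which satisfies ε-differential privacy) and the priors below are illustrative choices only. Under an independent prior the worst posterior ratio about X_1 stays below exp(ε), while under a fully dependent prior, where every record copies X_1's record, it approaches exp(nε).

```python
import itertools, math

# eps-DP mechanism: add two-sided geometric noise to the count of 1-records.
# We compare the worst posterior ratio about X_1 under an independent prior
# (Corollary 2 regime) and a fully dependent prior (Corollary 3 regime).

n, eps = 3, 0.5
alpha = math.exp(-eps)
noise = lambda k: (1 - alpha) / (1 + alpha) * alpha ** abs(k)
outputs = range(-25, 30)

def worst_ratio_about_X1(prior):                 # prior: record sequence -> prob
    worst = 0.0
    for v in (0, 1):
        p_x1 = sum(p for bits, p in prior.items() if bits[0] == v)
        for r in outputs:
            p_r = sum(p * noise(r - sum(bits)) for bits, p in prior.items())
            p_r_given = sum(p * noise(r - sum(bits))
                            for bits, p in prior.items() if bits[0] == v) / p_x1
            worst = max(worst, p_r_given / p_r)  # = Pr[X_1=v|Y=r] / Pr[X_1=v]
    return worst

p1 = 0.01                                        # Pr[X_1 = 1]
independent = {bits: (p1 if bits[0] else 1 - p1) * 0.5 ** (n - 1)
               for bits in itertools.product((0, 1), repeat=n)}
dependent = {(0,) * n: 1 - p1, (1,) * n: p1}     # X_1 = X_2 = ... = X_n

print("independent prior:", worst_ratio_about_X1(independent), "<= exp(eps)  =", math.exp(eps))
print("dependent prior  :", worst_ratio_about_X1(dependent), "-> exp(n*eps) =", math.exp(n * eps))
```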
Therefore, except the parameter k, there should be another one parameter δ to model the dependent extent among X, which is the task of Section 4.2. The parameter δ denotes the dependent extent among X, which is mainly motivated by the "correlated sensitivity" in [47] , the "dependence coefficient" in [48] and the "multiplicative influence matrix" in [37] . The dependence among the individuals is popular. For example, the spreading of the Black Death in the 14th century 10 and the spreading of the SARS coronavirus in 2002-2003 11 (if without effective controlling) show that people all over the world are dependent. Furthermore, the small world phenomenon [49, 50] also shows the dependence among people. However, the dependent extent of these relationships are low and therefore an adversary, in general, will have low dependent relationship knowledge. We now consider how to measure the dependent extent among X. Traditionally, it is appropriate to use I(X i ; X (i) ) to measure the dependent extent between X i and X (i) . However, for mathematical convenience, we develop a new quantity δ to measure it. Roughly speaking, I(X i ; X (i) ) uses to measure it. Note that the independence among X ensures that, for each X i , there are Pr[X (i) = x (i) ] = Pr[X (i) = x (i) |X i = x i ] for all x i ∈ X i and all x (i) ∈ X (i) . This implies that the weak dependence among X will result in Pr[X (i) = x (i) ] ≈ Pr[X (i) = x (i) |X i = x i ] for all x i ∈ X i and all x (i) ∈ X (i) , and then result in for any two records x i , x ′ i ∈ X i and all x (i) ∈ X (i) . By setting we can therefore use to measure the dependence extent between X i and X (i) . In this manner, we can use 10 https://en.wikipedia.org/wiki/Black_Death 11 https://en.wikipedia.org/wiki/SARS_coronavirus to measure the dependence extent among X; that is, if σ is small (≈ 0), the dependence extent among X would be weak. Note that 0 ≤ σ ≤ 1 since In the following part of this section, for notational simplicity, we set a denote the set of probability distributions satisfying σ ≤ exp(δ), where δ ∈ [−∞, 0] and then exp(δ) ∈ [0, 1]. Then smaller δ implies that X are low dependent if X ∈ P δ . Therefore, we can use δ to denote the dependence extent among X. We have the following results about ∆ = P δ . Theorem 2. Assume the mechanism M satisfy ǫ/n-differential privacy. Then for all X ∈ P δ . Proof. Let X ∈ P δ . For any i ∈ [n] and any where the inequality ≤ a is due to the fact that M satisfies ǫ/n-differential privacy, the inequality ≤ b is due to Lemma 3 and the group privacy property of M, and the inequality ≤ c is due to X ∈ P δ . The claim is proved. ⊓ ⊔ Theorem 3. Assume M satisfies ǫ-information privacy with respect to P δ . Then Proof. Assume there existī ∈ [n],r ∈ R,xī,x ′ ı ∈ Xī andx (ī) ,x ′ (ī) ,x ′′ (ī) ∈ X (ī) such that the left side of (33) equals We construct a probability distribution X ∈ P δ as follows. Set Pr[Xī =x ′ ı ] = 1, Then, by setting X to be the above probability distribution, we have that (34) equals , which ensures the inequality (33) by combining the ǫ-information privacy of M. ⊓ ⊔ Theorem 2 and Theorem 3 have very interesting connections with Corollary 2 and Corollary 3. Corollary 2 and Corollary 3 show the utilities of an ǫ-information privacy mechanism for the cases exp(δ) = 0 and exp(δ) = 1, respectively, whereas the left side of (33) is a (mediant-like 12 ) linear combination of the left sides of (26) and (27) with the weight exp(δ). 
Furthermore, since max x,x ′ ∈D,r∈R is equivalent to I ∞ (X i ; Y ) ≤ ǫ/n with respect to P which is equivalent to exp(δ) = 1, and (26) is equivalent to I ∞ (X i ; Y ) ≤ ǫ with respect to P 1 which is equivalent to exp(δ) = 0, the bound of I ∞ (X i ; Y ) in (36) is just the linear combination of the above two bounds of I ∞ (X i ; Y ) with the weight exp(δ). Therefore, Theorem 2 and Theorem 3 provide a (in some extent) sufficient and necessary condition, which is a tradeoff between the sufficient and necessary conditions in Corollary 2 and Corollary 3 with the weight exp(δ), and which shows how the parameter δ balances privacy and utility. By combining Theorem 1 and Theorem 2, we have the following result. Corollary 4. Assume the mechanism M satisfy ǫ/k-differential privacy. Then for all X ∈ P δ k , where P δ k = P k ∩ P δ . Proof. The proof of the theorem is the combination of Theorem 2 and the proof techniques of Theorem 1. ⊓ ⊔ The parameters ℓ, τ are motivated partially by the parameters "k, δ" in [21, Definition 2.4] and partially by the need and the works, such as [51, 30, 19, 20, 22] , to relax the differential privacy model to obtain better utility. We now discuss how to relax the differential privacy model, from which the parameters ℓ, τ are derived. By Corollary 2, M satisfies ǫ-differential privacy if and only if M satisfies ǫ-information privacy with respect to P 1 . Note that P 1 contains those probability distributions X such that H(X i ) = 0 for most or even all i ∈ [n]; that is, the adversary can know most or even every records in the dataset, which is a too strong assumption when the dataset is big enough as discussed in Section 1. Therefore, it is reasonable to assume that there exists a set I ⊂ [n] of individuals such that H(X i ) > 0 for i ∈ I. Formally, set τ ℓ P = X ∈ P : exp(−τ ) ≤ where x i ∈ X i , ℓ ∈ [n] and τ ≥ 0. Note that ensures H(X i ) ≥ log |X i | − τ . Then, by setting ∆ in Definition 4 to be τ ℓ P 1 := τ ℓ P ∩ P 1 , we can generate a relaxation to the differential privacy model. Set We now consider the case where ℓ = n − k. Specifically, let where Note that, for each X ∈ τ n−k P k , there is We have the following result. Theorem 4. For any r ∈ R, any i ∈ [n] and any I ′ ⊂ [n] \ {i} such that ≤ exp(ǫ), (44) then the mechanism M satisfies ǫ-information privacy with respect to τ n−k P k , where J = [n] \ ({i} ∪ I ′ ) and where, for each i ∈ I ′ , the X i satisfies (38) . Proof. Let X ∈ τ n−k P k . Without loss of generality, let |I ′ | = n − k − 1 such that, for each i ∈ I ′ , the random variable X i satisfies (38) . (45) The claim is proved. ⊓ ⊔ By setting τ = 0, we have the following result. Corollary 5. For any r ∈ R, any i ∈ [n] and any I ′ ⊂ [n] \ {i} such that then the mechanism M satisfies ǫ-information privacy with respect to 0 n−k P k , where J = [n] \ ({i} ∪ I ′ ) and where, for each i ∈ I ′ , the X i satisfies (38) where τ = 0. The inequalities (44) and (46) are two expectation-case relaxations of the worst-case inequality (21) . Of course, we must acknowledge that the results in Theorem 4 and Corollary 5 are somewhat weak. Currently, we are unable to further simplify the inequalities (44) and (46) since we face some complicated inequalities which are related to the generalized mediant inequalities 13 . We hope, in future, we can find new approaches to simplify them. 
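The following sketch uses a simple proxy for the dependence extent of Section 4.2 rather than the exact quantity σ: a total-variation-style measure that lies in [0, 1], vanishes for independent records, and grows as the conditional distribution of X_(i) shifts when X_i's record changes. The proxy and the joint distributions are illustrative only.

```python
import numpy as np

# A stand-in dependence measure between X_1 and X_2: the worst total-variation
# shift of Pr[X_2 = . | X_1 = x_1] when x_1 changes.  It is 0 for independent
# records and approaches 1 for strongly dependent ones.

def dependence_extent(joint):
    """joint[x1, x2] = Pr[X_1 = x1, X_2 = x2]; returns a value in [0, 1]."""
    cond = joint / joint.sum(axis=1, keepdims=True)   # Pr[X_2 = . | X_1 = x1]
    worst = 0.0
    for a in range(joint.shape[0]):
        for b in range(joint.shape[0]):
            worst = max(worst, 0.5 * np.abs(cond[a] - cond[b]).sum())
    return worst

independent = np.outer([0.4, 0.6], [0.2, 0.8])
weakly_dep = np.array([[0.10, 0.38],
                       [0.12, 0.40]])
strongly_dep = np.array([[0.48, 0.02],
                         [0.02, 0.48]])

for name, j in [("independent", independent), ("weak", weakly_dep), ("strong", strongly_dep)]:
    print(name, dependence_extent(j))
```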
The idea of Section 4 is to first discuss the tradeoffs of privacy and utility based on P k , P δ and τ ℓ P individually, and then synthesize these results as those based on τ ℓ P δ k , where τ ℓ P δ k := P δ k ∩ τ ℓ P = P k ∩ P δ ∩ τ ℓ P. The results of Section 4 show that the "divide and conquer" approach works. We stress that our final aim is to discuss how to control I ∞ (X i ; Y ) based on Note that k, δ are the quantities to substitute for the mutual information to measure the dependent extent among X. Currently, we are unable to know the quantitative relation between k, δ and the mutual information I(X i ; X (i) ), which results in that we are unable to know the quantitative relation between k, δ, ℓ, τ and b. Nevertheless, at least qualitatively, we can find suitable k, δ, ℓ, τ to let (48) hold. Clearly, setting ∆ to be τ ℓ P δ k would be more reasonable and flexible than to be P 1 , which implies that, theoretically, ǫ-information privacy with respect to τ ℓ P δ k will achieve more privacy and utility than ǫ-differential privacy by Corollary 2. Proposition 4 shows the badness of big datasets to data privacy; that is, when the number n of individuals increases, the utility of data must be bad in order to satisfy information privacy. Conversely, Assumption 1 shows the goodness of big datasets to data privacy; that is, when the number n of individuals increases, the lower bound b of the uncertainty H(X)'s of adversaries in the set increases accordingly, which provides us opportunities to improve the utility of data. Explicitly, when n increases, the parameter b increases, which results in that the parameters δ, ℓ increase, the parameter τ decreases and the parameter k increase (but slowly than n), which provides us opportunities to improve the utility of data by using the results in Section 4. This in some extent implies that the information privacy model achieves the so called "crowd-blending privacy" [30] , but of a flavor different from [30] ; that is, an individual's privacy is "blended" with the adversaries' uncertainty to other individuals' data. Computational Complexity Relaxation Note that the perfect secrecy can be considered as a special case of the information privacy by setting n = 1, ∆ = P, ǫ = 0. Similarly, the semantic security [10, 11] can also be considered as a special case of the information privacy, roughly, by setting n = 1, ǫ = O(1/ log t |X |) and ∆ = P ppt , where P ppt is the subset of P that the PPT adversaries can evaluate. Also, it is direct to define "computational" information privacy similar as in [11, 52] , just by setting and ǫ = ǫ + O(1/(n log |X |) t ). Note that the zero-knowledge privacy model [53] is essentially equivalent to the information privacy with respect to P ppt . One important thing is to discuss the information privacy with respect to Noticing that there have been many works on "computational" differential privacy [52, 54, 55, 56, 57, 58, 59, 60, 61] , the "computational" information privacy with respect to τ ℓ P δ k,ppt would be one interesting future work. The group privacy problem is to study how to preserve the privacy of a group of individuals. Let I = {i 1 , . . . , i s } ⊆ [n] and X I = (X i1 , . . . , X is ). The group privacy of the group of individuals X I is to let the mutual information I(X I ; Y ) or the max-mutual information I ∞ (X I ; Y ) be controllable. 
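As a numerical illustration of the group quantity I_∞(X_I; Y), the sketch below reuses the toy counting mechanism with geometric noise and independent fair-coin records (illustrative choices only): the leakage about a single individual stays below ε, while the leakage about a group of two individuals can approach 2ε, which is the scaling behind Lemma 1 and Theorem 5 below.

```python
import itertools, math

# Group max-mutual information on a toy eps-DP counting mechanism with
# two-sided geometric noise and independent fair-coin records.

n, eps = 4, 0.4
alpha = math.exp(-eps)
noise = lambda k: (1 - alpha) / (1 + alpha) * alpha ** abs(k)
outputs = range(-30, 35)
sequences = list(itertools.product((0, 1), repeat=n))
prior = {bits: 0.5 ** n for bits in sequences}
p_out = {r: sum(prior[b] * noise(r - sum(b)) for b in sequences) for r in outputs}

def group_max_mutual_info(group):
    """I_inf(X_I; Y) = log max_{x_I, r} Pr[Y = r | X_I = x_I] / Pr[Y = r]."""
    worst = 0.0
    for assignment in itertools.product((0, 1), repeat=len(group)):
        member = [b for b in sequences
                  if all(b[i] == v for i, v in zip(group, assignment))]
        p_assignment = sum(prior[b] for b in member)
        for r in outputs:
            p_r_given = sum(prior[b] * noise(r - sum(b)) for b in member) / p_assignment
            worst = max(worst, p_r_given / p_out[r])
    return math.log(worst)

print("I_inf(X_1; Y)        =", group_max_mutual_info([0]), " (eps  =", eps, ")")
print("I_inf((X_1, X_2); Y) =", group_max_mutual_info([0, 1]), " (2eps =", 2 * eps, ")")
```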
Differential privacy has the good property that a mechanism satisfying ǫdifferential privacy will ensure to satisfy 1-group differential privacy as shown in Lemma 1, which implies that ǫ-information privacy with respect to P 1 implies 1group information privacy with respect to P 1 by Corollary 2. We now generalize this result to P k . Theorem 5. Assume M satisfies ǫ-information privacy with respect to P k . Then M satisfies 1-group information privacy with respect to P k , where k ∈ [n]. Proof. Let X ∈ P k . By using the proving techniques in Theroem 1, we have where ≤ a is due to (21) . The claim is proved. ⊓ ⊔ Currently, we lack some techniques to prove the group privacy properties for P δ , τ ℓ P and then τ ℓ P δ k . However, we believe they are true, whose proofs would be one future work. The composition privacy problem is to study how to guarantee privacy while multiple datasets or multiple query results are output. There are two kinds of scenarios. First, multiple query results of one dataset are output. We call this kind of scenario as the basic composition privacy problem. To differential privacy, the privacy problem of this scenario is treated by the composition privacy property [39, 41, 62] as shown in Lemma 2. Second, multiple query results of multiple datasets generated by the same group of individuals are output, respectively. We call this knid of scenario as the general composition privacy problem. For example, the independent data publications of data of the Netflix and the IMDb [26] , the independent data publications of the online and offline data [63] , and the independent data publications of the voter registration data and the medical data [31] . For each of the above applications, the composition attack [64, 26] techniques may employ the relationship between/among different datasets/queries to infer the privacy of individuals whose data is contained in these datasets. The basic composition privacy problem, i.e., the privacy problem of multiple queries of a dataset, can be modeled as follows. For the sources X = (X 1 , . . . , X n ) and the s query outputs Y 1 , . . . , Y s , the composition privacy is to let Definition 7 (Basic Composition Privacy). Assume M i satisfies ǫ i -information privacy with respect to ∆ and let Y i be its output random variable, i ∈ [s]. Then the composition mechanism M, which is defined as is said to satisfy c-basic composition information privacy with respect to ∆ if, for each X ∈ ∆, there are where Y = (Y 1 , . . . , Y s ), c is a positive constant, and for r = (r 1 , . . . , r s ). Note that, by combining Lemma 2 with Corollary 2, we have that ǫ-information privacy with respect to P 1 implies 1-basic composition information privacy with respect to P 1 . We now generalize this result to P k . Theorem 6. Let M be as shown in Definition 7 and let ∆ = P k . Then M satisfies 1-basic composition privacy with respect to P k . Proof. Let X ∈ P k . By using the proving techniques in Theroem 1, we have where ≤ a is due to (54) . The claim is proved. ⊓ ⊔ In this section, we discuss the general composition privacy problem. We remark that this problem is different from the basic composition privacy problem in Section 6.1, where the former is to output different privacy-preserving results of the different datasets generated by a same group of individuals but the later is to output different privacy-preserving results of the same dataset. Except some simple discussions in [19, Section 9.1] , this is an almost unexplored problem in privacy protection. 
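Before formalizing the general problem, the following toy check illustrates the basic composition guarantee of Definition 7: s randomized-response mechanisms, each ε_i-differentially private, are run independently on the same binary secret, and the worst posterior ratio about the secret after observing all s outputs stays below exp(Σ_i ε_i). The channels and the prior are illustrative choices only.

```python
import numpy as np

# Basic composition on a single binary secret: the posterior-to-prior ratio
# after s independent randomized-response outputs is bounded by exp(sum eps_i).

def rr_channel(eps):
    q = np.exp(eps) / (1 + np.exp(eps))          # eps-DP randomized response
    return np.array([[q, 1 - q], [1 - q, q]])

eps_list = [0.3, 0.5, 0.2]
prior = np.array([0.4, 0.6])
channels = [rr_channel(e) for e in eps_list]

worst = 0.0
for outputs in np.ndindex(*(2,) * len(channels)):
    # Pr[Y_1..Y_s = r | X = x] factorises because the mechanisms are independent
    lik = np.array([np.prod([ch[x, r] for ch, r in zip(channels, outputs)])
                    for x in (0, 1)])
    p_r = float(prior @ lik)
    worst = max(worst, float(np.max(lik / p_r)))   # = max_x Pr[x | r] / Pr[x]

print("log worst ratio:", np.log(worst), "<= sum eps =", sum(eps_list))
```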
In this scenario, an individual X i should be represented by a stochastic process X i := {X t i : t ∈ T } (but not a random variable as in former sections). The output Y also should be represented by a stochastic process Y := {Y t : t ∈ T }. In this setting, we need to control the value of the mutual information That is, the outputs {Y t : t ∈ T } should contain little information of each individual {X t i : t ∈ T } that the adversary can obtain. For the information privacy, we need to control the quantity which is formalized as the following definition. Definition 8 (General Composition Privacy/Privacy for Stochastic Processes). Let X := (X 1 , . . . , X n ) be the sources, where each X i := {X t i : t ∈ T } is a stochastic process. Let P be the universe of probability distributions of X and let let ∆ ⊆ P. Let P {t} be the universe of probability distributions of X t := (X t 1 , . . . , X t n ) and let ∆ t ⊆ P {t} . Set x = (x 1 , . . . , x n ), where x i := {x t i : t ∈ T } with x t i ∈ X t i . Set X (i) = (X 1 , . . . , X i−1 , X i+1 , . . . , X n ) and x (i) = (x 1 , . . . , x i−1 , x i+1 , . . . , x n ). Set X i = t∈T X t i , X (i) = j∈[n]\{i} X j . Let M be a mechanism and let Y := {Y t : t ∈ T } be the output stochastic process valued in a domain R := t∈T R t , where, for each r := {r t : t ∈ T } ∈ R, there are r t ∈ R t , t ∈ T . For each x = {x t : t ∈ T } ∈ n i=1 X i , a mechanism M := {M t : t ∈ T } is defined as Assume each M t satisfies ǫ t -information privacy with respect to ∆ t , t ∈ T . Then the mechanism M satisfies c-general composition privacy with respect to ∆ if, for each X ∈ ∆, there are where c is a positive constant, and We now show that the basic composition privacy problem in Definition 7 is a special case of the general composition privacy problem in Definition 8 where, for each stochastic process X i := {X t i : t ∈ T }, the random variables X t i , t ∈ T are all equal. 14 The result is shown in the following proposition. Proposition 5. Let the notations be as shown in Definition 8. Assume, for each i ∈ [n], the random variables X t i , t ∈ T in the stochastic process X i := {X t i : t ∈ T } are all equal. Then, for any x 1 i and any r, there is since the random variables X 1 i , . . . , X The equation (59) follows by the above two results. The equation is an immediate corollary of the equation (59) . ⊓ ⊔ Theorem 7. Let M be as shown in Definition 8. For each t ∈ T , let M t satisfy ǫ t -information privacy with respect to P {t} k , which is similarly defined as in (20) . Then M satisfies 1-general composition privacy with respect to P k . Proof. The proof is similar with the proof of Theorem 6. ⊓ ⊔ 14 The definition of the equality of random variables please see https://en.wikipedia.org/wiki/Random_variable Independent Applications Scenario For some applications, the assumption of the independence among different applications are reasonable. For example, for a group of individuals, their shopping data in Amazon would be independent (or less dependent) to their research data in DBLP, or their health data would be independent (or less dependent) to their movie rating data in IMDb. Therefore, for the above applications, it is in some extent reasonable to assume that an adversary's knowledge is only limited to be the independent relationship of different datasets. This setting can be modeled as that the |T | random vectors (X 1 , Y 1 ), . . . , (X |T | , Y |T | ) are independent, where X j = (X j 1 , . . . , X j n ). Proposition 6. 
Let the notations be as shown in Definition 8. Assume the |T | random vectors (X 1 , Y 1 ), . . . , (X |T | , Y |T | ) are mutually independent. Then, for each i ∈ [n], any x i ∈ X i and any r ∈ R, there are and (62) is due to the independence of the |T | random vectors Note that, in Proposition 6, for each t, the random variables X t 1 , . . . , X t n , Y t do not need to be mutually independent. The information privacy model for stochastic processes in Definition 8 is powerful to model the privacy problems of many complicated application scenarios, such as those applications in the start of Section 6. For each of these applications scenarios, the composition attack [64, 26] technique may employ the relation between/among different datasets to infer the privacy of individuals whose data is contained in these datasets. Definition 8 accurately models an adversary's knowledge about relationship among the datasets generated by the individuals and our idea is to set In this manner, the information privacy model would be immune to the composition attack. Currently, we lack some techniques to prove the composition privacy properties for P δ , τ ℓ P and then τ ℓ P δ k . However, we believe they are true, whose proofs would be one future work. Furthermore, Definition 8 is also suitable to model the privacy problems of the streaming data [65] , the set-valued data [41] and the trajectory data [66] applications. Notice that, when modeling these application scenarios, the submechanisms M 1 , . . . , M |T | may be dependent ; that is, the equation (58) would not hold. (Of course, these application scenarios are more suitable to be modeled by Definition 4 where both each X i and Y are a stochastic process.) Data privacy protection has a long history [67, 4] and has been developing rapidly for the last decade [3, 2, 24, 25] . We now briefly summarize other related works. The k-anonymity model [31] is the first privacy model that obtains extensive study. To a dataset, its main idea is to generalize the identifiers and the semi-identifier attributes to ensure at least k records has the same identifiers and the semi-identifiers. However, the k-anonymity model does not change the sensitive attributes in order to preserve data utility. Later, researchers find that the sensitive attributes themself disclose privacy. This urges the development of many variants of the k-anonymity model, such as the ℓ-diversity [68] and tcloseness [69] among others. However, these variants still have many drawbacks. Therefore, a more rigorous privacy model is needed. There are a lot of works to adapt differential privacy to be resistant to dependent relationship attacks. The paper [70] is believed to be the first to point out that differential privacy is vulnerable to the dependent relationship attack. The paper [46] uses the group privacy property of differential privacy (Lemma 1) to deal with the dependent relationship attack. Explicitly, if there are at most k sources are dependent, one can alleviate the influence of dependent sources to the privacy guarantee of differential privacy by achieving ǫ/k-differential privacy or, equivalently, by multiplying k to the global sensitivity of the query function. This treatment is similar with the result of Theorem 1 and motivates the parameter k in Section 4.1. However, since large k will result in poor utility as discussed in the last part of Section 4.1, the paper [47] and the paper [48] introduce the notions of "correlated sensitivity" and "dependence coefficient", respectively. 
The two notions can be explained as introducing a dependent coefficient (which is much less than 1 in general) between/among individuals to decrease the raising speed of the global sensitivity of the query function. The two notions motivate the parameter δ in Section 4.2. Although the two notions can add less noise than the group privacy method, the privacy guarantee of the two methods have less theoretical foundation, whereas the result in Theorem 2 achieves similar aim as the above methods but with strong privacy guarantee as shown in Proposition 1. The paper [71] is an application of the Pufferfish model and designs some mechanisms. The paper [35] relates differential privacy with the conditional mutual information I(X i ; Y |X (i) ), where I(X i ; Y |X (i) ) ≤ ǫ is proved to be weaker than ǫ-differential privacy but stronger than (ǫ, δ)-differential privacy. By combining the above result with Corollary 2, we have that the quantity I(X i ; Y |X (i) ) ≤ ǫ can't resist the dependent relation attack. By Proposition 2, the ǫ-inferential privacy model in [37] can be considered as a special case of our model. Furthermore, the method to measure dependence extent in Corollary 4 is more simpler than the corresponding one in [37] since the latter needs to compute complicated matrix operations, such as the matrix inverse. The paper [72] relates the utility function in Exponential mechanism to the rate distortion function and then discusses the relation between information leakage and privacy. Another kind of work for treating differential privacy via information theory is to measure the bound of noise complexity of differential privacy output [36, 73, 54] . Our model uses the relative entropy D(X i (X i |Y = r)) and the mutual information I(X i ; Y ) to treat the dependent sources problem of differential privacy. The results in Proposition 1 show that the information privacy model can ensure individual information disclosure to be upper bounded by a small value ǫ. The outlier privacy model [51] tries to reduce the influence of the outlier records to the model's output. Conversely, the information privacy model tries to "utilize" the outlier records; specifically, in general, the more outlier records in the queried dataset, the more larger of H(X) and then the more larger of b in Assumption 1, which provides more opportunities to improve utility. The main obstacle to adopt Bayesian inference-based privacy models is that these models put restrictions to adversaries' knowledges but can't provide the reasonability of these restrictions. This paper shows that Assumption 1 is a very reasonable restriction to adversaries' knowledges and simultaneously allows flexible approaches to balance privacy and utility. Of course, we must acknowledge that, even though there are many reasonable evidences, Assumption 1 is really not as stronger as the hardness assumptions in cryptography; the latter are founded on the computational complexity theory [74, 11, 10] but the former seems can't. This reminds us of Edmonds' remark [75] : "It would be unfortunate for any rigid criterion to inhibit the practical development of algorithms which are either not known or known not to conform nicely to the criterion." Therefore, maybe, we should allow the existence of Assumption 1 due to its usefulness in data privacy even though it doesn't conform nicely to the computational complexity criterion. Furthermore, this paper leaves many unsolved problems. First, the utility bounds about the parameter ℓ, τ need to be further explored. 
The outlier privacy model [51] tries to reduce the influence of outlier records on the model's output. Conversely, the information privacy model tries to "utilize" outlier records: in general, the more outlier records the queried dataset contains, the larger H(X) is and hence the larger b in Assumption 1 is, which provides more opportunities to improve utility.

The main obstacle to adopting Bayesian inference-based privacy models is that these models place restrictions on adversaries' knowledge but cannot justify those restrictions. This paper shows that Assumption 1 is a very reasonable restriction on adversaries' knowledge and, at the same time, allows flexible approaches to balancing privacy and utility. Of course, we must acknowledge that, even though there is considerable evidence in its favor, Assumption 1 is not as strong as the hardness assumptions in cryptography; the latter are founded on computational complexity theory [74, 11, 10], whereas the former does not seem to be. This reminds us of Edmonds' remark [75]: "It would be unfortunate for any rigid criterion to inhibit the practical development of algorithms which are either not known or known not to conform nicely to the criterion." Therefore, perhaps we should accept Assumption 1 for its usefulness in data privacy even though it does not conform nicely to the computational complexity criterion.

Furthermore, this paper leaves many unsolved problems. First, the utility bounds with respect to the parameters ℓ and τ need to be explored further. Second, we only prove the group privacy and composition privacy properties with respect to the parameter k; the two properties with respect to the parameters δ, ℓ and τ also need to be explored. Third, how to control I(X_i; Y) and max_{r∈R} D((X_i | Y = r) ‖ X_i) is another urgent direction for future work, since it would provide more choices for balancing privacy and utility. Fourth, computational information privacy with respect to ∆ in (51) is another interesting direction for future work.

References

Privacy-preserving data analysis for the federal statistical agencies
A firm foundation for private data analysis
Privacy-preserving data publishing: A survey of recent developments
Privacy-Preserving Data Mining: Models and Algorithms
Calibrating noise to sensitivity in private data analysis
Differential privacy
Membership privacy in microRNA-based studies
Robust traceability from trace amounts
Communication theory of secrecy systems
Probabilistic encryption
Theory and applications of trapdoor functions (extended abstract)
The Foundations of Cryptography
The Design of Rijndael: AES - The Advanced Encryption Standard (Information Security and Cryptography)
Elements of information theory
Analytic theory to differential privacy
Towards a methodology for statistical disclosure control
A formal analysis of information disclosure in data exchange
The algorithmic foundations of differential privacy
Pufferfish: A framework for mathematical privacy definitions
Coupled-worlds privacy: Exploiting adversarial uncertainty in statistical data privacy
Relationship privacy: Output perturbation for queries with joins
Membership privacy: A unifying framework for privacy definitions
Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data
Differentially private data publishing and analysis: A survey
Robust de-anonymization of large sparse datasets
Revealing information while preserving privacy
Exposed! A survey of attacks on private data
Research progress in the complexity theory and algorithms of big-data computation (in Chinese)
Crowd-blending privacy
k-anonymity: A model for protecting privacy
The geometry of differential privacy: The sparse and approximate cases
The matrix mechanism: Optimizing linear counting queries under differential privacy
On the relation between identifiability, differential privacy, and mutual-information privacy
Differential privacy as a mutual information constraint
Max-information, differential privacy, and post-selection hypothesis testing
Inferential privacy guarantees for differentially private mechanisms
Conservative or liberal? Personalized differential privacy
PrivTree: A differentially private algorithm for hierarchical decompositions
Differentially private sequential data publication via variable-length n-grams
Publishing set-valued data via differential privacy
Blowfish privacy: Tuning privacy-utility trade-offs using policies
Bayesian differential privacy on correlated data
Privacy against statistical inference
Privacy-utility tradeoff under statistical uncertainty
Correlated network data publication via differential privacy
Correlated differential privacy: Hiding information in non-iid data set
Dependence makes you vulnerable: Differential privacy under dependent tuples
Networks, Crowds, and Markets: Reasoning About a Highly Connected World
Complex Graphs and Networks (CBMS Regional Conference Series in Mathematics)
Outlier privacy
Computational differential privacy (Annual International Cryptology Conference)
Towards privacy for social networks: A zero-knowledge based definition of privacy
The limits of two-party differential privacy
Accuracy-privacy tradeoffs for two-party differentially private protocols
Distributed private data analysis: Simultaneously solving how and what
Extremal mechanisms for local differential privacy
Do distributed differentially-private protocols require oblivious transfer?
Limits of computational differential privacy in the client/server setting
Separating computational and statistical differential privacy in the client-server model
Black-box separations for differentially private protocols
Differentially private data release for data mining
From online behaviors to offline retailing
Composition attacks and auxiliary information in data privacy
Differential privacy under continual observation
Differentially private transit data publication: A case study on the Montreal transportation system
Security-control methods for statistical databases: A comparative study
l-diversity: Privacy beyond k-anonymity
t-closeness: Privacy beyond k-anonymity and l-diversity
No free lunch in data privacy
Pufferfish privacy mechanisms for correlated data
Information-theoretic foundations of differential privacy
Information-theoretic bounds for differentially private mechanisms
New directions in cryptography
Paths, trees and flowers

Notation

H(X): the entropy of X
X^(i): the random vector (X_1, . . . , X_{i−1}, ⊥, X_{i+1}, . . . , X_n), where ⊥ denotes an empty record
X: the sequence of the individuals (X_1, . . . , X_n)
Z: the universe of record sequences, ∏_{i=1}^{n} X_i
D: the universe of datasets
R: a set containing the query function f's codomain {f(x) : x ∈ D}
P: the universe of probability distributions over Z (or over D)
∆: a subset of P
X ∈ ∆: the probability distribution of X is in ∆
k: the maximum number of dependent individuals
δ: the dependent extent among the individuals
τ: the parameter measuring the adversary's uncertainty about each individual
ℓ: the parameter measuring the number of unknown individuals
P^δ: the subset of P with dependent parameter ≤ δ
P_k: the subset of P with dependent parameter ≤ k
_τ^ℓP: the subset of P with parameters τ, ℓ
_τ^ℓP_k^δ: the set P_k ∩ P^δ ∩ _τ^ℓP
PPT: the abbreviation of "probabilistic polynomial time"
P^ppt: the subset of P that the PPT adversaries can evaluate