title: GenShare: Sharing Accurate Differentially-Private Statistics for Genomic Datasets with Dependent Tuples
authors: Alserr, Nour Almadhoun; Ulusoy, Ozgur; Ayday, Erman; Mutlu, Onur
date: 2021-12-30

Motivation: Falling DNA sequencing costs have led to a quantum leap in the availability of genomic data. While sharing genomic data across researchers is an essential driver of advances in health and biomedical research, sharing is often infeasible due to data privacy concerns. Differential privacy (DP) is one of the rigorous mechanisms used to facilitate the sharing of aggregate statistics from genomic datasets without disclosing any private individual-level data. However, differential privacy can still divulge sensitive information about the dataset participants due to the correlation between dataset tuples.

Results: Here, we propose GenShare, a model built upon Laplace-perturbation-mechanism-based DP, to introduce a privacy-preserving query-answering sharing model for statistical genomic datasets that include dependency due to the inherent correlations between genomes of individuals (i.e., family ties). We demonstrate our privacy improvement over the state-of-the-art approaches for a range of practical queries including cohort discovery, minor allele frequency, and χ² association tests. With a fine-grained analysis of sensitivity in the Laplace perturbation mechanism and by considering joint distributions, GenShare nearly achieves the formal privacy guarantees that the theory of differential privacy permits for queries computed over independent tuples (differences of at most 6%). GenShare ensures that query results are as accurate as theoretically guaranteed by differential privacy.
To empower advances in different scientific and medical research areas, GenShare presents a path toward an interactive genomic data sharing system when the datasets include participants with familial relationships.

Fast-paced high-throughput sequencing technologies are generating a tsunami of large-scale datasets and biobanks. The number of sequenced human genomes has been increasing at an exponential rate, and about 2.5 million genomes have now been sequenced around the world. This number is projected to reach 105 million genomes in 2025 (1), especially after the COVID-19 pandemic, during which many countries have decided to study genomic data at a population scale. These rich troves of data are becoming the keystone for empowering advances in medical science. Researchers need large genomic datasets that they can leverage to gain a better understanding of 1) the genetic basis of the human genome, including associations between phenotypes and specific parts of DNA, and 2) disease diagnosis and treatment (e.g., personalized medicine (2)). However, since the human genome is the utmost personal identifier, sharing genomic data is normally discouraged due to privacy concerns and the possible legal, ethical, and financial consequences, as well as the data protection guidelines in many countries. Hence, sharing genomic data while preserving the privacy of individuals has been challenging for many different fields (e.g., medicine, science, bioinformatics) (3). The challenge worsens when sharing large datasets or their statistics, as they are usually vulnerable to privacy leaks due to the inherent correlations between genomes of participating family members (4; 5).
In the hope of sharing genomic datasets and gaining more accurate and refined biomedical insights, researchers have proposed applying the differential privacy (DP) concept (6) as a protective measure against several inference attacks on genomic datasets (e.g., the Homer attack (7)). Informally, a (randomized) algorithm A is differentially private if its output distribution is approximately the same when executed on two inputs (e.g., datasets D and D′) that differ by the presence of a single individual's data (i.e., neighboring datasets). This condition prevents an adversary with access to the algorithm output from learning anything substantial about any one individual, since the probability of observing a certain outcome for the neighboring datasets does not differ by more than a multiplicative factor of exp(ε). ε is referred to as the privacy budget, where smaller values of ε give stronger guarantees of privacy. DP methods are widely used for privately sharing summary statistics after adding adequate noise. One common DP approach is to add Laplace noise (i.e., the Laplace perturbation mechanism (LPM)) (8) based on the global sensitivity (GS) of the statistical query (i.e., the maximum difference between the query results A(D) and A(D′) is at most GS(A)). (9; 10; 11) developed differentially private algorithms that release different queries in a privacy-preserving way from statistical genomic studies, such as genome-wide association studies (GWAS). These queries include but are not limited to 1) count or cohort discovery: querying how many participants in the dataset satisfy given criteria, 2) χ² association tests: computing χ² statistics for a point mutation (single nucleotide polymorphism, SNP), and 3) minor allele frequency (MAF): computing the frequency at which the rare nucleotide occurs at a particular SNP.
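The Laplace perturbation mechanism described above can be illustrated with a minimal sketch. The function names, the fixed seed, and the example count and sensitivity values are assumptions made for illustration and are not taken from the paper; only the rule "noise scale = global sensitivity / ε" comes from the text.

```python
import random


def sample_laplace(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, global_sensitivity, epsilon, rng):
    """Return a noisy query answer using Laplace noise with scale GS / epsilon."""
    if epsilon <= 0 or global_sensitivity <= 0:
        raise ValueError("epsilon and global_sensitivity must be positive")
    scale = global_sensitivity / epsilon
    return true_answer + sample_laplace(scale, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the demo is repeatable
    true_count = 120        # illustrative count-query answer (assumption)
    for eps in (0.1, 1.0, 4.0):
        noisy = laplace_mechanism(true_count, global_sensitivity=2.0,
                                  epsilon=eps, rng=rng)
        print(f"epsilon={eps}: noisy count = {noisy:.2f}")
```

Smaller ε values produce a larger noise scale, which is the privacy/utility trade-off the paragraph describes.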
Despite the rigorous mathematical foundation of DP (12; 13; 14) and the fact that only aggregate-level information is shared, DP mechanisms can still leak sensitive information about the participating individuals if the dataset includes dependent tuples (i.e., family members). Genomic datasets commonly have dependency between their tuples (or records) due to the inherent correlations between genomes of individuals that have family ties. In our previous work (4; 5), we demonstrate the feasibility of attribute and membership inference attacks on differentially private query results by exploiting the dependence between tuples. Our evaluation over real-world statistical genomic datasets shows how kinship relations between individuals participating in a genomic dataset cause a significant reduction in the privacy guarantees of traditional DP-based mechanisms. Current studies have attempted to propose general mechanisms to tackle this problem, such as Pufferfish (15) and its extensions (16; 17; 18; 19). However, these efforts fail to capture the statistical relationships between dependent tuples in genomic datasets, and hence result in sub-optimal solutions with limited effectiveness in practice. They either lack privacy (degrade rigorous guarantees of privacy) or utility (introduce an excessive amount of noise), as we show in our evaluation in Section 4. Therefore, there is a critical need for a fine-grained analysis of LPM sensitivity, considering different queries over genomic datasets, to enable privacy-preserving genomic data sharing when the dataset includes dependent tuples. This will encourage both healthcare stakeholders and data donors (including families) to widely share and use such valuable data resources.
Our goal in this paper is to enable privacy-preserving sharing of summary statistics from genomic data with dependent tuples by achieving the privacy and utility (encompassing accuracy) guarantees that standard DP provides when all the participants of the dataset are independent (i.e., independent tuples). To achieve this goal, we aim to preserve the privacy of the genomic data donors by analyzing and perturbing the query results using controlled noise in order to minimize the probability of leaking undesired information. We propose the GenShare model, which provides the rigorous theoretical guarantees of the DP formulation in terms of privacy and utility. The key idea of GenShare is to 1) theoretically analyze statistical relationships between the tuples in the genomic datasets to infer both pairwise correlations and complex joint correlations between multiple participants, 2) compute the dependent sensitivity (σ), i.e., how much each query can reveal from such statistical relationships, and 3) take effective DP protective measures based on each query's sensitivity. Focusing on three types of real-world queries, (1) count or cohort discovery, (2) MAF, and (3) χ² tests, we empirically demonstrate the privacy and utility improvements of our proposed DP-based mechanism for each query type. We provide a use case showing how GenShare could be used to enable data sharing with privacy. Our key theoretical advances show that an LPM-based approach, combined with a fine-grained computation of the sensitivity performed by the data owner (i.e., the entity which collects/generates the genomic dataset), provably achieves the expected data utility of the shared query results while maintaining the privacy guarantees of DP that can be obtained when the query is computed over independent tuples.
This paper makes the following contributions:
• Introducing a query-answering sharing model, "GenShare", for genomic datasets with formal privacy guarantees, while ensuring that the query results are as accurate as theoretically guaranteed by DP.
• Providing an effective LPM-based analysis based on the dependent and independent tuples included in the query computations, which is more accurate and robust than most similar existing approaches.
• Following the real-world workflows in recent studies for different queries, we show the robustness of GenShare using a range of queries such as cohort discovery, MAF, and χ² over real-world statistical genomic datasets.
• Achieving almost the same privacy guarantees (in terms of estimation error, which is commonly used to quantify genomic privacy) as the query that is computed over independent tuples.

To our knowledge, GenShare is the first model that dynamically and effectively tailors the DP protective measures based on each query's sensitivity to protect the privacy of individuals with simple or complex correlations who participate in the genomic dataset, while simultaneously maximizing the benefits of data sharing for science.

The rest of this paper is organized as follows: Section 2 presents related prior work on DP mechanisms under dependent tuples. Section 3 explains our proposed privacy model "GenShare", followed by Section 4, where we evaluate our proposed GenShare model and compare it to the state-of-the-art mechanisms. Section 5 presents the conclusion and highlights future research directions suggested by this paper.

Several studies have questioned whether DP is valid for correlated data. (15) was the first to raise the issue of privacy degradation when DP is applied over a dataset with correlated tuples.
To this end, existing solutions that try to handle the correlation between tuples in the datasets can be categorized into two types, by considering: 1) the dependency between different tuples (i.e., individual-individual correlations), and 2) the dependency among a single individual's data at different time-series (stream) entries (i.e., temporal correlations). ... destroying the data utility. As a generalization of DP, (21) proposes another general and customizable method called Pufferfish to handle the dependent tuples by adjusting the Laplace scale; however, the main challenge of Pufferfish is the lack of suitable mechanisms to achieve the expected privacy guarantees. Following this general approach of Pufferfish, the baseline approach proposed by (17) tries to handle the correlation by multiplying the original sensitivity of the query by the number of correlated records b (i.e., query sensitivity = b × original query sensitivity). Bayesian DP (22) uses a modification of Pufferfish, but it only focuses on modeling the correlation between tuples with Gaussian Markov random fields. The subsequent studies, such as (18; 19; 4), try to adjust the sensitivity by introducing dependence coefficients according to the number of correlated data, considering the pairwise correlation between dataset tuples as in (18) or using heuristic analysis (empirically computed query sensitivity) as in (4). Following the second setting to handle temporal correlations, (23; 29) propose sharing statistics and counts of a data stream considering horizontal correlations. (23) proposes two algorithms, the Wasserstein mechanism and the Markov Quilt mechanism, for when the correlations can be modeled by a Bayesian network. (24) also considers temporal correlation, which can be modeled by a Markov chain. In Section 4.4, we compare our model (in terms of privacy) with the existing similar approaches from the two aforementioned categories (18; 4; 23; 20).
Since hidden Markov models are not suited to modeling statistical genomic datasets, we do not compare our model with the mechanisms that propose hidden Markov-based models (22; 23; 29).

As discussed in Section 2, some researchers have proposed general mechanisms to tackle the degradation in the privacy guarantees of DP that happens on account of the dependency between database tuples (21). However, this privacy risk has not yet been studied for statistical genomic datasets (which potentially include many dependent tuples due to dependency/correlations between genomes of individuals that have family ties), and existing mitigations (17; 19; 29; 18; 4) fail to theoretically capture the statistical relationships between dependent tuples in genomic datasets, and hence result in sub-optimal solutions in terms of privacy and utility. As a first step towards mitigating this risk, following a similar analysis as in (18) (but modeling the correlations differently, i.e., considering joint correlations), we propose GenShare as a formalization of the ε-DP notion for genomic datasets with dependent tuples. Among all family trees in a dataset D, we denote the one with the strongest relationships (i.e., the one with the largest aggregate kinship coefficient between any individual and the other family members) as the strongest dependent tuple set and represent it as B (|B| = b). We let D and D′ be neighboring datasets with b dependent tuples (i.e., among b dependent tuples, D and D′ differ in one record) if the change of one tuple value in D causes a change of at most (b−1) tuple values in D′.
Thus, we define GenShare for genomic datasets with dependent tuples using this notion of neighboring datasets, and to achieve the guarantees of ε-DP, we re-formulate LPM by introducing a new fine-grained "sensitivity" definition ς for genomic datasets that include dependent tuples, as follows:

Proof: To prove Theorem 3.1 and compute σ(B), we consider a simple query function to publish a sanitized version D̂ of a dataset D with b dependent tuples. Among these b dependent tuples, we have the participant j and the participants in set ψ, where ψ may contain more than one tuple. To satisfy ε-DP under this scenario we have:

where A is a randomized algorithm, $\hat{x}^i$ represents the sanitized version of a data point (SNP) i, and $x^i_j$ represents the SNP i value of individual j. To achieve ε-DP, we add Laplace noise proportional to the query's global sensitivity by using a proper Laplace scale ω for the Laplace distribution, where ω = ∆Q/ε. Our goal is to find a proper scaling factor ω when sharing statistics from a dataset with dependent tuples by changing the original global sensitivity ∆Q to ς. By transforming the left-hand side of Equation 1 using the law of total probabilities, we have:

Here, ā is a vector representing the values of the SNPs in $x^i_\psi$. A includes the set of vectors for potential values of ā (considering Mendel's law and the relationships of the dependent tuples in the dataset). Also, f(ā) is a function that computes the sum of SNP values in ā. To compute the potential values in ā, we develop probabilistic models representing the evolution of an SNP value over multiple generations.
For this, based on Mendel's law, we find the family relationships between individuals and compute the probabilities of moving from one SNP value to another, from one generation to the next. The right-hand side of Equation 2 contains two terms: the first (left) term considers the change in SNP i of individual j from the value h to h', and the second (right) term considers the change in SNP i of the individuals in ψ (due to the dependency between j and the individuals in ψ) given the change in $x^i_j$ from the value h to h'. For the first term of the right-hand side of Equation 2, we have:

where $\Delta x^i_j$ represents ∆Q, which is the maximum change in $x^i_j$ from the value h to h'. If we ignore the second term of the right-hand side, the scale ω for the Laplace distribution is ω = $\Delta x^i_j$/ε, which is compatible with the Laplace scale in the standard DP mechanism. To study the effect of the maximum change in an individual j's data on the b−1 dependent tuples (in ψ), we focus on the second term of the right-hand side of Equation 2 to define σ as follows:

Combining Equations 1-5, we have:

Therefore, we represent the dependent sensitivity for sharing the results of query Q over a genomic dataset with dependent tuples as ς = $\Delta x^i_j$ + σ(B) = ∆Q + σ(B). We derive the dependent sensitivity σ(B) as:

In practice, depending on which individuals a query is computed over, the strongest dependent tuple set B among those individuals is determined first, and then the corresponding dependent sensitivity σ(B) is computed. Furthermore, we observe that the inference power of an adversary may be affected by the number of dependent tuples (i.e., family members) and independent tuples (i.e., unrelated members) included in the query results. Hence, in our sensitivity analysis, the value of ω (i.e., the LPM scale) can be chosen to match the adequate value of σ(B). We show our heuristic analysis on how to choose ω in Section 4.4.
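The generation-to-generation SNP probabilities mentioned above rest on Mendel's law of segregation, which can be sketched as follows. This is only an illustration of the standard inheritance table with genotypes encoded as minor-allele counts (0, 1, or 2); the function names are hypothetical, the parents are assumed to be independent of each other, and the sketch is not the paper's derivation of σ(B).

```python
from fractions import Fraction


def allele_pass_probability(genotype):
    """Probability that a parent with `genotype` minor alleles (0, 1, or 2)
    transmits the minor allele to a child."""
    return Fraction(genotype, 2)


def child_genotype_distribution(father, mother):
    """Distribution over the child's minor-allele count given both parents' genotypes."""
    pf = allele_pass_probability(father)
    pm = allele_pass_probability(mother)
    return {
        0: (1 - pf) * (1 - pm),
        1: pf * (1 - pm) + (1 - pf) * pm,
        2: pf * pm,
    }


def child_distribution_from_parent_distributions(father_dist, mother_dist):
    """Propagate uncertainty one generation further.

    Assumption: the two parents' genotypes are independent of each other.
    """
    result = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
    for gf, p_f in father_dist.items():
        for gm, p_m in mother_dist.items():
            for gc, p_c in child_genotype_distribution(gf, gm).items():
                result[gc] += p_f * p_m * p_c
    return result


if __name__ == "__main__":
    # Two heterozygous parents.
    print(child_genotype_distribution(1, 1))
    # Grandchild of two heterozygous grandparents, other parent fixed at genotype 0.
    parent = child_genotype_distribution(1, 1)
    print(child_distribution_from_parent_distributions(parent, {0: 1, 1: 0, 2: 0}))
```

Chaining the second function across generations gives the kind of multi-generation probability model that the text refers to, under the stated independence assumption.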
To clarify our previous computations, here we consider a simple query function to publish a sanitized version D̂ of a dataset D with b dependent tuples. Among these b dependent tuples we have the participants j, k, and o, where k, o ∈ ψ. To satisfy ε-DP for genomic datasets with dependent tuples we have:

By transforming the left-hand side of Equation 8 using the law of total probabilities, we have:

Therefore, we derive the dependent sensitivity σ as:

Let dataset D include n individuals and m SNPs. We assume a statistical query (e.g., MAF) is computed over q participants in D, including a target j and p other dataset participants (q = 1 + p). Set F (|F| = f ≤ d) includes individuals from the same family (i.e., target j and his/her family members), and set U (|U| = u) includes the other unrelated members (non-relatives) in the dataset. We show the overview of our proposed GenShare model in Figure 1. The entity which collects/generates the genomic dataset is the "data owner", and the data owner can share statistics about its dataset with a client (i.e., a researcher or physician). This is a common way to share research findings. Following the attack scenario proposed by (4), to limit the number of dataset members included in the query result, the client (or adversary) sends its query specified by some demographic properties (e.g., age, address). As an example, we consider here the MAF query by the client (or adversary). First, the data owner computes the result of the query on the dataset and, meanwhile, determines the number of family members f and unrelated members u included in the query results. Based on that, the data owner computes σ(B), applies LPM to the query results, and sends them to the client. The data owner reports (i) the query result (MAF of all SNP values for the dataset participants that are considered in the query computation) and (ii) the number of dataset participants that are used to compute the query results (q).
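The data-owner workflow just described can be outlined as a short sketch for a single SNP. All names here are hypothetical, and `sigma_fn` is a placeholder: the closed form of σ(B) is not reproduced in this text, so the caller must supply it. The only parts taken from the paper are the steps themselves (count f and u, form ς = ∆Q + σ(B), add Laplace noise with scale ς/ε, report the noisy result and q).

```python
import random


def sample_laplace(scale, rng):
    """Zero-mean Laplace sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def minor_allele_frequency(genotypes):
    """MAF at one SNP from minor-allele counts (0, 1, or 2) of diploid individuals."""
    return sum(genotypes) / (2.0 * len(genotypes))


def answer_maf_query(selected, family_ids, delta_q, sigma_fn, epsilon, rng):
    """Data-owner side of one MAF query at a single SNP.

    selected   : dict {participant_id: minor-allele count} matching the client's filter
    family_ids : set of ids in the target's family (set F, including the target)
    delta_q    : global sensitivity of the query for independent tuples
    sigma_fn   : hypothetical callable (f, u) -> sigma(B); its form is an assumption
    """
    q = len(selected)
    if q == 0:
        raise ValueError("the query selected no participants")
    f = sum(1 for pid in selected if pid in family_ids)  # family members in the query
    u = q - f                                            # unrelated members in the query
    true_maf = minor_allele_frequency(list(selected.values()))
    dependent_sensitivity = delta_q + sigma_fn(f, u)     # varsigma = delta_q + sigma(B)
    scale = dependent_sensitivity / epsilon              # Laplace scale omega
    noisy_maf = true_maf + sample_laplace(scale, rng)
    return noisy_maf, q                                  # what the data owner reports


if __name__ == "__main__":
    selected = {"j": 1, "mother": 2, "father": 0, "u1": 1, "u2": 0}
    family = {"j", "mother", "father"}
    # Placeholder sigma: zero when at most one family member is in the query.
    sigma = lambda f, u: 0.0 if f <= 1 else 0.5
    print(answer_maf_query(selected, family, delta_q=0.1, sigma_fn=sigma,
                           epsilon=1.0, rng=random.Random(0)))
```

Passing σ(B) in as a callable keeps the sketch honest about what is and is not specified here, while still showing where f and u enter the computation.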
To evaluate the privacy performance of our proposed model GenShare, we use the correctness metric over a real-world statistical genomic dataset to show the robustness of GenShare. We next discuss our evaluation in detail.

We combine three statistical genomic datasets that include genomic data of 1) family members and 2) unrelated members (non-relatives). Our final genomic dataset contains partial DNA sequences from:
• CEPH/Utah Pedigree 1463 (25): to obtain the genotypes of 10 family members (originally 17 members) from variant call format (VCF) files.
• Manuel Corpas (MC) Family Pedigree (26): to obtain the genotypes of a scientist named Manuel Corpas (the target in our experiments) and his 4 family members.
• 1000 Genomes phase 3 data (27): to obtain data for the unrelated individuals from the same or a different population than the target and his family members. We extracted the genotypes from chromosomes 1 and 22 for 2504 participants from 23 populations using the Beagle genetic analysis package (28) (to extract the number of minor alleles for each SNP).

In a statistical genomic dataset (e.g., GWAS) with n individuals and m SNPs, (9) computes the sensitivity for privacy-preserving release of cell counts as 2 (i.e., Laplace noise with scale 2/ε), while the MAF sensitivity can be computed as 2m/n and that of χ² statistics as 4n/(n+2). (11) claim that adding Laplace noise with scale 2 to the cell counts of a genomic dataset results in accurate χ² statistics or p-values. In GenShare, we use these algorithms to calculate the global sensitivity of the queries, ∆Q.

We use the correctness metric to quantify the privacy-preserving guarantees of GenShare. Estimation error is used to quantify the correctness by measuring the distance Dist between the true value of the SNP and the value inferred by the client (e.g., adversary).
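As a small worked illustration of the global sensitivities attributed to (9) in this subsection, the sketch below evaluates them for given n and m. The function names and the example values of n, m, and ε are assumptions, and the forms 2m/n and 4n/(n+2) follow this text's reading of the quoted expressions rather than an independent derivation.

```python
def count_sensitivity():
    """Global sensitivity of cell counts, as quoted from (9)."""
    return 2.0


def maf_sensitivity(m, n):
    """Global sensitivity of MAF over m SNPs and n individuals (read as 2m/n)."""
    return 2.0 * m / n


def chi_square_sensitivity(n):
    """Global sensitivity of the chi-square statistic (read as 4n/(n+2))."""
    return 4.0 * n / (n + 2.0)


def laplace_scale(sensitivity, epsilon):
    """Laplace scale used by LPM: sensitivity divided by the privacy budget."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


if __name__ == "__main__":
    n, m, eps = 1000, 100, 0.5  # illustrative values, not from the paper
    for name, s in (("count", count_sensitivity()),
                    ("MAF", maf_sensitivity(m, n)),
                    ("chi-square", chi_square_sensitivity(n))):
        print(f"{name}: sensitivity={s:.4f}, Laplace scale={laplace_scale(s, eps):.4f}")
```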
For a statistical genomic dataset D with m SNPs, we measure the expected estimation error E as follows:

Here, $x^i_j$ is the true value of SNP i for the target individual j, while $\hat{x}^i_j$ is the estimated value. We can compute the probabilities for $\hat{x}^i_j$ using the Mendelian inheritance probabilities for SNP i given all the potential SNP values (i.e., 0, 1, or 2) for $x^i_j$ (represented as $D^i_j$).

As discussed in Section 4.1, we use a dataset D to evaluate GenShare and compare it with the state-of-the-art mechanisms. D includes n individuals (n = 2520) and m SNPs for each individual (m = 1000). To infer the values of these m SNPs, we repeat our experiments 10 times, considering 100 SNPs (i.e., 100 queries are performed) each time. In our evaluation, we assume that the query can include the target (e.g., individual j) with 1) a direct family member, 2) multiple family members, or 3) multiple family members and other unrelated individuals. We compare our model (in terms of privacy) with the existing similar work (discussed in Section 2) such as (18; 4; 23; 20). Since hidden Markov models are not suited to modeling kinship relations in a genomic dataset, we do not compare our model with the mechanisms that propose hidden Markov-based models. In the following, we compare our proposed model (referred to as "GenShare" in the figures) with: (i) the independence assumption (referred to as "Independent Assumption" in the figures), to show that GenShare prevents any client from exploiting the dependencies among the dataset tuples to infer more sensitive attributes about dataset participants (in other words, we aim to achieve the privacy guarantees of standard DP assuming all the participants of the dataset are independent), (ii) the mitigation algorithm proposed in (4) (referred to as "Almadhoun et. al." in the figures), (iii) the dependent sensitivity mechanism proposed in (18) (referred to as "Liu et. al." in the figures), (iv) the Wasserstein algorithm proposed in (23) (referred to as "Wasserstein" in the figures), and (v) Group DP proposed in (20) (referred to as "Group DP" in the figures).

In Figure 2, we evaluate the effect of different values of the privacy budget ε on the adversary's correctness in inferring the targeted m SNPs, considering a different number of family members included in the query results. We evaluate the estimation error using 18 different ε values (i.e., ε is not continuous, 0.1 ≤ ε ≤ 4) divided into 4 intervals, as shown in the legend of Figure 2. Here, the count query (used in cohort discovery) results include the statistics from the family members only. First, we include 1 first-degree family member (e.g., mother or father) from the MC family with the target j. Second, we include both the mother and father with the target j in the query results. Third, we include the father, mother, and sister in the query results. Last, we consider a second-degree family member (the aunt of the target j) in the query results along with the father, mother, and sister of the target, as shown on the x-axis of Figure 2. We make the following key observations: (i) GenShare achieves the best privacy overall; it provides almost the same privacy guarantees (in terms of estimation error) as the query that is computed over independent tuples (i.e., the independence assumption). Hence, our model nearly achieves the standard differential privacy guarantees without any degradation in terms of privacy or utility across several ε values. (ii) Existing techniques generally cannot optimize their schemes to achieve the required privacy and utility guarantees. They either add too much noise (e.g., f = 2 members in the figure) or degrade rigorous guarantees of privacy (e.g., when f ≥ 3 members). (iii) As expected under DP, decreasing the privacy budget ε (from ε = 4 down to ε = 0.1) increases the privacy guarantees while decreasing the utility guarantees.
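The estimation-error metric used in these comparisons can be illustrated with a minimal sketch. The exact formula is not reproduced in this text, so the version below makes two explicit assumptions: Dist is taken to be the absolute difference between SNP values, and the adversary's belief about each SNP is given as a probability distribution over {0, 1, 2}. Function names are hypothetical.

```python
def expected_estimation_error(true_value, posterior):
    """Expected distance between the true SNP value and the adversary's estimate.

    true_value : true minor-allele count (0, 1, or 2)
    posterior  : dict {estimated value: probability} held by the adversary
    Assumption: Dist(a, b) = |a - b|.
    """
    total = sum(posterior.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("posterior probabilities must sum to 1")
    return sum(p * abs(true_value - guess) for guess, p in posterior.items())


def mean_estimation_error(true_values, posteriors):
    """Average the per-SNP expected error over all targeted SNPs."""
    if len(true_values) != len(posteriors) or not true_values:
        raise ValueError("need one posterior per SNP and at least one SNP")
    errors = [expected_estimation_error(x, p)
              for x, p in zip(true_values, posteriors)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    truths = [1, 2]
    beliefs = [{0: 0.25, 1: 0.5, 2: 0.25},  # uncertain adversary
               {0: 0.0, 1: 0.0, 2: 1.0}]    # adversary fully certain and correct
    print(mean_estimation_error(truths, beliefs))
```

Under this reading, a higher value means the adversary's inference is further from the truth, i.e., better privacy for the target.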
Next, in Figure 3, we include family members (father and mother) and other unrelated members (u = 5 in Figure 3(a) and u = 10 in Figure 3(b)) with the target j to evaluate the effect of different values of the privacy budget ε on the adversary's correctness in inferring the targeted m SNPs. Considering a count query, we observe that GenShare achieves better privacy for various privacy budgets than the existing techniques, even when the query results include unrelated members, as illustrated in Figure 3. Figure 4 shows that GenShare is equivalent to the standard DP mechanism when the query results only include unrelated members, unlike the existing techniques (18; 23; 20), which compute the dependent sensitivity based on the number of dependent tuples in the dataset, ignoring whether these dependent tuples are included in the query or not.

In our sensitivity analysis in Section 3, we observe that the inference power of an attacker decreases with an increasing number of independent tuples in the query computation. Hence, the value of ω (i.e., the LPM scale, ∆Q/ε) can be chosen to match the adequate value of σ(B) considering the number of dependent and independent tuples in the query computation. Since ∆Q (i.e., the query sensitivity) is computed considering the query type (illustrated in Section 4.2), the data owner in our model can choose ω accordingly (Figure 5). As expected, adding more unrelated members to the query results leads to more precise sensitivity computations until reaching the sensitivity of the standard DP mechanism (i.e., ς = ∆Q).

Next, we compare the performance of GenShare when first-degree or second-degree family members (from the MC and Utah families) are included in the query computations with the target. Our results show the robustness of GenShare regardless of the degree of familial relationship between the dataset tuples.
The differences in privacy guarantees between GenShare and the "Independent Assumption" do not exceed 5% across a range of privacy parameters ε, with respect to estimation error (Figure 6). Finally, we compare the performance of GenShare for different query types, e.g., count, MAF, and χ² tests. As expected, we observe that the differentially private statistics calculated based on GenShare closely match the privacy guarantees of DP under the "Independent Assumption", with a difference of up to 6% in terms of estimation error across a range of privacy parameters ε (Figure 7). Overall, our results illustrate the theoretical boundaries of leveraging LPM-based DP for mitigating the "tuples dependency" privacy risk in genomic query-answering systems.

GenShare is vital for genomic data sharing and, in a broader sense, it will also have implications for medical data sharing. Considering i) the importance of sharing statistical genomic and medical datasets (an aim that many institutes are seeking to achieve) for high-impact medical research (e.g., NIH recently awarded $73 million to collect and archive information on genes and genomic variants for precision medicine (30)) and ii) the sensitivity of the (personal) information in these datasets (especially given the high probability of having families in these genomic datasets), data owners should be very careful when sharing data related to such datasets. Moreover, GenShare can be utilized to provide strong insights to several clients from different parties about each other's datasets (e.g., before they exchange datasets for joint research). Such a privacy-preserving sharing mechanism may help accelerate the data sharing process across researchers, especially given the strict worldwide data protection regulations for sharing and exchanging data.

Fig. 7: Comparison between applying GenShare for count, MAF, and χ² queries. GenShare reduces the differences from "Independent Assumption" privacy guarantees (in terms of estimation error), considering different ε values and the 3 query types.

Differential privacy provides a theoretical notion of privacy with formal guarantees that the distribution of query results changes only slightly with the addition or removal of a single tuple in the dataset. However, privacy guarantees of DP-based solutions are based on the assumption that all tuples in the dataset are independent. In reality, genomic data from different individuals may be dependent due to the familial ties between them. In this paper, we propose GenShare to provide countermeasures against privacy risks due to dependent tuples in statistical genomic datasets. To achieve the privacy and utility guarantees theoretically provided by DP, GenShare captures the joint statistical relationships between dependent tuples in the genomic datasets. Our results show that GenShare provides a significant improvement in the privacy and utility guarantees over existing mechanisms across a range of privacy parameters ε. All of these contributions will benefit the medical and genomics research community in the long run and help realize the promise of privacy-preserving access to the genomic datasets that future health information exchange systems will rely upon.

There are several directions that merit further research. It may be possible to consider: 1) more concepts in differential privacy, such as local sensitivity, 2) complex tasks and applications such as federated machine learning, and 3) different settings, e.g., a larger number of queries or composing multiple queries.
(1) Big data: astronomical or genomical?
(2) Rapid whole-genome sequencing decreases infant morbidity and cost of hospitalization
(3) Privacy challenges and research opportunities for genomic data sharing
(4) Differential privacy under dependent tuples: the case of genomic privacy
(5) Inference attacks against differentially private query results from genomic datasets including dependent tuples
(6) Differential privacy: A survey of results
(7) Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays
(8) Smooth sensitivity and sampling in private data analysis
(9) Privacy-preserving data sharing for genome-wide association studies
(10) Scalable privacy-preserving data sharing methodology for genome-wide association studies
(11) Privacy-preserving data exploration in genome-wide association studies
(12) Privacy-preserving biomedical database queries with optimal privacy-utility trade-offs
(13) Achieving differential privacy of genomic data releasing via belief propagation
(14) Enabling secure and privacy-preserving exploration of distributed clinical and genomic data
(15) No free lunch in data privacy
(16) Blowfish privacy: Tuning privacy-utility trade-offs using policies
(17) Correlated network data publication via differential privacy. The VLDB Journal: The International Journal on Very Large Data Bases
(18) Dependence makes you vulnerable: Differential privacy under dependent tuples. NDSS. 16 pp
(19) Dependent differential privacy for correlated data
(20) The algorithmic foundations of differential privacy
(21) A rigorous and customizable framework for privacy
(22) Bayesian differential privacy on correlated data
(23) Pufferfish privacy mechanisms for correlated data
(24) Quantifying differential privacy under temporal correlations
(25) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays
(26) Crowdsourcing the corpasome
(27) A global reference for human genetic variation
(28) A one-penny imputed genome from next-generation reference panels
(29) Pegasus: Data-adaptive differentially private stream processing
(30) The National Human Genome Research Institute: NIH awards $73m to continue building resource of genes and genomic variants for precision medicine