Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty David Gerard Department of Mathematics and Statistics, American University, Washington, DC, 20016, USA Abstract Linkage disequilibrium (LD) estimates are often calculated genome-wide for use in many tasks, such as SNP pruning and LD decay estimation. However, in the presence of genotype uncertainty, naive approaches to calculating LD have extreme attenuation biases, incorrectly suggesting that SNPs are less dependent than in reality. These biases are particularly strong in polyploid organisms, which often exhibit greater levels of genotype uncertainty than diploids. A principled approach using maximum likelihood estimation with genotype likelihoods can reduce this bias, but is prohibitively slow for genome-wide applications. Here, we present scalable moment-based adjustments to LD estimates based on the marginal posterior distributions of the genotypes. We demonstrate, on both simulated and real data, that these moment-based estimators are as accurate as maximum likelihood estimators, and are almost as fast as naive approaches based only on posterior mean genotypes. This opens up bias-corrected LD estimation to genome-wide applications. Additionally, we provide standard errors for these moment-based estimators. All methods are implemented in the ldsep R package on GitHub https://github. com/dcgerard/ldsep. 1 Introduction Pairwise linkage disequilibrium (LD), the statistical association between alleles at two different loci, has applications in genotype imputation [Wen and Stephens, 2010], genome-wide association studies [Zhu and Stephens, 2018], genomic prediction [Wientjes et al., 2013], population genetics [Slatkin, 2008], and many other tasks [Sved and Hill, 2018]. LD is often estimated from next-generation sequencing technologies, where the genotypes and haplotypes are not known with certainty [Gerard et al., 2018]. Thus, researchers typically use estimated genotypes, such as posterior mean genotypes [Fox et al., 2019], to estimate LD. However, this can cause biased LD estimates, attenuated toward zero, implying loci are less dependent than in reality. This bias is particularly strong in polyploids, and so in Gerard [2020] we derived maximum likelihood estimates (MLEs) that have lower bias and are consistent estimates of LD. Unfortunately, the MLE approach is prohibitively slow. Researchers typically calculate pairwise LD at genome-wide scales, and the MLE approach takes on the order of a tenth of a second. Thus, for many genome-wide applications, containing millions of SNPs, LD estimation using the MLE approach would take years of computation time. This is not conducive to large-scale applications. Keywords and phrases: attenuation bias, genotype likelihood, linkage disequilibrium, polyploidy, reliability ratio. 1 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://github.com/dcgerard/ldsep https://github.com/dcgerard/ldsep https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ Here, we derive scalable approaches to estimate LD that account for genotype uncertainty (Sec- tion 2). Our methods use only the first two moments of the marginal posterior genotype distribution for each individual at each locus, which are often provided or easily obtainable from many geno- typing programs. We calculate sample moments from these posterior moments, and use these to multiplicatively inflate naive LD estimates. We show, through simulations (Section 3.1) and real data (Section 3.2), that our estimates can reduce attenuation bias and improve LD estimates when genotypes are uncertain. All calculations have computational complexities that are linear in the sample size, and so these estimates are scalable to genome-wide applications. 2 Methods In this section, we will define moment-based estimators of the LD coefficient ∆ [Lewontin and Kojima, 1960], the standardized LD coefficient ∆′ [Lewontin, 1964], and the Pearson correlation ρ [Hill and Robertson, 1968]. We will only consider estimating the “composite” versions of these LD measures which, advantageously, are appropriate LD measures for generic autopolyploid, allopoly- ploid, and segmental allopolyploid populations, even in the absence of Hardy-Weinberg equilibrium [Gerard, 2020]. We will also only consider biallelic loci, where the genotype for each individual is the dosage (from 0 to the ploidy) of one of the two alleles. To define our estimators of LD, we assume the user provides the posterior means and variances for the genotypes for each individual at two loci. The full posterior genotype distribution for each individual is often provided by genotyping software [Gerard et al., 2018, Gerard and Ferrão, 2019, e.g.], from which these posterior moments can be obtained. If genotype posteriors are not provided, genotype likelihoods may be normalized to posterior probabilities (assuming a uniform prior) and used in what follows. Let XiA and XiB be the posterior means at loci A and B for individual i ∈ {1, . . . ,n}. Let YiA and YiB be the posterior variances at loci A and B for individual i. Our estimators are based entirely on the following sample moments of these posterior moments, which may be calculated in linear time in the sample size, n. uxA := 1 n n∑ i=1 XiA, uxB := 1 n n∑ i=1 XiB, (1) vxA := 1 n− 1 n∑ i=1 (XiA −uxA)2, vxB := 1 n− 1 n∑ i=1 (XiA −uxB)2, (2) cx := 1 n− 1 n∑ i=1 (XiA −uxA)(XiB −uxB), (3) uyA := 1 n n∑ i=1 YiA, and uyB := 1 n n∑ i=1 YiB. (4) For a K-ploid species, our LD estimators, which we derive in Section S1, are as follows. The estimated LD coefficient is ∆̂ := ( uyA + vxA vxA )( uyB + vxB vxB )(cx K ) . (5) 2 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ The estimated Pearson correlation is ρ̂ := √ uyA + vxA vxA √ uyB + vxB vxB cx√ vxAvxB . (6) Note that cx/ √ vxAvxB is the sample Pearson correlation between posterior mean genotypes. The estimated standardized LD coefficient is ∆̂′ := ∆̂/∆̂m, where (7) ∆̂m := { min{uxAuxB, (K −uxA)(K −uxB)}/K2 if cx < 0, and min{uxA(K −uxB), (K −uxA)uxB}/K2 if cx > 0. (8) Equations (5)–(7) take the naive estimators most researchers use in practice (the sample covari- ance/correlation of posterior means) and inflate these by a multiplicative effect. Such multiplicative effects are sometimes called “reliability ratios” in the measurement error models literature [Fuller, 2009]. Due to sampling variability, this inflation could result in estimates that lie beyond the theo- retical bounds of the parameters being estimated. In such cases, we apply the following truncations. ρ̃ := { max{ρ̂,−1} if ρ̂ < 0 min{ρ̂, 1} if ρ̂ > 0 (9) ∆̃ := { max{∆̂,− √ (vxA + uyA)(vxB + uyB)/K} if ∆̂ < 0 min{∆̂, √ (vxA + uyA)(vxB + uyB)/K} if ∆̂ > 0 (10) ∆̃′ := { max{∆̂′,−K} if ∆̂′ < 0 min{∆̂′,K} if ∆̂′ > 0 (11) Standard errors are important for hypothesis testing [Brown, 1975], read-depth suggestions [Maruki and Lynch, 2014], and shrinkage [Dey and Stephens, 2018]. Because estimators (5)–(7) are functions of sample moments, deriving their standard errors can be accomplished by appealing to the central limit theorem, followed by an application of the delta method (Section S2). Additional considerations for improving our estimates of the reliability ratios, such as using hierarchical shrinkage [Stephens, 2016], are considered in Section S3. All methods are implemented in the ldsep R package on GitHub https://github.com/dcgerard/ ldsep. 3 Results 3.1 Simulations We compared our moment-based estimators (5)–(7) to those of the MLE of Gerard [2020] as well as the naive estimator that calculates the sample covariance and sample correlation between posterior mean genotypes at two loci. Each replication, we generated genotypes for n ∈{10, 100, 1000} indi- viduals with ploidy K ∈{2, 4, 6, 8} under Hardy-Weinberg equilibrium at two loci with major allele frequencies (pA,pB) ∈{(0.5, 0.5), (0.5, 0.75), (0.9, 0.9)} and Pearson correlation ρ ∈{0, 0.5, 0.9}. We then used updog’s rflexdog() function [Gerard et al., 2018, Gerard and Ferrão, 2019] to generate read-counts at read-depths of either 10 or 100, a sequencing error rate of 0.01, an overdispersion 3 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://github.com/dcgerard/ldsep https://github.com/dcgerard/ldsep https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ value of 0.01, and no allele bias. Updog was then used to generate genotype likelihoods and genotype posterior distributions for each individual at each SNP. These were then fed into ldsep to obtain the MLE, our new moment-based estimator, and the naive estimator. Simulations were replicated 200 times for each unique combination of simulation parameters. The accuracy of estimating ρ2 when pA = pB = 0.5 at a read-depth of 10 is presented in Figure 1. The results for other scenarios are similar and may be found on GitHub (https:// github.com/dcgerard/ldfast_sims). We see that the moment-based estimator and the MLE perform comparably, even for small read-depth and sample size. The naive estimator has a strong attenuation bias toward zero. This bias is particularly prominent for higher ploidy levels. For example, for an octoploid species where the true ρ2 is 0.81, the naive estimator appears to converge to a ρ2 estimate of around 0.25. This bias does not disappear with increasing sample size. Estimated standard errors are reasonably well-behaved, except for ρ̂ and ρ̂2 when the sample size is small and the LD is large (Figure 2). 3.2 LD estimates for Solanum tuberosum We evaluated our methods on the autotetraploid potato (Solanum tuberosum, 2n = 4x = 48) genotyping-by-sequencing data from Uitdewilligen et al. [2013]. We used updog [Gerard et al., 2018, Gerard and Ferrão, 2019] to obtain the posterior moments for each individual’s genotype at each SNP on a single super scaffold (PGSC0003DMB000000192). To remove monoallelic SNPs, we filtered out SNPs with allele frequencies either greater than 0.95 or less than 0.05, and filtered out SNPs with a variance of posterior means less than 0.05. This resulted in 2108 SNPs. We then estimated the squared correlation between each SNP using either the naive approach of calculating the sample Pearson correlation between posterior means, or using our new moment-based approach (6). Our estimators are scalable. On a 1.9 GHz quad-core PC running Linux with 32 GB of memory, it took a total of 1.9 seconds to estimate all pairwise correlations using our new moment-based approach, which is a small increase over the 0.7 seconds it took to estimate all pairwise correlations using the naive approach. In Gerard [2020], we found that the MLE approach took about 0.1 seconds for each pair of SNPs for a tetraploid individual. Extrapolating this to 2108 SNPs would indicate that the MLE approach would take about 2.5 days of computation time to calculate all pairwise LD estimates on this dataset ( ( 2108 2 ) ×0.1sec×1min/60sec×1hr/60min×1d/24hr = 2.57d). The histogram of estimated reliability ratios are presented in Figure 3. We see there that the reliability ratios of most SNPs only increase their correlation estimates by less than 10%. But a not insignificant portion have reliability ratios that increase the correlation estimates by more than 10%. To evaluate the LD estimates of high reliability ratio SNPs, we calculated the MLEs for ρ2 between the twenty SNPs with the largest reliability ratios. A pairs plot for ρ2 estimates between the three approaches is presented in Figure 4. We see there that the MLE and new moment-based approach result in very similar ρ2 estimates, while the naive approach using posterior means results in much smaller ρ2 estimates. 4 Discussion It has been known since at least the time of Spearman that the sample correlation coefficient (or, similarly, the ordinary least squares estimator in simple linear regression) is attenuated in the presence of uncertain variables [Spearman, 1904]. Methods to adjust for this bias include assuming prior knowledge on the measurement variances or the ratio of measurement variances (resulting 4 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://github.com/dcgerard/ldfast_sims https://github.com/dcgerard/ldfast_sims https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ from, for example, repeated measurements on the same individuals) [Koopmans, 1937, Degracie and Fuller, 1972], using instrumental variables [Carter and Fuller, 1980], and using distributional assumptions [Pal, 1980]. See Fuller [2009] for a detailed introduction to this vast field. Our solution was to use sample moments of marginal posterior moments which, to our knowledge, has never been proposed before. It is natural to ask if our methods could be used to account for uncertain genotypes in genome- wide association studies. However, the moment-based techniques we used in this manuscript, when applied to simple linear regression with an additive effects model (where the SNP effect is pro- portional to the dosage), result in the standard ordinary least squares estimates when using the posterior mean as a covariate (Section S4). This supports using the posterior mean as a covariate in simple linear regression with an additive effects model. This is not to say, however, that using the posterior mean is also appropriate for more complicated models of gene action [Rosyara et al., 2016], or for non-linear models. Acknowledgments Most analyses were performed using the R statistical language [R Core Team, 2020]. Data availability All methods discussed in this manuscript are implemented in the ldsep R package, available on GitHub (https://github.com/dcgerard/ldsep) under a GPL-3 license. Scripts to reproduce the results of this research are available on GitHub (https://github.com/dcgerard/ldfast_sims). All datasets used in this manuscript are publicly available [Uitdewilligen et al., 2013] and may be downloaded from: • https://doi.org/10.1371/journal.pone.0062355.s004 • https://doi.org/10.1371/journal.pone.0062355.s007 • https://doi.org/10.1371/journal.pone.0062355.s009 • https://doi.org/10.1371/journal.pone.0062355.s010 5 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://github.com/dcgerard/ldsep https://github.com/dcgerard/ldfast_sims https://doi.org/10.1371/journal.pone.0062355.s004 https://doi.org/10.1371/journal.pone.0062355.s007 https://doi.org/10.1371/journal.pone.0062355.s009 https://doi.org/10.1371/journal.pone.0062355.s010 https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ 5 Figures 0 0.25 0.81 2 4 6 8 10 100 1000 10 100 1000 10 100 1000 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Sample Size ρ̂2 method MLE MoM Naive Figure 1: Estimate of ρ2 (y-axis) for the maximum likelihood estimator [Gerard, 2020] (MLE), our new moment-based estimator (6) (MoM), and the naive squared sample correlation coefficient between posterior mean genotypes (Naive). The x-axis indexes the sample size, the row-facets index the ploidy, and the column-facets index the true ρ2, which is also presented by the horizontal dashed red line. These simulations were performed using a read-depth of 10, and major allele frequencies of 0.5 at each locus. The naive estimator presents a strong attenuation bias toward 0, particularly for higher ploidy regimes. 6 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ ẑ ∆̂′ ∆̂ ρ̂ ρ̂ 2 0.0 0.2 0.4 0.6 0 1 2 0.00 0.05 0.10 0.15 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.0 0.5 1.0 1.5 0.00 0.05 0.10 0.00 0.25 0.50 0.75 1.00 MAD of Estimates M e d ia n o f S ta n d a rd E rr o rs n and ρ other n = 10, ρ = 0.9 Figure 2: Median of estimated standard errors (y-axis) versus median absolute deviations (x-axis) of each of the moment-based LD estimators (facets). The line is the y = x line, and points above this line indicate that the estimated standard errors are typically larger than the true standard errors. Estimated standard error are reasonably unbiased except for ρ̂ and ρ̂2 in scenarios with small sample sizes (n = 10) and a large levels of LD (ρ = 0.9) (color and shape). 0 200 400 600 1.00 1.25 1.50 1.75 2.00 Reliability Ratio Estimate co u n t Figure 3: Histogram of estimated reliability ratios (S69) using the data from Uitdewilligen et al. [2013]. 7 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ MLE MoM Naive M L E M o M N a ive 0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00 0.0 0.2 0.4 0.6 0.8 0 20 40 60 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Figure 4: Pairs plot for ρ2 estimates between the twenty SNPs from Uitdewilligen et al. [2013] with the largest estimated reliability ratios when using either maximum likelihood estimation (MLE) [Gerard, 2020], our new moment-based approach (6) (MoM), or the naive approach using just posterior means (Naive). The dashed line is the y = x line. The MLE and the moment-based approach result in much more similar LD estimates. 8 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ Supplementary Material S1 Derivation of LD estimators In this section, we derive estimators (5)–(7). We do this by assuming a normal model on the data and the genotypes. This is obviously not appropriate when using genotypes and sequencing data, but our simulations in Section 3.1 were also accomplished using sequencing data and resulted in very good performance. Let Gi = (GiA,GiB) ᵀ be the genotype for individual i at loci A and B. Let Zi = (ZiA,ZiB) ᵀ be the data for individual i at loci A and B. Then we let Gi ∼ N2(µ, Σ), and (S1) Zi|Gi ∼ N2(Gi,S), where (S2) µ = (µ1,µ2) ᵀ, (S3) Σ = ( σ11 σ12 σ12 σ22 ) , and (S4) S = ( s11 0 0 s22 ) . (S5) To interpret these terms, µ1/K and µ2/K are the allele frequencies at each locus, σ11 and σ22 are the variances of the genotypes at each locus, s11 and s22 are the variances of the genotyping errors at each locus, and σ12 is covariance between genotypes. By elementary methods, we have the well-known result that, marginally, Zi ∼ N2(µ, Σ + S). (S6) We assume the user has provided posterior moments on the genotypes XiA = E[GiA|ZiA],XiB = E[GiB|ZiB],YiA = var(GiA|ZiA), and YiB = var(GiB|ZiB). (S7) These posterior moments are marginal in that they only condition on either ZiA or ZiB, but not both. Thus, we assume they are well-approximated by the model GiA ∼ N(µ1,σ11) (S8) ZiA|GiA ∼ N(GiA,s11) (S9) GiB ∼ N(µ2,σ22) (S10) ZiB|GiB ∼ N(GiB,s22). (S11) By standard methods, this results in GiA|ZiA ∼ N [( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 ZiA ) , ( 1 σ11 + 1 s11 )−1] , and (S12) GiB|ZiB ∼ N [( 1 σ22 + 1 s22 )−1 ( 1 σ22 µ2 + 1 s22 ZiB ) , ( 1 σ22 + 1 s22 )−1] . (S13) 9 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ Treating only Zi as random from distribution (S6), we have uxA ≈ E [( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 ZiA )] (S14) = ( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 E[ZiA] ) (S15) = ( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 µ1 ) (S16) = µ1. (S17) Similarly, uxB ≈ µ2. (S18) Furthermore, vxA ≈ var [( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 ZiA )] (S19) = ( 1 σ11 + 1 s11 )−2 1 s211 var(ZiA) (S20) = ( 1 σ11 + 1 s11 )−2 σ11 + s11 s211 (S21) = ( 1 σ11 + 1 s11 )−1 σ11 s11 . (S22) Similarly, vxB ≈ ( 1 σ22 + 1 s22 )−1 σ22 s22 . (S23) Now, using the posterior variances, we have uyA ≈ ( 1 σ11 + 1 s11 )−1 , and (S24) uyB ≈ ( 1 σ22 + 1 s22 )−1 . (S25) 10 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ Finally, the expectation of the sample covariance of posterior means is cx ≈ cov [( 1 σ11 + 1 s11 )−1 ( 1 σ11 µ1 + 1 s11 ZiA ) , ( 1 σ22 + 1 s22 )−1 ( 1 σ22 µ2 + 1 s22 ZiB )] (S26) = ( 1 σ11 + 1 s11 )−1 ( 1 σ22 + 1 s22 )−1 1 s11 1 s22 cov(ZiA,ZiB) (S27) = ( 1 σ11 + 1 s11 )−1 ( 1 σ22 + 1 s22 )−1 1 s11 1 s22 σ12. (S28) Using a method-of-moments approach, we now have a system of five equations and five un- knowns: vxA = ( 1 σ11 + 1 s11 )−1 σ11 s11 , (S29) vxB = ( 1 σ22 + 1 s22 )−1 σ22 s22 , (S30) uyA = ( 1 σ11 + 1 s11 )−1 , (S31) uyB = ( 1 σ22 + 1 s22 )−1 , and (S32) cx = ( 1 σ11 + 1 s11 )−1 ( 1 σ22 + 1 s22 )−1 1 s11 1 s22 σ12. (S33) Solving for s11, s22, σ11, σ22, and σ12, we obtain: ŝ11 = uyA(uyA + vxA) vxA (S34) ŝ22 = uyB(uyB + vxB) vxB (S35) σ̂11 = uyA + vxA (S36) σ̂22 = uyB + vxB (S37) σ̂12 = uyA + vxA vxA uyB + vxB vxB cx. (S38) Using (S14)–(S18), we also have µ̂1 = uxA, and (S39) µ̂2 = uxB. (S40) The LD coefficient estimates (5)–(7) can be obtained by substituting in parameter estimates in 11 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ the following equations [Gerard, 2020] ∆ = σ12/K, (S41) ρ = σ12/ √ σ11σ22, and (S42) ∆′ = ∆/∆m, where (S43) ∆m = { min{µ1µ2, (K −µ1)(K −µ2)}/K2 if ∆ < 0, and min{µ1(K −µ2), (K −µ1)µ2}/K2 if ∆ > 0. (S44) S2 Derivation of standard errors Let Mi := (XiA,X 2 iA,XiB,X 2 iB,XiAXiB,YiA,YiB) ᵀ. (S45) Then, by the central limit theorem, we have for M̄ := 1 n n∑ i=1 Mi, (S46) that √ nM̄ is asymptotically multivariate normal with some limiting covariance, say, Ω. Finite variances are guaranteed by the finite support of the genotypes. We can estimate Ω with the sample covariance matrix Ω̂ := 1 n− 1 n∑ i=1 (Mi −M̄)(Mi −M̄)ᵀ. (S47) Estimators (5)–(7) are approximately functions of M̄. Namely ∆̂ ≈ ( M̄6 + M̄2 −M̄21 M̄2 −M̄21 )( M̄7 + M̄4 −M̄23 M̄4 −M̄23 )( M̄5 −M̄1M̄3 K ) (S48) ρ̂ ≈ (√ M̄6 + M̄2 −M̄21 M̄2 −M̄21 )(√ M̄7 + M̄4 −M̄23 M̄4 −M̄23 )( M̄5 −M̄1M̄3 ) (S49) ∆̂′ ≈ ( M̄6 + M̄2 −M̄21 M̄2 −M̄21 )( M̄7 + M̄4 −M̄23 M̄4 −M̄23 )( M̄5 −M̄1M̄3 K ) /∆̂m, where (S50) ∆̂m = { min{M̄1M̄3, (K −M̄1)(K −M̄3)}/K2 if M̄5 −M̄1M̄3 < 0, and min{M̄1(K −M̄3), (K −M̄1)M̄3}/K2 if M̄5 −M̄1M̄3 > 0. (S51) These are smooth functions of M̄ (except on a space of Lebesgue measure zero), and so admit the 12 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ following gradients, calculated in Mathematica [Wolfram Research, Inc., 2020]: g∆ := d∆̂ dM̄ =   −( M̄41 M̄3−2M̄1M̄5M̄6+M̄ 2 1 M̄3(−2M̄2+M̄6)+M̄2M̄3(M̄2+M̄6))(M̄ 2 3−M̄4−M̄7) K(M̄21−M̄2) 2 (M̄23−M̄4) (M̄1M̄3−M̄5)M̄6(M̄23−M̄4−M̄7) K(M̄21−M̄2) 2 (M̄23−M̄4) −( M̄21−M̄2−M̄6)(−2M̄3M̄5M̄7+M̄1(M̄ 4 3 +M̄ 2 3 (−2M̄4+M̄7)+M̄4(M̄4+M̄7))) K(M̄21−M̄2)(M̄ 2 3−M̄4) 2 (M̄1M̄3−M̄5)(M̄21−M̄2−M̄6)M̄7 K(M̄21−M̄2)(M̄ 2 3−M̄4) 2 (−M̄21 +M̄2+M̄6)(−M̄ 2 3 +M̄4+M̄7) K(M̄21−M̄2)(M̄ 2 3−M̄4) (−M̄1M̄3+M̄5)(−M̄23 +M̄4+M̄7) K(M̄21−M̄2)(M̄ 2 3−M̄4) (−M̄1M̄3+M̄5)(−M̄21 +M̄2+M̄6) K(M̄21−M̄2)(M̄ 2 3−M̄4)   , (S52) gρ := dρ̂ dM̄ =   (M̄31 M̄5+M̄ 2 1 M̄3(−M̄2+M̄6)+M̄2M̄3(M̄2+M̄6)−M̄1M̄5(M̄2+2M̄6)) √ −M̄23 +M̄4+M̄7 (M̄21−M̄2) 2 (M̄23−M̄4) √ −M̄21 +M̄2+M̄6 (M̄1M̄3−M̄5)(M̄21−M̄2−2M̄6) √ −M̄23 +M̄4+M̄7 2(M̄21−M̄2) 2 (M̄23−M̄4) √ −M̄21 +M̄2+M̄6 − √ −M̄21 +M̄2+M̄6(M̄1M̄ 2 3 (M̄4−M̄7)−M̄1M̄4(M̄4+M̄7)+M̄3M̄5(−M̄ 2 3 +M̄4+2M̄7)) (M̄21−M̄2)(M̄ 2 3−M̄4) 2√ −M̄23 +M̄4+M̄7 (M̄1M̄3−M̄5) √ −M̄21 +M̄2+M̄6(M̄ 2 3−M̄4−2M̄7) 2(M̄21−M̄2)(M̄ 2 3−M̄4) 2√ −M̄23 +M̄4+M̄7√ −M̄21 +M̄2+M̄6 √ −M̄23 +M̄4+M̄7 (M̄21−M̄2)(M̄ 2 3−M̄4) (−M̄1M̄3+M̄5) √ −M̄23 +M̄4+M̄7 2(M̄21−M̄2)(M̄ 2 3−M̄4) √ −M̄21 +M̄2+M̄6 (−M̄1M̄3+M̄5) √ −M̄21 +M̄2+M̄6 2(M̄21−M̄2)(M̄ 2 3−M̄4) √ −M̄23 +M̄4+M̄7   , (S53) 13 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ and g∆′ := d∆̂ dM̄ = g∆/∆̂m −A, where (S54) A =   ∆̂C1(M̄1,M̄3,M̄5)/∆̂ 2 m 0 ∆̂C3(M̄1,M̄3,M̄5)/∆̂ 2 m 0 0 0 0   (S55) C1(M̄1,M̄3,M̄5) =   M̄3/K 2 if M̄5 < M̄1M̄3 and M̄1M̄3 < (K −M̄1)(K −M̄3) −(K −M̄3)/K2 if M̄5 < M̄1M̄3 and M̄1M̄3 > (K −M̄1)(K −M̄3) −M̄3/K2 if M̄5 > M̄1M̄3 and M̄1(K −M̄3) > (K −M̄1)M̄3 (K −M̄3)/K2 if M̄5 > M̄1M̄3 and M̄1(K −M̄3) < (K −M̄1)M̄3 (S56) C3(M̄1,M̄3,M̄5) =   M̄1/K 2 if M̄5 < M̄1M̄3 and M̄1M̄3 < (K −M̄1)(K −M̄3) −(K −M̄1)/K2 if M̄5 < M̄1M̄3 and M̄1M̄3 > (K −M̄1)(K −M̄3) (K −M̄1)/K2 if M̄5 > M̄1M̄3 and M̄1(K −M̄3) > (K −M̄1)M̄3 −M̄1/K2 if M̄5 > M̄1M̄3 and M̄1(K −M̄3) < (K −M̄1)M̄3 (S57) Though these gradients are rather complicated, they are not computationally intensive and may be calculated in constant time in the sample size. The asymptotic variances of ∆̂, ρ̂, and ∆̂′ are 1 n g ᵀ ∆Ω̂g∆, 1 n gᵀρΩ̂gρ, and 1 n g ᵀ ∆′ Ω̂g∆′, (S58) respectively. To accommodate missing data, we use only pairwise complete observations for the sample covariance matrix (S47). This ensures that Ω̂ is positive definite and, thus, the resulting stan- dard errors are non-negative. However, we use all non-missing observations for M̄. That is, let 14 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ ΘA, ΘB ⊆{1, 2, . . . ,n} be the index sets of non-missing values at loci A and B, respectively. Then M̄1 = 1 |ΘA| ∑ i∈ΘA XiA (S59) M̄2 = 1 |ΘA| ∑ i∈ΘA X2iA (S60) M̄3 = 1 |ΘB| ∑ i∈ΘB XiB (S61) M̄4 = 1 |ΘB| ∑ i∈ΘB X2iB (S62) M̄5 = 1 |ΘA ∩ ΘB| ∑ i∈ΘA∩ΘB XiAXiB (S63) M̄6 = 1 |ΘA| ∑ i∈ΘA YiA (S64) M̄7 = 1 |ΘB| ∑ i∈ΘB YiB (S65) M̄ ∗ = 1 |ΘA ∩ ΘB| ∑ i∈ΘA∩ΘB Mi (S66) Ω̂ = 1 |ΘA ∩ ΘB|− 1 ∑ i∈ΘA∩ΘB (Mi −M̄ ∗ )(Mi −M̄ ∗ )ᵀ (S67) The asymptotic variances of ∆̂, ρ̂, and ∆̂′ are then 1 |ΘA ∩ ΘB| g ᵀ ∆Ω̂g∆, 1 |ΘA ∩ ΘB| gᵀρΩ̂gρ, and 1 |ΘA ∩ ΘB| g ᵀ ∆′ Ω̂g∆′, (S68) respectively. S3 Adjusting the reliability ratios S3.1 Adaptive shrinkage on the reliability ratios Each SNP has an estimated reliability ratio, bj := uyj + vxj vxj , (S69) which corresponds to the multiplicative adjustment to all LD estimates that include that SNP (see (5)). These reliability ratios might have high variance due to (i) lower sequencing depth or (ii) containing fewer individuals with non-missing data. Thus, some reliability ratios may be noisy. Hierarchical shrinkage is a statistical technique that allows high-variance observations to borrow strength from low-variance observations and thus improve estimation performance. Adap- tive shrinkage (ash) [Stephens, 2016] is a recently proposed general-purpose hierarchical shrinkage 15 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ technique that we can use to model the distribution of reliability ratios flexibly, only constraining them to be unimodal. In this section, we will use ash to improve our reliability ratio estimates. We will now describe the procedure for applying ash to shrink the reliability ratios. Our strategy will be to derive the standard errors for the log of the reliability ratios (S69) and apply ash on the log-scale using these standard errors. To begin, let Xij be the posterior mean for individual i at SNP j. Let Yij be the posterior variance for individual i at SNP j. Finally, let Mij = (Xij,X 2 ij,Yij), (S70) M̄j = 1 n n∑ i=1 Mij, so (S71) M̄j1 = 1 n n∑ i=1 Xij, (S72) M̄j2 = 1 n n∑ i=1 X2ij, and (S73) M̄j3 = 1 n n∑ i=1 Yij. (S74) Then the log of the reliability ratio for SNP j is Lj := log ( M̄j3 + M̄j2 −M̄2j1 M̄j2 −M̄2j1 ) (S75) = log(M̄j3 + M̄j2 −M̄2j1) − log(M̄j2 −M̄ 2 j1). (S76) Let the sample covariance be Ω̂j := 1 n− 1 n∑ i=1 (Mij −M̄j)(Mij −M̄j)ᵀ. (S77) Then we have by the central limit theorem that √ nM̄j is asymptotically multivariate normal, and we can use Ω̂j as the estimate of the covariance matrix. The gradients for (S75) are gj1 := dLj dM̄j1 = −2M̄j1 M̄j3 + M̄j2 −M̄2j1 + 2M̄j1 M̄j2 −M̄2j1 (S78) gj2 := dLj dM̄j2 = 1 M̄j3 + M̄j2 −M̄2j1 − 1 M̄j2 −M̄2j1 (S79) gj3 := dLj dM̄j3 = 1 M̄j3 + M̄j2 −M̄2j1 (S80) Then, with gj := (gj1,gj2,gj3) ᵀ, the variance for Lj is ŝ2j := 1 n g ᵀ j Ω̂jgj. (S81) 16 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ We apply ash to (L1, ŝ1), . . . , (Lm, ŝm) to obtain shrunken log reliability ratios L̂1, . . . , L̂m. Because ash’s grid-based scheme for estimating the mode is not the most computationally efficient, we used the half-sample mode estimator of Robertson and Cryer [1974] prior to running ash. This procedure seems to result in improved performance for SNPs with unusually variable reli- ability ratios (Figure S1). S3.2 Thresholding the reliability ratios If a researcher accidentally provides a monoallelic SNP, its reliability ratio could explode due to having a denominator close to zero in (S69). For example, the right panel of Figure S2 contains a monoallelic SNP (PotVar0080327) whose reliability ratio estimate (S69) is 100.92. This can provide unstable estimates of LD as some SNPs will, due to sampling variability, have correlations with these monoallelic SNPs on the order of 0.01. For example, the sample correlation between posterior means of PotVar0080327 and PotVar0078678 (left facet of Figure S2) -0.0098. But due to the extreme reliability ratio of PotVar0080327, the genotype-error adjusted correlation estimate is -1. This is, of course, unsettling. So by default, our software will take all reliability ratio estimates (S69) above a user-provided value (default of 10) and assign these to have reliability ratios of the median reliability ratio in the dataset. S4 Genome-wide Association Studies In this section, we demonstrate that the techniques used in Section S1, when applied to simple linear regression with an additive effects model [Rosyara et al., 2016], result in the standard ordinary least squares estimate when using the posterior mean as a covariate. This indicates that for genome-wide association studies, using the posterior mean is appropriate in a linear regression context when using an additive model for gene action. Let Gi be the genotype for individual i at a locus. Let Zi be the data that lead to the genotyping for individual i at the same locus. Let Wi be some quantitative trait of interest for individual i. Then we let Wi|Gi ∼ N(β0 + β1Gi,σ2) (S82) Zi|Gi ∼ N(Gi,s2) (S83) Gi ∼ N(µ,τ2). (S84) We suppose the user is only provided the posterior means and variances of each Gi|Zi. Let Xi = E[Gi|Zi] and Yi = var(Gi|Zi). From elementary methods, we have Zi ∼ N(µ,s2 + τ2) (S85) Gi|Zi ∼ N [( 1 τ2 + 1 s2 )−1 ( 1 τ2 µ + 1 s2 Zi ) , ( 1 τ2 + 1 s2 )−1] . (S86) 17 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ Let uw = 1 n n∑ i=1 Wi (S87) ux = 1 n n∑ i=1 Xi (S88) cxw = 1 n− 1 n∑ i=1 (Wi −uw)(Xi −ux) (S89) vx = 1 n− 1 n∑ i=1 (Xi −ux)2 (S90) vw = 1 n− 1 n∑ i=1 (Wi −uw)2. (S91) We have that cwx ≈ cov(Wi,Xi) (S92) ≈ cov ( Wi, ( 1 τ2 + 1 s2 )−1 ( 1 τ2 µ + 1 s2 Zi )) (S93) = ( 1 τ2 + 1 s2 )−1 1 s2 cov(Wi,Zi) (S94) = ( 1 τ2 + 1 s2 )−1 1 s2 β1 var(Gi) (S95) = ( 1 τ2 + 1 s2 )−1 τ2 s2 β1. (S96) We also have from (S19)–(S22) that vx ≈ ( 1 τ2 + 1 s2 )−1 τ2 s2 . (S97) Using method of moments with equations (S96) and (S97), we have the following estimator for β1 β̂1 = cwx/vx (S98) = cwx√ vxvw √ vw√ vx . (S99) Equation (S99) is the sample correlation between the Wi’s and the Xi’s (cwx/ √ vxvw) multiplied by the ratio of the sample standard deviations of the Wi’s and the Xi’s ( √ vw/ √ vx). This is the well-known formula for the ordinary least squares estimate of β1 from a regression of Wi on Xi. 18 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ S5 Supplementary figures 0.0 0.1 0.2 0.3 0.0 0.2 0.4 0.6 Log of Reliability Ratio E st im a te d S ta n d a rd E rr o r (A) 0 25 50 75 100 125 0 25 50 75 100 125 Alternative Counts R e fe re n ce C o u n ts (B) 0 40 80 120 0 40 80 120 Alternative Counts R e fe re n ce C o u n ts (C) 1.00 1.25 1.50 1.75 2.00 1.00 1.25 1.50 1.75 2.00 Raw Reliability Ratio S h ru n ke n R e lia b ili ty R a tio (D) Figure S1: (A) The log of the reliability ratios (x-axis) versus their estimated standard errors (y-axis). The two highlighted points do not seem to fit the trend. When we plot the read-counts for these highlighted points ((B) and (C)), we notice that these two SNPs are almost monoallelic, providing doubts on their unusually large reliability ratios. We plot the shrunken reliability ratios (y-axis) against their original values (x-axis) in (D), noting that the problem SNPs (color) have their reliability ratios highly adjusted. 19 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ PotVar0078678 PotVar0080327 0 50 100 150 200 0 50 100 150 200 0 50 100 150 200 Alternative Counts R e fe re n ce C o u n ts Figure S2: Plots of read-counts of two SNPs (facets) from Uitdewilligen et al. [2013]. Alternative counts lie on the x-axis and reference counts lie on the y-axis. The right SNP is monoallelic and because of this the estimated correlation between the two SNPs using raw reliability ratios is -1, even though the sample correlation between posterior means is only -0.0098. 20 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ References A. Brown. Sample sizes required to detect linkage disequilibrium between two or three loci. Theoreti- cal Population Biology, 8(2):184 – 201, 1975. ISSN 0040-5809. doi: 10.1016/0040-5809(75)90031-3. R. L. Carter and W. A. Fuller. Instrumental variable estimation of the simple errors-in- variables model. Journal of the American Statistical Association, 75(371):687–692, 1980. doi: 10.1080/01621459.1980.10477534. J. S. Degracie and W. A. Fuller. Estimation of the slope and analysis of covariance when the concomitant variable is measured with error. Journal of the American Statistical Association, 67 (340):930–937, 1972. doi: 10.1080/01621459.1972.10481321. K. K. Dey and M. Stephens. CorShrink: Empirical Bayes shrinkage estimation of correlations, with applications. bioRxiv, 2018. doi: 10.1101/368316. E. A. Fox, A. E. Wright, M. Fumagalli, and F. G. Vieira. ngsLD: evaluating linkage disequilibrium using genotype likelihoods. Bioinformatics, 35(19):3855–3856, 03 2019. ISSN 1367-4803. doi: 10.1093/bioinformatics/btz200. W. A. Fuller. Measurement error models. John Wiley & Sons, 2009. D. Gerard. Pairwise linkage disequilibrium estimation for polyploids. bioRxiv, 2020. doi: 10.1101/2020.08.03.234476. D. Gerard and L. F. V. Ferrão. Priors for genotyping polyploids. Bioinformatics, 36(6):1795–1800, 11 2019. ISSN 1367-4803. doi: 10.1093/bioinformatics/btz852. bioRxiv: 751784. D. Gerard, L. F. V. Ferrão, A. A. F. Garcia, and M. Stephens. Genotyping polyploids from messy sequencing data. Genetics, 210(3):789–807, 2018. ISSN 0016-6731. doi: 10.1534/ge- netics.118.301468. W. Hill and A. Robertson. Linkage disequilibrium in finite populations. Theoretical and applied genetics, 38(6):226–231, 1968. doi: 10.1007/BF01245622. T. C. Koopmans. Linear regression analysis of economic time series, volume 20. De erven F. Bohn nv, 1937. R. Lewontin. The interaction of selection and linkage. i. general considerations; heterotic models. Genetics, 49(1):49, 1964. URL https://www.genetics.org/content/49/1/49. R. C. Lewontin and K.-i. Kojima. The evolutionary dynamics of complex polymorphisms. Evolution, 14(4):458–472, 1960. doi: 10.1111/j.1558-5646.1960.tb03113.x. T. Maruki and M. Lynch. Genome-wide estimation of linkage disequilibrium from population- level high-throughput sequencing data. Genetics, 197(4):1303–1313, 2014. ISSN 0016-6731. doi: 10.1534/genetics.114.165514. M. Pal. Consistent moment estimators of regression coefficients in the presence of errors in vari- ables. Journal of Econometrics, 14(3):349 – 364, 1980. ISSN 0304-4076. doi: 10.1016/0304- 4076(80)90032-9. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL https://www.R-project.org/. T. Robertson and J. D. Cryer. An iterative procedure for estimating the mode. Journal of the Amer- ican Statistical Association, 69(348):1012–1016, 1974. doi: 10.1080/01621459.1974.10480246. U. R. Rosyara, W. S. De Jong, D. S. Douches, and J. B. Endelman. Software for genome-wide association studies in autopolyploids and its application to potato. The Plant Genome, 9(2), 2016. doi: 10.3835/plantgenome2015.08.0037. M. Slatkin. Linkage disequilibrium-understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6):477, 2008. doi: 10.1038/nrg2361. 21 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint http://doi.org/10.1016/0040-5809(75)90031-3 http://doi.org/10.1080/01621459.1980.10477534 http://doi.org/10.1080/01621459.1980.10477534 http://doi.org/10.1080/01621459.1972.10481321 http://doi.org/10.1101/368316 http://doi.org/10.1093/bioinformatics/btz200 http://doi.org/10.1093/bioinformatics/btz200 http://doi.org/10.1101/2020.08.03.234476 http://doi.org/10.1101/2020.08.03.234476 http://doi.org/10.1093/bioinformatics/btz852 http://doi.org/10.1534/genetics.118.301468 http://doi.org/10.1534/genetics.118.301468 http://doi.org/10.1007/BF01245622 https://www.genetics.org/content/49/1/49 http://doi.org/10.1111/j.1558-5646.1960.tb03113.x http://doi.org/10.1534/genetics.114.165514 http://doi.org/10.1534/genetics.114.165514 http://doi.org/10.1016/0304-4076(80)90032-9 http://doi.org/10.1016/0304-4076(80)90032-9 https://www.R-project.org/ http://doi.org/10.1080/01621459.1974.10480246 http://doi.org/10.3835/plantgenome2015.08.0037 http://doi.org/10.1038/nrg2361 https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ C. Spearman. The proof and measurement of association between two things. The American journal of psychology, 15(1):72–101, 1904. doi: 10.2307/1422689. M. Stephens. False discovery rates: a new deal. Biostatistics, 18(2):275–294, 10 2016. ISSN 1465- 4644. doi: 10.1093/biostatistics/kxw041. J. A. Sved and W. G. Hill. One hundred years of linkage disequilibrium. Genetics, 209(3):629–636, 2018. ISSN 0016-6731. doi: 10.1534/genetics.118.300642. J. G. A. M. L. Uitdewilligen, A.-M. A. Wolters, B. B. D’hoop, T. J. A. Borm, R. G. F. Visser, and H. J. van Eck. A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato. PLOS ONE, 8(5):1–14, 05 2013. doi: 10.1371/jour- nal.pone.0062355. X. Wen and M. Stephens. Using linear predictors to impute allele frequencies from summary or pooled genotype data. The annals of applied statistics, 4(3):1158–1182, 2010. ISSN 1932-6157. doi: 10.1214/10-aoas338. Y. C. J. Wientjes, R. F. Veerkamp, and M. P. L. Calus. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics, 193(2):621–631, 2013. ISSN 0016-6731. doi: 10.1534/genetics.112.146290. Wolfram Research, Inc. Mathematica, Version 12.2, 2020. URL https://www.wolfram.com/ mathematica. Champaign, IL. X. Zhu and M. Stephens. Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nature communications, 9(1):1–14, 2018. doi: 10.1038/s41467-018-06805-x. 22 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430270doi: bioRxiv preprint http://doi.org/10.2307/1422689 http://doi.org/10.1093/biostatistics/kxw041 http://doi.org/10.1534/genetics.118.300642 http://doi.org/10.1371/journal.pone.0062355 http://doi.org/10.1371/journal.pone.0062355 http://doi.org/10.1214/10-aoas338 http://doi.org/10.1534/genetics.112.146290 https://www.wolfram.com/mathematica https://www.wolfram.com/mathematica http://doi.org/10.1038/s41467-018-06805-x http://doi.org/10.1038/s41467-018-06805-x https://doi.org/10.1101/2021.02.08.430270 http://creativecommons.org/licenses/by-nc-nd/4.0/ Introduction Methods Results Simulations LD estimates for Solanum tuberosum Discussion Figures Derivation of LD estimators Derivation of standard errors Adjusting the reliability ratios Adaptive shrinkage on the reliability ratios Thresholding the reliability ratios Genome-wide Association Studies Supplementary figures