Bayesian estimates of linkage disequilibrium
- Paola Sebastiani^{1}Email author and
- María M Abad-Grau^{2}Email author
https://doi.org/10.1186/1471-2156-8-36
© Sebastiani and Abad-Grau; licensee BioMed Central Ltd. 2007
Received: 04 February 2007
Accepted: 25 June 2007
Published: 25 June 2007
Abstract
Background
The maximum likelihood estimator of D' – a standard measure of linkage disequilibrium – is biased toward disequilibrium, and the bias is particularly evident in small samples and rare haplotypes.
Results
This paper proposes a Bayesian estimation of D' to address this problem. The reduction of the bias is achieved by using a prior distribution on the pair-wise associations between single nucleotide polymorphisms (SNP)s that increases the likelihood of equilibrium with increasing physical distances between pairs of SNPs. We show how to compute the Bayesian estimate using a stochastic estimation based on MCMC methods, and also propose a numerical approximation to the Bayesian estimates that can be used to estimate patterns of LD in large datasets of SNPs.
Conclusion
Our Bayesian estimator of D' corrects the bias toward disequilibrium that affects the maximum likelihood estimator. A consequence of this feature is a more objective view about the extent of linkage disequilibrium in the human genome, and a more realistic number of tagging SNPs to fully exploit the power of genome wide association studies.
Keywords
Background
Single nucleotide polymorphisms (SNPs) are an invaluable resource to identify regions of the human genome that may be associated with disease. A key to this process is linkage disequilibrium (LD) that is defined as the non-random association between the alleles of SNPs [1]. Although LD may occur between SNPs that are not in linkage but are associated, we will focus on the LD due to the spatial structure of the genome. In this situation, the non-random association implies that pairs of alleles in the same haplotype occur differently from what we would expect in a random pairing and several measures of LD have been proposed to capture the departure from independent pairing of the alleles of SNPs [2].
In this paper we will limit attention to D, its normalized version D', and the well known bias of the Maximum Likelihood Estimate (MLE) of D' toward disequilibrium [2, 3]. This bias is particularly large in small samples and SNPs with rare alleles to the point that SNPs whose alleles occur independently may be inferred to be in strong LD [4]. However, relying on small samples to identify patterns of LD is not unusual: for example, the International HapMap Project (IHMP) aims to establish genome-wide patterns of LD using genotype data of at most 30 trios or 45 unrelated individuals [5]. The genotype data typed in this small number of samples are used to describe the extent of LD in the human genome, and derive a map of the haplotypes and the SNPs that are sufficient to tag the human genome. These results will have a deep impact on genome wide association studies and, in particular, inferring larger blocks of LD than the real ones may lead to the selection of an insufficient number of SNPs and hence decrease the power of genome wide association studies. In this scenario, biasing the estimate of LD toward equilibrium appears to be a safer alternative.
Several solutions have been proposed to reduce the bias of the MLE of D' toward disequilibrium [4]. A pragmatic solution is to impose some "ad hoc" threshold on the minimum allele frequency (MAF) of those SNPs that can be used to infer the pattern of LD [6]. Imposing this threshold leads to a non-random selection of SNPs and may introduce ascertainment bias [7, 8]. The thought behind our approach is that the bias of the MLEs of D and D' is due to the lack of information in the data to discriminate between equilibrium and different magnitude of disequilibrium, and any attempt to correct this bias is due to fail as it was acknowledged in [3]. However, it is known that, on average, the strength of LD due to linkage decreases as the physical distance between SNPs increases [9, 10]. Therefore, we propose a Bayesian estimator of D' that allows us to integrate data with prior information about the pattern of LD decay. To this end, we use a prior distribution on the pairwise dependencies between different SNPs that is a decreasing function of their physical distance. We show how to compute the posterior estimate of D' using Markov Chain Monte Carlo methods, and provide a numerical approximation that can be used for fast estimation of LD in large regions of the human genome. As we show in simulated and real data from the IHMP, the effect of the prior distribution is to drastically reduce the bias toward disequilibrium even in small samples, and to remove the need of arbitrary thresholds on the MAF. We also show that, compared to the MLE, our estimators lead to infer patterns of LD decay that are much closer to published results [10], and confirms the existence of haplotype blocks as regions of low recombination. The method is implemented in a computer program called Bayesian Linkage (BLink) [11].
Results and discussion
The traditional D and D'
Given two SNPs L_{1} and L_{2}, with alleles A/a and B/b, A and B the major alleles, we define the probability of the haplotype ij by p_{ ij }= _{ p }(L_{1} = i, L_{2} = j), i = A, a, j = B, b. As in [12], we assume the relation p_{ A }≥ p_{ B }on the probabilities p_{ A }= p(L_{1} = A), p_{ B }= p(L_{2} = B) from which the inequality p_{ Ab }≥ p_{ aB }follows.
The two SNPs are in linkage equilibrium when the co-occurrence of two alleles on the same haplotype is random, e.g. p_{ ij }= p_{ i }p_{ j }for all i = A, a, j = B, b. On the other hand, LD implies some form of dependency in the alleles on the same haplotype and hence departure from independence of the probabilities p_{ ij }. Although there are many ways to measure departure from independence in a 2 × 2 table [13], a widely used measure of LD is the parameter D defined by
D = p_{ AB }- p_{ A }p_{ B }- p_{ a }p_{ b }≤ D ≤ p_{ a }p_{ B }.
${{D}^{\prime}}_{s}$ is defined in the interval [-1, 1], with ${{D}^{\prime}}_{s}$ = ± 1 describing perfect disequilibrium and ${{D}^{\prime}}_{s}$ = 0 describing equilibrium. It is also common to take the absolute value of ${{D}^{\prime}}_{s}$, say D' = |${{D}^{\prime}}_{s}$| to have a measure in the interval [0,1]. This is for example the default measure of LD in the popular program HaploView [6].
Maximum likelihood estimation
where I(x ∈ X) is the indicator function defined as I(x ∈ X) = 1 if x ∈ X and 0 otherwise. Note that:
• $\widehat{D}=-{\widehat{p}}_{a}{\widehat{p}}_{b}$ whenever n_{ ab }= 0, so that ${{\widehat{D}}^{\prime}}_{s}$ = -1 and ${\widehat{D}}^{\prime}$ = 1;
• $\widehat{D}={\widehat{p}}_{a}{\widehat{p}}_{B}$ whenever n_{ aB }= 0, so that ${{\widehat{D}}^{\prime}}_{s}$ = 1 and ${\widehat{D}}^{\prime}$ = 1.
These two facts determine the bias toward disequilibrium of ${{D}^{\prime}}_{s}$ and D' that can lead to infer that two SNPs are in disequilibrium when they are actually in equilibrium. For example, the expected number of haplotypes ab in a sample of n haplotypes between two SNPs in equilibrium is np_{ a }p_{ b }. If both p_{ a }and p_{ b }are smaller than 0.1, then the minimum sample size n that yields an expected number of haplotypes ab greater than 1 is 100, and this number goes up to 400 when both p_{ a }and p_{ b }are 0.05. Therefore, data from small samples and rare alleles do not provide information to discriminate between equilibrium and disequilibrium, while the MLE of D' returns its maximum value consistent with perfect disequilibrium. This situation is well known and supported by simulation studies. Teare and coauthors [4] showed the extent of this bias through extensive simulations in which they examined the effect of sample size, MAF and strength of LD on the MLE of D'. Their study suggested that the bias is severe in small samples (less than 100 subjects), when both alleles are rare (MAF less than 0.05), and the two SNPs are in equilibrium. An "ad hoc" solution consists of disregarding those SNPs with a MAF below some threshold. Although this reduces the bias of the MLEs, it introduces an ascertainment bias that may impact the inferred pattern of LD [7].
Bayesian approach
in which the prior hyper-parameters α_{ ij }are updated into α_{ ij }+ n_{ ij }. The prior means of the parameters p_{ ij }are α_{ ij }/α_{ T }, where α_{ T }= ∑_{ ij }α_{ ij }. The posterior means become E(p_{ ij }|n) = (α_{ ij }+ n_{ ij })/(α_{ T }+ n) and can be used as point estimates of the parameters. Furthermore, the posterior distributions of the marginal probabilities p_{ A }and p_{ B }follow Beta distributions with hyper-parameters (α_{ A }+ n_{ A }, α_{ a }+ n_{ a }) and (α_{ B }+ n_{ B }, α_{ b }+ n_{ b }), for α_{ A }= α_{ AB }+ α_{ Ab }, α_{ a }= α_{ T }- α_{ A }, α_{ B }= α_{ AB }+ α_{ aB }, and α_{ b }= α_{ T }- α_{ B }.
The inference on the parameters D, D_{ s }and D' is more complex. First, we note that we can write these parameters as follows:
Choice of the prior distribution
To complete the specification of the Bayesian model, we need to provide values for the hyper-parameters. Because we wish to encode the information that departure from equilibrium of any two SNPs is a function of their physical distance, we define:
α_{ ij }= α(1 - exp(-d))
and, hence, as a weighted average of n_{ ij }/n (the MLE estimates of p_{ ij }) and the prior probabilities 1/4. The first weight n/(n + 4α(1 - exp(-d))) is an increasing function of the sample size n, and a decreasing function of α and d, while the second weight 4α(1 - exp(-d))/(n + 4α(1 - exp(-d))) is an increasing function of α and d, and a decreasing function of n. Therefore, for large sample sizes, the posterior means of p_{ ij }approach the MLEs. This is consistent with the fact that, in large samples, the effect of the prior distribution on the posterior distribution becomes negligible. However, when the distance d decreases, the function 1 - exp(-d) approaches 0, and the weight n/(n + 4α(1 - exp(-d))) approaches 1, so that the Bayesian estimate becomes closer to the MLE. In the limiting case d = 0, or α = 0, the two estimates are identical. For fixed α and increasing distance (essentially d > 0.5Mb), the second weight approaches its maximum value 4α/(n + 4α), and larger values of α further increase the weight given to the prior mean. To contain the effect of the prior distribution, we use α = 1 and simulation studies that are described in the next section show that this choice produces a good trade-off between robustness and bias.
Data are derived from the 30 trios of the CEPH population. Haplotype frequencies between two SNPs S1 and S2 in chromosome 22.
SNP | SNP | S1 | |
---|---|---|---|
S2 | B | b | |
A | 99 | 17 | 116 |
a | 4 | 0 | 4 |
103 | 17 | 120 |
Approximate estimates
The main source of error in this approximation is due to replacing the probability P(D ≥ 0) by the indicator function. When we are in a clear situation of disequilibrium, the probability of the event (D < 0) is almost 0 or 1, and the approximate posterior expectation of D_{ s }and D' approaches the exact values. When p(D < 0) is far from 0 and 1, then the error increases and biases the estimates toward disequilibrium. This is consistent with the fact that the approximation is close to the MLE and therefore suffers of some bias toward disequilibrium. However, we will show with results of simulations in the next section that this bias is smaller. Because of this similarity with the MLE, we will refer to these approximate estimates as the maximum a posteriori (MAP).
Unknown phase
When the genotype data are unphased, the ML estimation uses the EM algorithm to infer the unknown phase given the distribution of known haplotypes [16]. We adopt the same procedure for the calculation of the MAP estimates. Given the frequencies of known haplotypes, n_{ ij }, i = A, a, j = B, b, the algorithm first computes the MAP estimates of the haplotype frequencies p_{ ij }, and then alternates an expectation step to replace the unphased haplotypes by their expected phase and a maximization step to compute the MAP estimates using observed and expected haplotypes. The algorithm typically converges in less than 4 steps. Unknown haplotypes are regarded as missing values in the stochastic analysis, so that they become parameters of the model and are estimated within the Gibbs sampling algorithm. We also note that, when the genotype data are from trios, we use all phased haplotypes to compute the initial frequencies, regardless of whether they are transmitted from parents to offspring.
The method is implemented in the computer program BLink that is developed in C++ and is available from the supplementary web site [11]. The software accepts genotype data from either unrelated individuals or nuclear families consisting of two parents and one child.
Evaluation
We examined the performance of the Bayesian estimator in three groups of simulated data and a real data set derived from the IHMP. All data used in this evaluation are available from the supplementary web site [11].
Materials and methods
Group 1
The objectives of the first simulation study were (1) to compare the performance of the Bayesian estimates and the MLE for different sample sizes and small values of the MAF, and (2) to assess the accuracy of the MAP approximation to the stochastic estimates of D_{ s }and D'. We generated samples of 60, 120, 240 haplotypes, in which we modeled the true D' as D' = exp(-d) for a distance d ranging from 0 to 0.5 Mb. For each value of D' and each sample size, we generated 1,000 samples of haplotypes by using the joint probability of haplotypes defined by Equation (1), with p_{ B }generated from a uniform distribution in the interval [0.5; 0.9) and p_{ A }generated from a uniform distribution in the interval [p_{ B }; 0.95). In each simulated sample, we computed the MLE, and the MAP estimate of D_{ s }, as well as the stochastic estimate of D_{ s }using Gibbs sampling. To compute the stochastic estimates we run the chain for an initial burn-in of 1, 000 iterations and then based the inference on a second sample of 1, 000 iterations. We used as point estimate the median value of the simulated sample and α = 1 in each analysis.
Group 2
In this second set, we generated a sample of 1, 000 individuals in a region of 0.5 Mb with the program MS that simulates genotype data under a variety of neutral models [17]. We considered a population of 1 million individuals, a mutation rate of 10E - 9 per base pair, and a recombination rate of 8 × 10E - 9 between adjacent base pairs per generation. Only 10% of the 8080 SNPs in the sample of 1, 000 individuals were randomly selected and, from this sample, we randomly generated subsamples of sizes 60, 120, 240 and 480 haplotypes. In the absence of "true" values for D', we studied the decay of LD inferred by the MLE and the MAP estimator for increasing physical distances, versus the LD decay inferred in the original sample of 1, 000 individuals. Each point in the plot is the average estimate of D' for all the SNPs within a physical distance of d ± 0.01 Mb. By averaging the LD between pairs of SNPs at increasing distance, these plots are used to summarize the decay of LD over large regions [18, 10]. Ascertainment bias was assessed by repeating the analysis with these thresholds on the MAF: 0, 0.05, 0.1, 0.2. Sensitivity to the prior distribution was assessed by repeating the analysis for α = 0.25, 1, 2, 4.
Group 3
To examine the robustness of the MAP estimators, we also generated data under a different model of allele frequency, linkage disequilibrium and population differentiation that is implemented in the software COSI [19]. We simulated a sample of 1, 000 individuals under the calibrated model for the European population that considers bottlenecks, migration and recombination hotspots spacing 0.085 Mb [19]. We randomly selected 10% of the generated 32452 SNPs and from this sample we randomly selected subsamples of 60, 120, 240 and 480 haplotypes. We produced LD decay plots using the thresholds 0, 0.05, 0.1, 0.2 on the MAF and the range of α values that were used for the analysis of the simulated data in group 2.
Real data
Real data were obtained from the first phase of the IHMP [5]. We used genotype data of the 30 trios of the CEPH and Yoruba in Ibadan, Nigeria and chromosome 22 because its pattern of LD has been widely studied [9]. This chromosome was genotyped in 19120 and 19854 SNPs in the CEPH and Yoruba samples. We produced LD decay plots using the thresholds on the MAF and the range of α values that we used for the analysis of the simulated data in group 2. We also produced more informative graphical displays of pairwise LD, by generating bi-dimensional maps similar to those generated by the program Haploview, but with a lower resolution to enable the display of LD over larger regions. The maps were generated with the program BMapBuilder [15] using the MLE and the MAP estimate of D'.
Results
The LD decay plots in Figure 6 were generated from the data simulated with the program COSI with long range disequilibrium. The plot on the left displays the average LD computed with the MLE (dashed lines), and the MAP estimator with α = 1 (continuous lines), MAF> 0.05, and increasing sample sizes. The bias of the MLE of D' now works in favor of this estimator that is able to reproduce the larger values of D' compared to the MAP estimator. Although of smaller magnitude, the MAP estimate of D' remains consistent with long range disequilibrium. The plot in panel (b) shows the LD decay inferred with the MAP estimator using α = 0.25 (dashed line) and α = 4 (continuous line). The evident departure from the true pattern of LD when α = 4 again suggests choosing a small α value, say α = 1, to limit the bias toward equilibrium.
Figure 7 plots the LD decay generated with the ML and MAP estimators using the CEPH and Yoruba samples of the IHMP, different thresholds on the MAF, and different α values. The results confirm the bias toward disequilibrium of the MLE of D' – panels (a) and (c)- when all the SNPs are used in the analysis. The MAP estimator leads to more consistent results for a range of thresholds on the MAF, see panels (b) and (d). These results suggest that, even with a small sample size, we do not need to select SNPs based on the MAF, thus removing the issue of the ascertainment bias. Panels (e) and (f) display the LD decay in Chromosome 22, Yoruba samples, for α = 0.25 (e) and α = 4 (f). Values of α greater than 1 appear to bias the estimator toward equilibrium, while values of α smaller than 1 lead to a loss of robustness. Consistently with the analysis of simulated data, the choice α = 1 appears to achieve a good balance between robustness and bias reduction.
Conclusion
A good estimation of D' is crucial for a better understanding of patterns of LD, a robust identification of haplotype blocks, more accurate algorithms for haplotype reconstruction, and better reproducibility of genetic studies. The popular MLE of D' is biased toward disequilibrium, and requires the use of thresholds on the MAF that have been shown to introduce ascertainment bias. By using an informative prior that models the LD between SNPs based on their physical distance, we define a Bayesian estimator that outperforms the MLE without increasing computational complexity. Our estimator is slightly biased toward equilibrium, but this bias tends to disappear quickly with increasing sample sizes, and at a faster rate than the bias toward disequilibrium of the MLE. Furthermore, our evaluation shows that the MAP estimator does not require any thresholds on the MAF.
There are several limitations to this work. The probability distribution of the haplotypes is modelled using a multinomial distribution with a Dirichlet prior, and this assumption can be relaxed to include more general models. Also, the prior distribution does not take into account recombination hotspots. We have assessed the impact of this assumption in our simulations, but more evaluation is needed. Our analysis is now limited to biallelic SNPs, however our Bayesian model can be extended to include measures of LD for multi-allelic SNPs. For example, a first-order approximation of the average estimator of D' suggested in [2] can be computed by averaging the MAP estimates. Some more work is needed to examine the effect of the prior hyper-parameters. In future work we will extend our results to other measures of LD, particularly r^{2} = D/(p_{ A }p_{ a }p_{ B }p_{ b }). Some preliminary results that are posted in our supplementary web site suggest that a Bayesian estimator of r^{2} developed along the line of the estimator introduced in this paper would gain robustness.
Declarations
Acknowledgements
Research supported by NIH/NHLBI grant R21 HL080463-01, NIH/NIDDK 1R01DK069646-01A1 and the Spanish research program [projects TIN2004-06204-C03-02 and TIN2005-02516]. Comments of three anonymous reviewers were very helpful to improve the initial version of the manuscript.
Authors’ Affiliations
References
- Wall JD, Pritchard JK: Haplotype Blocks and Linkage Disequilibrium un the Human Genome. Genetics. 2003, 4: 587-597.PubMedGoogle Scholar
- Hedrick PW: Gametic disequilibrium measures: proceed with caution. Genetics. 1987, 117: 331-341.PubMed CentralPubMedGoogle Scholar
- Lewontin RC: On measures of gametic disequilibrium. Genetics. 1988, 120: 849-852.PubMed CentralPubMedGoogle Scholar
- Teare MD, Dunning AM, Durocher F, Rennart G, Easton DF: Sampling Distribution of Summary Linkage Disequilibrium measures. Annals of Human Genetics. 2002, 66: 223-233. 10.1046/j.1469-1809.2002.00108.x.View ArticlePubMedGoogle Scholar
- HapMap-Consortium TI: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.View ArticleGoogle Scholar
- Barrett JC, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-263. 10.1093/bioinformatics/bth457.View ArticlePubMedGoogle Scholar
- Nielsen R, Signorovitch J: Correcting for ascertainment biases when analyzing SNP data: applications to the estimation of linkage disequilibrium. Theoretical Population Biology. 2003, 63: 245-255. 10.1016/S0040-5809(03)00005-4.View ArticlePubMedGoogle Scholar
- Ronald J, Akey JM: Genome-wide scans for loci under selection in humans. Hum Genom. 2005, 2: 113-125.View ArticleGoogle Scholar
- Dawson E, Abecasis G, Bumpstead S, Chen Y, Hunt S, Beare DM, JP , et al: A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002, 418: 544-548. 10.1038/nature00864.View ArticlePubMedGoogle Scholar
- Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, Lavery T, Kouyoumjian R, Farhadian SF, Ward R, Lander ES: Linkage disequilibrium in the human genome. Nature. 2001, 411: 199-204. 10.1038/35075590.View ArticlePubMedGoogle Scholar
- Sebastiani P, Abad-Grau M: 2007, [http://bios.ugr.es/BLink/supplementary/]
- Morton NE, Zhang W, Taillon-Miller P, Ennis S, Kwok PY, Collins A: The optimal measure of allelic association. Proceedings of the National Academy of the USA. 2001, 98: 5217-5221. 10.1073/pnas.091062198.View ArticleGoogle Scholar
- Bishop Y, Fienberg SE, Holland P: Discrete Multivariate Analysis: Theory and Practicee. 1975, Cambridge, MA: MIT PressGoogle Scholar
- Lewontin RC: The interaction of selection and linkage. 1: general considerations: heterotic models. Genetics. 1964, 49: 49-67.PubMed CentralPubMedGoogle Scholar
- Abad M, Montes R, Sebastiani P: Building Chromosome-Wide LD Maps. Bioinformatics. 2006, 1933-1934. 10.1093/bioinformatics/btl288.Google Scholar
- Ott J: Analysis of human genetic linkage. 1999, Baltimore, MD: John HopkinsGoogle Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002, 18: 337-338. 10.1093/bioinformatics/18.2.337.View ArticlePubMedGoogle Scholar
- Dunning A, Durocher F, Healey C, Teare MD, McBride SE, Carlomagno F, Xu CF, Dawson E, Rhodes S, Ueda S, Lai E, Luben RN, Rensburg EJV, Mannermaa A, Kataja V, Rennart G, Dunham I, Purvis I, Easton D, Ponder BA: The extent of linkage disequilibrium in four populations with distinct demographic histories. American Journal of Human Genetics. 2000, 67: 1544-54. 10.1086/316906.PubMed CentralView ArticlePubMedGoogle Scholar
- Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D: Calibrating a coalescent simulation of human genome sequence variation. Genome Research. 2005, 15: 1576-1583. 10.1101/gr.3709305.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.