Skip to main content

Estimating linkage disequilibrium from genotypes under Hardy-Weinberg equilibrium



Measures of linkage disequilibrium (LD) play a key role in a wide range of applications from disease association to demographic history estimation. The true population LD cannot be measured directly and instead can only be inferred from genetic samples, which are unavoidably subject to measurement error. Previous studies of r2 (a measure of LD), such as the bias due to finite sample size and its variance, were based on the special case that the true population-wise LD is zero. These results generally do not hold for non-zero \( {r}_{true}^2 \) values, which are more common in real genetic data.


This work generalises the estimation of r2 to all levels of LD, and for both phased and unphased data. First, we provide new formulae for the effect of finite sample size on the observed r2 values. Second, we find a new empirical formula for the variance of the observed r2, equals to 2E[r2](1 − E[r2])/n, where n is the diploid sample size. Third, we propose a new routine, Constrained ML, a likelihood-based method to directly estimate haplotype frequencies and r2 from diploid genotypes under Hardy-Weinberg Equilibrium. While serving the same purpose as the pre-existing Expectation-Maximisation algorithm, the new routine can have better convergence and is simpler to use. A new likelihood-ratio test is also introduced to test for the absence of a particular haplotype. Extensive simulations are run to support these findings.


Most inferences on LD will benefit from our new findings, from point and interval estimation to hypothesis testing. Genetic analyses utilising r2 information will become more accurate as a result.



Linkage Disequilibrium (LD) was first defined about 100 years ago as the non-random association of alleles at different loci [1]. Since that time there has been much research on the topic, some focused on how LD is quantified and defined [2,3,4,5,6,7], and a larger fraction on the connection between LD and various evolutionary forces that shape it, including genetic drift [8,9,10,11] and selection [12, 13]. These investigations have also extended to sub-divided or structured populations [14,15,16,17]. In principle, these theoretical works allow one to infer features of the underlying processes from measures of LD [18, 19]. Another application of LD includes association studies to identify genes for diseases, such as in the Human Haplotype Map project [20]. With the advance in sequencing technology, computer packages have been developed to calculate LD for large numbers of samples and genetic loci [5, 21,22,23,24]. While there are plenty of applications utilising LD information, they all rely on accurate and robust estimation of the parameter of interest, which many have taken for granted. There are, however, key gaps regarding LD estimation that have yet to be resolved.

The squared correlation coefficient r2 is a popular measure of LD alongside D or D’ [1]. One advantage of r2 is that it is less sensitive to marginal allele frequencies. It also relates to the ϕ correlation coefficient and χ2 test statistic for association of contingency tables [25]. Further, Sved and Feldman [26] showed the equivalence of r2 and the probability of linked identity by decent between two random-chosen haplotypes. Most previous studies concerning the estimation of r2, including the mean and variance, have been based on the assumption of linkage equilibrium (i.e. \( {r}_{true}^2=0 \)). These findings do not hold for real datasets where the true correlation between loci is non-zero. In this article we extend the theory of r2 estimation to all levels of LD. We first study the expectation of the observed r2 for finite sample size, as sampling is known to bias the observed r2 [27]. Second, we approximate the empirical variance of the observed r2 as a function of sample size and its expectation E[r2]. Third, we propose a direct routine to estimate haplotype frequencies and r2 for unphased data under Hardy-Weinberg Equilibrium (HWE). Throughout this paper, we define \( {r}_{true}^2 \) as the true population-wise, unobserved LD between two loci, while \( {r}_{phased}^2 \) and \( {r}_{unphased}^2 \) as the raw squared coefficient computed directly from phased and unphased data respectively.

Effect of finite sample size

Consider a classical two-allele, two-locus scenario, with alleles A and a at the first locus and alleles B and b at the second. Let pAB, pAb, paB, pab be the true haplotype frequencies of the four haplotype combinations AB, Ab, aB, ab. Statistically speaking, if samples are taken with replacement, the observed haplotype counts follow a multinomial distribution with size 2n and probabilities equal the true haplotype frequencies. Let \( \overset{\sim }{p_{AB}},\overset{\sim }{p_{Ab}},\overset{\sim }{p_{aB}},\overset{\sim }{p_{ab}} \) be the sampled haplotype frequencies from our genetic samples, which are also the maximum likelihood estimators (MLE) for the true haplotype frequencies. We also let \( {r}_{phased}^2 \) be the squared correlation computed directly using the observed frequencies:

$$ {r}_{phased}^2=\frac{{\left(\overset{\sim }{p_{AB}}\overset{\sim }{p_{ab}}-\overset{\sim }{p_{Ab}}\overset{\sim }{p_{aB}}\right)}^2}{\overset{\sim }{p_A}\ \left(1-\overset{\sim }{p_A}\right)\overset{\sim }{p_B}\left(1-\overset{\sim }{p_B}\ \right)} $$

where \( \overset{\sim }{p_A}=\overset{\sim }{p_{AB}}+\overset{\sim }{p_{Ab}} \) and \( \overset{\sim }{p_B}=\overset{\sim }{p_{AB}}+\overset{\sim }{p_{aB}} \) are the observed marginal allele frequencies for allele A and B. Note that this formula is identical to the square of the ϕ coefficient for a two-by-two contingency table [25, 28]. The invariant principle of MLE suggests that \( {r}_{phased}^2 \) is also the MLE for \( {r}_{true}^2 \), but does not guarantee its unbiasedness towards the parameter of interest. The next step is to establish the effect of sample size and find a formula to connect \( {r}_{phased}^2 \) and \( {r}_{true}^2 \).

Sved and Feldman [26] showed the expected change in r2 due to genetic drift over two successive generations is

$$ E\left[{r}_{t+1}^2\right]=\frac{1}{2{N}_e}+\left(1-\frac{1}{2{N}_e}\right){\left(1-c\right)}^2{r}_t^2 $$

with c being the recombination rate between a pair of loci and Ne the effective population size. This equation is seemingly irrelevant to our problem, but we may consider the sampling process as another generation of genetic drift with population size equal to the sample size 2n under complete linkage (c = 0). Therefore, given the true \( {r}_{true}^2 \) for a population, the expected observed \( {r}_{phased}^2 \) becomes:

$$ E\left[{r}_{phased}^2\right]=\frac{1}{2n}+\left(1-\frac{1}{2n}\right){r}_{true}^2 $$

Or when we estimate the underlying \( {r}_{true}^2 \) from an observed value \( {r}_{phased}^2 \) the sample size correction formula becomes:

$$ \hat{r_{true}^2}=\frac{r_{phased}^2-\frac{1}{2n}}{1-\frac{1}{2n}} $$

For unphased data, the sample size correction should largely follow the phased case, with n replacing 2n:

$$ E\left[{r}_{unphased}^2\right]=\frac{1}{n}+\left(1-\frac{1}{n}\right){r}_{true}^2 $$

and similarly if we estimate the underlying \( {r}_{true}^2 \) from the estimated haplotype frequencies, the sample size correction formula is:

$$ \hat{r_{true}^2}=\left({r}_{unphased}^2-\frac{1}{n}\right)/\left(1-\frac{1}{n}\right) $$

Empirical variance of r 2

r2 is a ratio hence its variance is difficult to evaluate. The variance is required when inferring the confidence interval (C.I.) of an r2 estimate from a pair of loci, or in hypothesis testing to test against a specific true value. Many existing applications, such as those for effective population size estimation, suggest that the observed \( \frac{r^2}{E\left[{r}^2\right]} \) is approximately χ2 distributed with 1 degree of freedom, which implies that var(r2) ≈ 2E[r2]2 [27, 29]. This expression is derived under the null distribution of the χ2 statistic and is only correct if the underlying \( {r}_{true}^2=0 \). An obvious counter-example is a pair of perfectly correlated loci, whose \( {r}_{true}^2 \) and observed \( {r}_{phased}^2 \) (or \( {r}_{unphased}^2 \)) are 1, and hence the variance is 0 (instead of 2). While a closed-form expression for the variance may not exist, we will approximate it with empirical simulations and relate it to sample sizes and other factors.

Estimating haplotype frequencies from unphased data

The term LD is often called the “gametic phase disequilibrium”, which specifically refers the correlation of alleles at the haplotype level. For diploid individuals, however, direct inference of haplotype frequencies is usually impossible when gametic phase is not known. The reason is that we are unable to tell the exact haplotype configuration for double heterozygotes, as they can be AB/ab or Ab/aB. Under HWE the expected frequencies for each genotype f1, f2, …, f9 are shown in Table 1. As introduced by Hill in 1974, the log-likelihood with respect to the haplotype frequencies, is [2]:

$$ l\left({p}_{AB},{p}_{Ab},{p}_{aB},{p}_{ab}\right)= constant+\sum \limits_{i=1}^9{n}_i\log \left({f}_i\right) $$

where n1, n2, …, n9 are the counts for each genotype. It is easy to understate the challenges in maximising this log-likelihood. Direct maximisation of Eq. 7 is not always feasible, hence the use of Expectation-Maximisation (EM) algorithm was suggested [21]. The second approach, adapted by CubeX, calculates the first derivataes of Eq. 7 and solves the associated cubic equation. This however works only for the two-allele two-locus case. 

Table 1 Expected genotypic frequencies under HWE

Here we propose a new approach to directly maximise Eq. 7 and thus to estimate the haplotype frequencies. Without loss of generality we drop the term pab as the four haplotype frequencies must add to one. The feasible region of the remaining three haplotype frequencies looks like a tetrahedron with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1), and (0, 0, 0). Our method, called Constrained ML, transforms the haplotype frequencies before maximising the log-likelihood function. For this two-allele two-locus scenario, the transformation is as follows:

$$ {\displaystyle \begin{array}{l}u={p}_{AB}+{p}_{Ab}+{p}_{aB}\\ {}v=\frac{p_{AB}+{p}_{Ab}}{p_{AB}+{p}_{Ab}+{p}_{aB}}\\ {}w=\frac{p_{AB}}{p_{AB}+{p}_{Ab}}\end{array}} $$

The feasible region of the new coordinates {u, v, w} becomes a unit cube. The log-likelihood is then maximised with respect to the new coordinates in this “box-like” constraint, where a number of common optimisation routines become available. The MLE for the haplotype frequencies can be obtained by back transforming the \( \left\{\hat{u},\hat{v},\hat{w}\right\} \) values which maximise the function.

Sometimes, we need to decide whether a haplotype actually exists in the population. For example, if n6 = n8 = n9 = 0 then we cannot rule out the possibility of pab = 0, even if the estimated frequency is not. The same principle applies to the other haplotypes. While CubeX provides an additional solution (denoted as the γ solution) should this happen, it gives little indication of which set of estimated haplotype frequency we should accept. Under this scenario, we propose to perform a likelihood-ratio test (LRT), to test whether a particular haplotype has zero frequency as a precaution. This use of an LRT will be demonstrated in the analysis of a real dataset.


The plots of \( {r}_{phased}^2 \) versus \( {r}_{true}^2 \) are shown in Fig. 1 for several sample sizes. Linear regressions were run through these simulated data points, and the estimates and confidence intervals (C.I.s) of the slopes and intercepts are summarised in Table 2. The estimates of intercepts and slopes were very close to 1/2n and (1 − 1/2n), which agree to our derivation for r2 under finite sample size \( E\left[{r}_{phased}^2\right]=\left(1-\frac{1}{2n}\right){r}_{true}^2+\frac{1}{2n} \) in Eq. 3. In particular, the 95% C.I. for slopes excluded 1 for all examined cases. The results from the same study using genotypic (unphased) data are shown in Fig. 2 and Table 3. The results were similar to the phased case, with estimates of intercepts and slopes of about 1/n and (1 − 1/n) respectively. In short, both phased and unphased simulations followed closely our theoretical expectations of observed r2 due to the effect of finite sampling.

Fig. 1

Plots of \( {r}_{phased}^2 \) against \( {r}_{true}^2 \) under different sample sizes: 20 (top left), 40 (top right), 60 (bottom left), and 80 (bottom right). A linear regression (red line) is fitted to each plot and the estimates are reported in Table 2

Table 2 Slope and intercept estimates from phased data
Fig. 2

Plots of \( {r}_{unphased}^2 \) against \( {r}_{true}^2 \) under different sample sizes: 20 (top left), 40 (top right), 60 (bottom left), and 80 (bottom right). A linear regression (red line) is fitted to each plot and the estimates are reported in Table 3. Simulation setting is described in text

Table 3 Slope and intercept estimates from unphased data

Figures 3 and 4 show the variance plots against their expectations for phased and unphased data. The variance decreased with sample size n. Under the same condition the variances were smaller for phased than unphased data. The variance generally increases with their expectations for E[r2] < 0.5, and then come down afterwards. As predicted, the variance goes down to 0 when E[r2] approaches 1.

Fig. 3

Plots of variance of \( {r}_{phased}^2 \) against \( E\left[{r}_{phased}^2\right] \) under different sample sizes: 20 (top left), 40 (top right), 60 (bottom left), and 80 (bottom right). The red lines shows the functional form of \( 2E\left[{r}_{phased}^2\right]\left(1-E\left[{r}_{phased}^2\right]\right)/n \)

Fig. 4

Plots of variance of \( {r}_{unphased}^2 \) against \( E\left[{r}_{unphased}^2\right] \) under different sample sizes: 20 (top left), 40 (top right), 60 (bottom left), and 80 (bottom right). The red lines shows the functional form of \( 2E\left[{r}_{unphased}^2\right]\left(1-E\left[{r}_{unphased}^2\right]\right)/n \)

Our final set of simulations compared the convergence between Constrained ML and the EM algorithm. For both methods, the maximisation terminated at the kth iteration when l(k + 1) − l(k) / max(|l(k + 1)|, |l(k)|, 1) was smaller than the chosen relative tolerance. The plots of relative log-likelihood against relative tolerance are found in Fig. 5. The global maximum of the log-likelihood surface will have the relative value of 1, and all other points will have values smaller than 1. For very loose relative tolerance of 10−2 Constrained ML was inaccurate. Between 10−3 and 10−6 Constrained ML converged better than the EM algorithm, and the two methods performed equally well for 10−7 and smaller. The IF index in Fig. 6 measures the differences between the estimated and true haplotype frequencies, with a value of 1 referring to the scenario when the two are identical. Figure 6 suggests that the two methods behaved similarly for tighter tolerance (10−6 and beyond). The IF for Constrained ML was also more predictable and stable, while there was greater variability for the EM algorithm.

Fig. 5

Plots of relative log-likelihood against relative tolerance for the two maximisation routines using unphased data: the EM algorithm (black circles), and Constrained ML (red crosses). Four different sample sizes were examined: 20 (top left), 40 (top right), 60 (bottom left), and 80 (bottom right). The global maximum of the log-likelihood has the relative value of 1

Fig. 6

Plots of IF index against relative tolerance for the two maximisation routines using unphased data: the EM algorithm (black circles), and Constrained ML (red crosses). Four different sample sizes were examined: 20 (top left), 40 (top right), 60 (bottom left), and 80 (bottom right)

To demonstrate the use of Constrained ML and the relevant LRT we analysed a published dataset on APOE [30]. The dataset consists of 9 loci from 80 human individuals whose haplotypes were experimentally identified. We masked the haplotype phase (i.e. as if we obtained their genotypes only) and tried to estimate the haplotype counts for all 36 pairs of loci, and when required, to conduct a LRT to test for the absence of a particular haplotype.

The complete results are presented in Additional file 1, with selected summary in Table 4. For comparison, the results from CubeX and MIDAS (representing the EM algorithm) are also presented [24]. MIDAS was able to correctly estimate the haplotype counts for 28 out of 36 pairs of loci. CubeX provided unique and correct estimates for 26 cases. Additionally in 5 other cases, CubeX provided two solutions, one of which was the correct one. Constrained ML also gave unique and correct haplotype estimates for the same 26 cases as CubeX. LRT were run on the 5 remaining cases that potentially have only 3 haplotypes. LRT made the correct decision on 4 cases (loci pair 1–5, 4–8, 5–7, and 5–9. See Table 4), but falsely rejected the correct answer for loci pair 1–9. To summarise, Constrained ML and LRT jointly provided correct haplotype count estimates for 30 cases.

Table 4 Selected results from the analysis of APOE dataset


Effect of finite sample size

The theoretical derivation and computer simulations both suggest the observed \( E\left[{r}^2\right]=\frac{1}{s}+\left(1-\frac{1}{s}\right){r}_{true}^2 \), where s = n sampled diploid individuals for unphased data, and s = 2n for haplotypic data. This is different from most existing formulae, which have the form of \( E\left[{r}^2\right]={r}_{true}^2+ correction\ factor \) [7, 27]. The explanation is that most previous derivations were based on the null distribution of the χ2 statistic for association [30], or equivalently assuming \( {r}_{true}^2=0 \) [29]. These corrections become less reliable when \( {r}_{true}^2>0 \). For the limiting case of completely linked loci, our sample size correction (Eqs. 4 and 6) guarantees that the implied \( \hat{r_{true}^2} \) is also 1, while the existing form over-corrects for sample size [16]. Further, Tables 2 and 3 show that the slope estimates are significantly different from 1, and thus the term (1 − 1/s) should be retained. Although the difference can sometimes be small, it is conceptually important that all the squared correlation coefficients must be bounded between 0 and 1. As pointed out by Sved et al. [7], the exact expression for sample size corrections may contain o(s−2) terms, but are shown to be negligible here.

Empirical variance of r 2

We pointed out earlier that most existing claims about the variance of r2 are based on \( {r}_{true}^2=0 \) and do not apply to a wider range of \( {r}_{true}^2 \) values. The second simulation investigated empirically the variance of the observed \( {r}_{phased}^2 \) and \( {r}_{unphased}^2 \) against their expectations and under various sample sizes. The variance plots in Figures 3 and 4 look like parabolas, in which the variances first increase and peak at E[r2] = 0.5 and then come down for larger values. Empirically speaking, the variances go like 2E[r2](1 − E[r2])/n for most \( {r}_{true}^2>0 \) as modelled by the red lines in the plots. It is expected that the two marginal frequencies may play a role in the variance, but the exact expression is too complicated to be evaluated. This approximate formula provides a quick and direct way to approximate the variances and subsequently the confidence intervals of r2. In addition, this formula helps predict the gain in precision by phasing the data or by increasing the sample size.

Estimating haplotype frequencies from unphased data

This work proposes a new routine, Constrained ML, to estimate haplotype frequencies from genotypes under HWE. In theory, Constrained ML, EM, and CubeX all aim to maximise the same Hill 1974 log-likelihood function and hence should be identical. In reality they may produce inconsistent results because of the different ways of maximisation. CubeX estimates the haplotype frequencies by solving the cubic equation for the two-locus two-allele case. It may return two sets of answers which are both real and biologically feasible, and this is particularly common when the sample size is small, or when the loci depart from HWE [5]. Another explanation of having multiple answers is that being a root of the cubic equation is only a necessary condition for maximising the likelihood. It is unfortunate that CubeX does not provide any indications on which set of haplotype frequencies we should accept, other than using our “prior knowledge of the LD structure” [5]. For the more general case with multiple alleles, the EM algorithm was introduced because direct maximisation was not always available. It experiences other computing challenges, for example, if pABpab + pAbpaB = 0 in any intermediate E-step, the computation halts as division by zero is not permitted. The method is also known to be sensitive to initial conditions, and often to converge to a local rather than the global maximum [21]. With our new method, Constrained ML, the same log-likelihood can be directly maximised within the transformed feasible region. Optimisation within this box-like constraint is a well-studied problem with many routines available across platforms and programming languages, such as L-BFGS-B used in this study. The last simulation compared the convergence between EM and Constrained ML under different sample sizes and stopping criterion. The two methods performed similarly for very tight tolerance for a simple two-allele two-locus setting. A looser relative tolerance is normally implemented in real applications to balance between accuracy and computing time, and in this case Constrained ML produced better convergence than the EM algorithm. Additionally, like the EM algorithm, Constrained ML can handle loci with multiple alleles, by transforming haplotype frequencies into higher-dimensional “cubes” (Additional file 2). The idea of the LRT can also be extended to multiallelic cases to test for the absence of any particular haplotypes. Further comparisons of these methods, especially under more challenging conditions, would be welcome. The APOE dataset, with reasonable sample size and often extreme haplotype counts, illustrates the use of Constrained ML and the associated LRT in real applications. The EM-based MIDAS, which provides one estimate a time, got the least correct cases. Although CubeX apparently gave more correct haplotype count estimates, there were 5 ambiguous cases with two solutions. For the 5 loci pairs that potentially have only 3 haplotypes instead of 4, LRT correctly identified the answer in 4 cases, but marginally rejected the correct answer for the loci pair 1–9 at 5% α level (LRT statistic = 1.84, p-value = 0.17). We should also point out estimation errors were rare but unavoidable, and this is exactly why phased data is preferred. Nonetheless, LRT provides a valuable metric to help decide which set of answer we should accept.

There exist some other methods, such as the Burrows’ method [3, 7], to estimate r2 without assuming HWE, but they are beyond the scope of this work. Burrows’ Δ measures the so-called composite linkage disequilibrium from non-gametic frequencies, which takes the departure from HWE into account. One can further break down the nine genotypes into eight parameters to include the single-locus disequilibria and higher-order disequilibria [31]. On the downside, they are not as efficient as the likelihood estimators if the HWE assumption is valid.


This work generalised the estimation of r2 to all levels of LD, and for both phased and unphased data. New formulae were provided to correct for finite sample size during r2 point estimation. We approximated the empirical variance of r2 based on computer simulations. Lastly, a new framework called Constrained ML was suggested to directly estimate haplotype frequencies from diploid genotypic data under HWE. Most inferences utilising LD information will benefit from our new findings.


Computer simulation 1: effect of finite sample size

Simulations were run to verify whether the effect of finite sample size on r2 estimates is the same as described by Eqs. 3 and 5. First, to ensure most haplotype combinations are covered, a set of true haplotype frequencies was drawn randomly from the uniform Dirichlet(1, 1, 1, 1) distribution, which was used to calculate the underlying \( {r}_{true}^2 \). Second, haplotypes were sampled with a known sample size via the multinomial distribution, and the observed \( {r}_{phased}^2 \) were calculated via Eq. 1. For unphased case, two haplotypes were paired into one genotype. Haplotype frequencies and \( {r}_{unphased}^2 \) were estimated through Constrained ML. These two steps were repeated for 10,000 times per sample size, and further repeated for sample sizes of 20, 40, 60, and 80 diploid individuals. The observed \( {r}_{phased}^2 \) and \( {r}_{unphased}^2 \) were plotted against \( {r}_{true}^2 \) for each sample size.

Computer simulation 2: empirical variance of r 2

Another set of simulations was run to explore the empirical variance of \( {r}_{phased}^2 \) (or \( {r}_{unphased}^2 \)). The procedure was very similar to the first simulation. For each \( {r}_{true}^2 \), 500 additional samples were simulated to calculate the variance of the observed \( {r}_{phased}^2 \) (or \( {r}_{unphased}^2 \)). This was repeated for 10,000 different sets of true haplotype frequencies per sample size, and further repeated for sample sizes of 20, 40, 60, and 80 diploid individuals.

Computer simulation 3: estimating haplotype frequencies from unphased data

The final set of simulations studied the convergence of Constrained ML and the EM algorithm against different stopping criterion and sample sizes. We measured convergence by two metrics, the relative log-likelihood [32], and the IF index [21]. For each simulation, true haplotype frequencies were drawn from the Dirichlet(1, 1, 1, 1) distribution, which were then used to sample the genotypes with a known sample size. Two haplotypes were randomly paired up to form a genotype. All initially fixed/extinct loci were discarded and resampled. Then the log-likelihood function (Eq. 5) was maximised via Constrained ML and the EM algorithm. In particular, Constrained ML was optimised by the L-BFGS-B routine within the optim() function in R [33, 34]. To avoid false convergence under a specific initial condition, 100 initial conditions were applied to each set of genotypes and the estimate with the largest maximised log-likelihood was used. The whole simulation was repeated 500 times, and further repeated for several different sample sizes and stopping criterion. We used relative tolerance as our stopping criteria, ranging between 10−2 and 10−9.

Availability of data and materials

The raw APOE dataset can be found in the original publication [30]. Computer simulations and mathematical derivations are replicable per instructed in the main text. All computer codes are available upon request. An online program to implement Constrained ML can be found at



Confidence interval




Hardy-Weinberg Equilibrium


Linkage disequilibrium


Likelihood-ratio test


Maximum-Likelihood estimator/estimate


  1. 1.

    Sved JA, Hill WG. One hundred years of linkage disequilibrium. Genetics. 2018;209(3):629–36.

    PubMed  PubMed Central  Google Scholar 

  2. 2.

    Hill WG. Estimation of linkage disequilibrium in randomly mating populations. Heredity (Edinb). 1974;33(2):229–39.

    CAS  Article  Google Scholar 

  3. 3.

    Weir BS. Inferences about linkage disequilibrium. Biometrics. 1979;35(1):235–54.

    CAS  PubMed  Article  Google Scholar 

  4. 4.

    Lewontin RC. On measures of gametic disequilibrium. Genetics. 1988;120(3):849–52.

    CAS  PubMed  PubMed Central  Google Scholar 

  5. 5.

    Gaunt TR, Rodríguez S, Day IN. Cubic exact solutions for the estimation of pairwise haplotype frequencies: implications for linkage disequilibrium analyses and a web tool’CubeX’. BMC Bioinformatics. 2007;8(1):1.

    Article  CAS  Google Scholar 

  6. 6.

    Rogers AR, Huff C. Linkage disequilibrium between loci with unknown phase. Genetics. 2009;182(3):839–44.

    PubMed  PubMed Central  Article  Google Scholar 

  7. 7.

    Sved JA, Cameron EC, Gilchrist AS. Estimating effective population size from linkage disequilibrium between unlinked loci: theory and application to fruit fly outbreak populations. PLoS One. 2013;8(7):e69078.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  8. 8.

    Hill WG, Robertson A. Linkage disequilibrium in finite populations. Theor Appl Genet. 1968;38(6):226–31.

    CAS  PubMed  Article  Google Scholar 

  9. 9.

    Ohta T, Kimura M. Linkage disequilibrium at steady state determined by random genetic drift and recurrent mutation. Genetics. 1969;63(1):229.

    CAS  PubMed  PubMed Central  Google Scholar 

  10. 10.

    Hudson R. The sampling distribution of linkage disequilibrium under an infinite allele model without selection. Genetics. 1985;109(3):611–31.

    CAS  PubMed  PubMed Central  Google Scholar 

  11. 11.

    Tenesa A, Navarro P, Hayes BJ, Duffy DL, Clarke GM, Goddard ME, et al. Recent human effective population size estimated from linkage disequilibrium. Genome Res. 2007;17(4):520–6.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  12. 12.

    Lewontin RC. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics. 1964;49(1):49–67.

    CAS  PubMed  PubMed Central  Google Scholar 

  13. 13.

    Stephan W, Song YS, Langley CH. The hitchhiking effect on linkage disequilibrium between linked neutral loci. Genetics. 2006;172(4):2647–63.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  14. 14.

    Nei M, Li WH. Linkage disequilibrium in subdivided populations. Genetics. 1973;75(1):213–9.

    CAS  PubMed  PubMed Central  Google Scholar 

  15. 15.

    Vitalis R, Couvet D. Estimation of effective population size and migration rate from one- and two-locus identity measures. Genetics. 2001;157(2):911–25.

    CAS  PubMed  PubMed Central  Google Scholar 

  16. 16.

    Sved JA, McRae AF, Visscher PM. Divergence between human populations estimated from linkage disequilibrium. Am J Hum Genet. 2008;83(6):737–43.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  17. 17.

    McEvoy BP, Powell JE, Goddard ME, Visscher PM. Human population dispersal “out of Africa” estimated from linkage disequilibrium and allele frequencies of SNPs. Genome Res. 2011;21(6):821–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  18. 18.

    Hartl DL, Clark AG. Principles of population genetics; 1998.

    Google Scholar 

  19. 19.

    Charlesworth B, Charlesworth D. Elements of evolutionary genetics. Colorado: Roberts and Company Publishers; 2010.

  20. 20.

    Thorisson GA, Smith AV, Krishnan L, Stein LD. The international HapMap project web site. Genome Res. 2005;15(11):1592–3.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  21. 21.

    Excoffier L, Slatkin M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol. 1995;12(5):921–7.

    CAS  PubMed  Google Scholar 

  22. 22.

    Zhao JH. 2LD, GENECOUNTING and HAP: computer programs for linkage disequilibrium analysis. Bioinformatics. 2004;20(8):1325–6.

    CAS  PubMed  Article  Google Scholar 

  23. 23.

    Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263–5.

    CAS  Article  PubMed  Google Scholar 

  24. 24.

    Gaunt TR, Rodriguez S, Zapata C, Day IN. MIDAS: software for analysis and visualisation of interallelic disequilibrium between multiallelic markers. BMC Bioinformatics. 2006;7(1):227.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  25. 25.

    Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. Florida: CRC Press; 2003.

    Google Scholar 

  26. 26.

    Sved J, Feldman M. Correlation and probability methods for one and 2 loci. Theor Popul Biol. 1973;4(1):129–32.

    CAS  PubMed  Article  Google Scholar 

  27. 27.

    Hill WG. Estimation of effective population-size from data on linkage disequilibrium. Genet Res. 1981;38(3):209–16.

    Article  Google Scholar 

  28. 28.

    Warrens MJ. On similarity coefficients for 2× 2 tables and correction for chance. Psychometrika. 2008;73(3):487.

    PubMed  PubMed Central  Article  Google Scholar 

  29. 29.

    Haldane J. The association of characters as a result of inbreeding and linkage. Ann Eugenics. 1949;15(1):15–23.

    CAS  Article  Google Scholar 

  30. 30.

    Orzack SH, Gusfield D, Olson J, Nesbitt S, Subrahmanyan L, Stanton VP. Analysis and exploration of the use of rule-based algorithms and consensus methods for the inferral of haplotypes. Genetics. 2003;165(2):915–28.

    PubMed  PubMed Central  Google Scholar 

  31. 31.

    Weir B. Linkage disequilibrium and association mapping. Annu Rev Genomics Hum Genet. 2008;9:129–42.

    CAS  PubMed  Article  Google Scholar 

  32. 32.

    Fallin D, Schork NJ. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet. 2000;67(4):947–59.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  33. 33.

    Byrd R, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput. 1995;16(5):1190–208.

    Article  Google Scholar 

  34. 34.

    R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2013.

    Google Scholar 

Download references


We thank S.M. O’Loughlin for useful discussions.


Authors received funding from Target Malaria, which receives core funding from the Bill & Melinda Gates Foundation and from the Open Philanthropy Project Fund, an advised fund of Silicon Valley Community Foundation. These funding bodies have had no direct role in the design of the study nor in the collection, analysis, interpretation of data and in the writing of the manuscript.

Author information




T-YJH & AB conducted the research. T-YJH performed computer analysis. T-YJH & AB wrote the manuscript. Both authors have read and approved the manuscript.

Corresponding author

Correspondence to Tin-Yu J. Hui.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Complete results from the analysis of APOE dataset.

Additional file 2.

Generalisation of Constrained ML to multiallelic loci.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Hui, TY.J., Burt, A. Estimating linkage disequilibrium from genotypes under Hardy-Weinberg equilibrium. BMC Genet 21, 21 (2020).

Download citation


  • Linkage disequilibrium
  • Maximum likelihood estimation
  • Sampling error