Hypothesis testing of meiotic recombination rates from population genetic data

Background Meiotic recombination, one of the central biological processes studied in population genetics, comes in two known forms: crossovers and gene conversions. A number of previous studies have shown that when one of these two events is nonexistent in the genealogical model, the point estimation of the corresponding recombination rate by population genetic methods tends to be inflated. Therefore, it has become necessary to obtain statistical evidence from population genetic data about whether one of the two recombination events is absent. Results In this paper, we formulate this problem in a hypothesis testing framework and devise a testing procedure based on the likelihood ratio test (LRT). However, because the null value (i.e., zero) lies on the boundary of the parameter space, the regularity conditions for the large‐sample approximation to the distribution of the LRT statistic do not apply. In turn, the standard chi‐squared approximation is inaccurate. To address this critical issue, we propose a parametric bootstrap procedure to obtain an approximate p‐value for the observed test statistic. Coalescent simulations are conducted to show that our approach yields accurate null p‐values that closely follow the theoretical prediction while the estimated alternative p‐values tend to concentrate closer to zero. Finally, the method is demonstrated on a real biological data set from the telomere of the X chromosome of African Drosophila melanogaster. Conclusions Our methodology provides a necessary complement to the existing procedures of estimating meiotic recombination rates from population genetic data. Electronic supplementary material The online version of this article (doi:10.1186/s12863-014-0122-7) contains supplementary material, which is available to authorized users.


Background
Meiotic recombination is one of the essential evolutionary factors responsible for promoting genetic diversity within species. There are two major types of meiotic recombination events: crossovers and gene conversions. Unlike crossover, which is a reciprocal event, gene conversion is a unidirectional event that involves the transfer of a short segment of one parental chromosome (called a 'conversion tract') to the other parental chromosome. Crossovers and gene conversions play different roles in shaping the pattern of linkage disequilibrium (LD) observed in natural populations: "Recombination between pairs of markers that are far apart are almost exclusively crossovers, whereas pairs of markers that are close together are affected by both crossovers and gene conversion events" [1]. Thus, studying these two biological processes and characterizing their basic parameters has direct implications for population genetic studies.
There is a growing body of work on coalescent-based statistical approaches to jointly estimating the crossover rate, the gene conversion rate, and the mean conversion tract length from population genetic data. Building on a popular framework called the "Product of Approximate Conditionals" (PAC) model [2], Gay et al. [3] have proposed a likelihood-based method to incorporate gene conversion events. Yin et al. [4] have extended and improved the model further by explicitly modeling overlapping gene conversion events. On the flip side of these two frequentist approaches, Bayesian Markov chain Monte Carlo (MCMC) techniques have also been developed to estimate recombination rates from population genetic data [5,6].
Despite the marked progress in the joint estimation of the aforementioned three parameters, these methods are less suitable when one of the two recombination events http://www.biomedcentral.com/1471-2156/ 15/122 is absent in the genealogical model. The corresponding population parameter, especially the gene conversion rate when the gene conversion event is nonexistent, tends to be overestimated by the maximum likelihood (or maximum a posteriori) point estimation. This is unfortunately inevitable because the true parameter value (i.e., zero) lies on the boundary of the parameter space. The use of inaccurate parameters may limit the efficacy of these approaches, and can also hinder population genetic analyses based on these estimators. Therefore, it has become necessary to obtain statistical evidence from population genetic data about whether one of the two recombination events is absent.
The goal of this article is to propose a rigorous procedure to perform hypothesis testing for this problem. Our approach is based on the likelihood ratio test (LRT). One of the classical regularity conditions for the asymptotic distribution of the LRT statistic requires the null value to be an interior point in the parameter space. However, because this condition is not satisfied, it is invalid to apply the standard chi-squared approximation in this setting. We thus develop a parametric bootstrap procedure to obtain an approximate p-value of the observed test statistic. Coalescent-based simulations are conducted to demonstrate the soundness and effectiveness of our approach. The bootstrap estimates of the null p-values closely follow the theoretical prediction, while the estimated alternative p-values tend to concentrate closer to zero. Finally, we apply the method to a real biological data set from the telomere of the X chromosome of African D. melanogaster. The result suggests that while gene conversion is likely to play a leading role in shaping the observed polymorphism in these regions, crossover may not have been greatly suppressed in a short segment of su(w a ) locus.

Methods
We begin by reviewing some previous statistical models used for point estimation of recombination parameters from population genetic data. In developing our hypothesis testing procedure based on the likelihood ratio test (LRT), we adopt the likelihood function of the OVER-PAINT model that offers greater flexibility by allowing for overlapping gene conversions [4,7]. Throughout this paper, ρ and γ are used to refer to the population-scaled crossover and gene conversion rates (per kb), respectively. The mean length of gene conversion tracts (kb) is denoted by λ.

The PAC model and the GenCo model
In principle, given a set of n haplotypes H = {h 1 , . . . , h n } sampled from a natural population, the estimation of ρ, γ and λ can be obtained by maximizing the likelihood function (ρ, γ , λ) := P(H | ρ, γ , λ). However, unless we can examine the true genealogical history of sampled sequences in the population [8], which is rarely available in a population genetic study, we are unable to compute the exact likelihood function in most models of interest. To be precise, where the integral is over all possible genealogies G and P(G | ρ, γ , λ) is modeled by the coalescent process with crossovers and gene conversions [9,10]. The above likelihood computation is notoriously difficult because the number of genealogies G consistent with the sampled haplotypes H, where the consistency is determined by P(H | G), grows extremely fast as the length of sampled haplotypes increases [11]. Several approximate-likelihood approaches have therefore been developed to approximate the likelihood surface. The 'Product of Approximate Conditionals' (PAC) model, first proposed in [2], makes use of the fact that the joint likelihood of the sampled haplotypes can be decomposed into a product of conditional probabilities: However, the exact conditional probabilities P(h k+1 | h 1 , . . . , h k , ρ, γ , λ) are largely unknown for the coalescent models with recombination. Using efficiently computable approximationsπ to substitute for the exact conditional probabilities P, the following approximation to the joint likelihood has been suggested in [2]: Instead of maximizing the true but intractable likelihood function , the idea of the PAC model is to use the approximate likelihood PAC as a surrogate function to estimate recombination parameters from the sampled haplotypes. The original PAC model [2] has only considered the estimation of the crossover rate ρ, in which case PAC becomes a one dimensional function. Gay et al. [3] have extended the model by incorporating gene conversion events, and their model GenCo can be used to jointly estimate the crossover rate ρ, the gene conversion rate γ , and the mean conversion tract λ. http://www.biomedcentral.com/1471-2156/15/122 The choice of the approximate conditional probabilitieŝ π(h k+1 | h 1 , . . . , h k , ρ, γ , λ) in the GenCo model assumes that h k+1 is an imperfect mosaic copy of h 1 , . . . , h k . In particular, h k+1 is considered to consist of a mixture of segments from h 1 , . . . , h k with a small number of mutations, and its mosaic structure is the result of a joint effort by the crossover and gene conversion events. To capture this imperfect copying process, Gay et al. [3] have designed a factorial hidden Markov model (HMM) [12,13] with two independent hidden chains. The crossover chain is modeled as a Poisson process with rate ρ along the sequence; for the gene conversion chain, both initiation and termination of a conversion tract are modeled as Poisson processes, with rates γ and 1/λ respectively. The joint configuration of the states in these two chains determines the index of the haplotype from which the copying is performed. See [3] and Figure two(a) in [4] for more details.

The OVERPAINT model
Because gene conversion events involve non-reciprocal transfer of genetic information between homologous sequences, the typical product created by a gene conversion event is a descendant sequence that consists of a prefix of a sequence h followed by a short internal fragment of another sequence h , which is then followed by a suffix of the first sequence h. However, the independent assumption of the two hidden chains in the factorial HMM formulation of the GenCo model cannot capture this alternating pattern of the descendant sequence. An improved model called OVERPAINT based on an interleaved HMM ( Figure 1) is introduced in [4]. The desired alternating pattern is achieved by coupling the crossover  [4]]. h k+1,j is the allele state at the j-th site of haplotype h k+1 . X j and G j denote the j-th hidden state of the crossover and gene conversion chain, respectively, and their joint configuration determines the index of the haplotype from which h k+1,j is copied. and gene conversion chains as well as by defining their new transition probabilities. In Figure 1, direct edges from the gene conversion chain to the crossover chain constrain the crossover chain to stay in the same state as the previous site whenever the current site is in a gene conversion tract. To be precise, the transition probability of the crossover chain is specified as If site j + 1 is not in a conversion tract (G j+1 is in the null state ∅), the crossover chain evolves according to the same Poisson process as defined in the GenCo model [3]. Otherwise, if site j + 1 is in a conversion tract (G j+1 = ∅), the crossover chain keeps track of the state in the previous site, i.e., the indicator function I sets X j+1 = X j .
In addition to constructing coupled hidden chains to capture the alternating pattern of gene conversion, another key feature of the OVERPAINT model is to allow for overlapping gene conversion events in the copying process. This is motivated by the observation that it is possible for the coalescent model with gene conversion to generate genealogies in which the gene conversion tracts partially overlap or are completely nested within each other. See [4] and [7] for details of the OVER-PAINT model, including the exact form of the initial and transition probabilities of hidden chains as well as the forward-backward algorithm to compute the approximate conditional probabilitiesπ(h k+1 | h 1 , . . . , h k , ρ, γ , λ).
Finally, by taking into account the prior information that the tract length typically ranges between 0.05 and 2 kb [14,15], a prior on the mean tract length λ can be imposed: where N(μ, σ 2 ) denotes a standard normal distribution with mean μ and variance σ 2 . This prior is deliberately chosen to ensure P(λ ∈[ 0.05, 2] ) = 95%. A standard derivative-free optimization algorithm, the Nelder-Mead simplex-reflection method [16], is applied to find the best point estimates of ρ, γ , λ by maximizing the posterior Here, we use OVERPAINT (ρ, γ , λ) to refer to the likelihood function of the OVERPAINT model and f (λ) to denote the density of λ that corresponds to (1). The prior can also be interpreted as a regularizer to penalize very http://www.biomedcentral.com/1471-2156/15/122 small or very large values of λ, and hence can yield more stable numerical results [7].

Motivation examples
In the settings of nonzero crossover and nonzero gene conversion rates, the studies in [4,7] have shown that the OVERPAINT model provides a substantial improvement over the GenCo model in the accuracy of point estimation. However, as we will show below, the point estimators tend to be inflated and thus become unreliable when one of the recombination rates lies on the boundary of the parameter space, i.e., ρ = 0 or γ = 0. In conducting the simulation, 100 data sets with gene conversions only (ρ = 0) and crossovers only (γ = 0 and λ = 0), respectively, are independently generated by the coalescent simulation program MS [17]. In each simulation, we generate a 20 kb region using θ = 1.0/kb for the mutation rate and λ = 0.5 kb for the mean tract length if the gene conversion rate γ = 0, then we obtain the point estimation of all three parameters ρ, γ and λ by maximizing (2). Table 1 summarizes the parameter estimates on the data sets generated with gene conversions only (i.e., the crossover rate ρ = 0). The column labeledρ displays the mean and standard deviation (shown in parentheses) of the estimates of ρ. It indicates that the estimates of ρ are well behaved over a range of simulated data sets with gene conversion rate γ = 0.5, 1.0, 2.5, 5.0, 10.0/kb, though they are slightly biased upward on the data sets simulated with a large gene conversion rate (γ = 10.0/kb). In contrast, as the column labeledγ of Table 2 shows, the estimates of γ are significantly inflated when there is actually no gene conversion (i.e., γ = 0). Gay et al. [3] have made the same observation about an overestimation of the gene conversion rate γ by their model GenCo, when gene conversion is nonexistent (see their Figure three).
In what follows, we will mainly focus on testing the null hypothesis H 0 : γ = 0 (no gene conversion), but our testing procedure as outlined in Algorithm 1 can also be easily modified to testing H 0 : ρ = 0, as we will demonstrate in the section of "Results and discussion". For each value of the gene conversion rate γ (per kb), 100 data sets with a sample size n = 20 are independently generated using the MS program [17] with a mutation rate θ = 1.0/kb and a mean tract length λ = 0.5 kb. a The mean and SD (in parenthesis) of the parameter estimates. b #(ρ; k): the number of data sets withρ in the range (0, kγ ). For each value of the crossover rate ρ (per kb), 100 data sets with a sample size n = 20 are independently generated using the MS program [17] with a mutation rate θ = 1.0/kb. a The mean and SD (in parenthesis) of the parameter estimates. b #γ ; k : the number of data sets withγ in the range (0, kρ).

Parametric bootstrap
It seems inevitable to obtain an overestimation of the gene conversion rate when γ = 0 because the true value lies on the boundary of the possible range. We formulate and address this problem in a hypothesis testing framework, and devise a testing procedure based on the likelihood ratio test (LRT). Our null hypothesis is H 0 : γ = 0 (no gene conversion), and the test statistic of the sampled haplotypes H is the likelihood ratio statistic: where L OVERPAINT (ρ, 0, 0 | H) denotes the function in (2) computed with crossover rate ρ only (i.e., the original PAC model in [2]). As usual, large values of the observed statistic (H) would lead us to favor the alternative hypothesis and possibly to reject the null hypothesis H 0 . The key question is: what is the critical value of (H) used to reject H 0 ? One might conjecture that the LRT statistic in (3) would follow an asymptotic χ 2 2 distribution under the null hypothesis. However, as Figure 2 and Additional file 1: Figure S1 show, the null distribution of the LRT statistic (H) is not well approximated by the desired χ 2 2 distribution, as least not for a sample size of n = 35. Even for larger sample sizes, we believe that the chi-squared approximation is still inaccurate because of two facts: first, the null value lies on the boundary of the parameter space; second, the model is not identifiable, i.e., two distinct parameter settings γ = 0 and λ = 0 give rise to the same likelihood. Therefore, the regularity conditions of the classical large-sample theory are violated, and it becomes invalid to apply the standard large-sample approximation to the distribution of the LRT statistic (H) [18].
As Figure 2 and Additional file 1: Figure S1 show, the null distribution of the LRT statistic (H) and its critical http://www.biomedcentral.com/1471-2156/15/122 value (the 95% quantile) depends on the crossover rate ρ, which is an unknown nuisance parameter under the null hypothesis H 0 . This observation motivates us to develop a parametric bootstrap procedure [19] to obtain an approximate p-value for the observed test statistic (H), as outlined in Algorithm 1. Instead of constructing the whole null distribution of the LRT statistic, we draw B samples of size n from the null hypothesis with a crossover rate ofρ, which is the parametric estimate of the nuisance parameter ρ under H 0 . We then evaluate the test statistic on each bootstrap sample, and count the proportion that exceed the observed statistic.  These estimates, computed asρ = argmax ρ L OVERPAINT (ρ, 0, 0), are used to draw bootstrap replications (line 4 in Algorithm 1) and then to estimate the bootstrap p-values (as in Figure 3 and Additional file 2: Figure S2). a The mean and SD (in parenthesis) of the estimates of ρ. b #ρ; k: the number of data sets withρ within a factor of k from the true ρ.

Simulation study
To evaluate the performance of our testing procedure, we use the same parameter settings as in the section "Motivation examples" to conduct the simulation. All reported p-values are based on B = 200 bootstrap samples.

p-values under the null hypothesis
Under the null hypothesis H 0 : γ = 0, we use the values 0.5, 1.0, 2.5, 5.0 and 10.0/kb for the crossover rate ρ (the nuisance parameter). For each value of ρ, we generate 100 simulated data sets with sample sizes of n = 20 and n = 35 haplotypes, respectively. We then apply our parametric bootstrap procedure presented in Algorithm 1 to compute an estimate of the p-value for each data set. Figure 3 and Additional file 1: Figure S2 show that the bootstrap estimates of the null p-values closely follow the uniform distribution over the interval (0, 1), thereby exhibiting excellent agreement with theoretical prediction. Table 3 summarizes the estimated nuisance parameter ρ under the null hypothesis (line 3 in Algorithm 1) that are used to draw bootstrap replications (line 4 in Algorithm 1). Though the estimates are slightly biased downwards for large values of true ρ, the empirical behavior shown in Figure 3 and Additional file 1: Figure S2 suggests that it suffices to draw bootstrap samples from approximately correct null distributions in our case to obtain good estimates of the null p-values.

p-values under the alternative hypothesis
Under the alternative hypothesis H 1 : γ = 0, different combinations of ρ and γ are chosen in the simulation, and the ratio of gene conversion to crossover rate f = γ /ρ ranges over 0.5, 1.0, 2.5, 5.0 and 10.0. For each parameter setting, we generate 100 data sets with a mutation rate θ = 1.0/kb, a mean tract length λ = 0.5 kb, and sample sizes n = 20 and n = 35, respectively. Figure 4 shows the bootstrap estimates of the alternative p-values and the power of the test when setting the p-value threshold to 0.05. As the rate ratio f = γ /ρ or the sample size n increases, the alternative p-values tend to decrease  towards 0, leading to increased power of detecting gene conversion.

A real biological application
We apply our testing procedure to SNP data sets from two genes, su(s) and su(w a ), located near the telomere of the X chromosome of African Drosophila melanogaster [20].  Tables 4  and 5.
For the su(s) locus, the p-values against H 0 : ρ = 0 for all the segments (including the whole locus) show no evidence of detecting crossover. However, a small p-value (0.01) against H 0 : γ = 0 is observed for the shortest segment s3, and the overall effect is to provide a strong evidence of gene conversion for the whole locus (p-value = 0.03). This is consistent with the conclusion that gene conversion is likely to play a leading role in shaping the observed polymorphism in this region [20].
A similar pattern of the p-values holds for the su(w a ) locus, except that the p-values against H 0 : γ = 0 and H 0 : ρ = 0 for the shortest segment s1 are both significant at the 5% level: 0.01 and 0.03, respectively. This could imply that while gene conversion rate is high in this short segment, crossover may not have been greatly suppressed. It could also suggest a higher proportion of gene conversions that are accompanied by crossover events.

Conclusion
In this work, we have introduced a hypothesis testing procedure that can provide statistical evidence from population genetic data about whether one of the two recombination events is absent. By extensive coalescent simulation studies, we have shown that our parametric bootstrap approach is able to yield accurate estimates of the null p-values that closely follow the theoretical prediction. On the other hand, the bootstrap estimates of the alternative p-values tend to concentrate closer to zero. Our results on real SNP data sets from the su(s) and su(w a ) loci of African D. melanogaster indicate a strong evidence of detecting gene conversion in short segments of these regions. Moreover, crossover may also play an important role in a short segment of the su(w a ) locus. We believe that our method provides a necessary complement to the existing procedures of estimating meiotic recombination rates from population genetic data, and expect it to be applied to other data sets.