# Hypothesis testing of meiotic recombination rates from population genetic data

- Junming Yin
^{1}Email author

**15**:122

https://doi.org/10.1186/s12863-014-0122-7

© Yin; licensee BioMed Central Ltd. 2014

**Received: **6 May 2014

**Accepted: **28 October 2014

**Published: **30 November 2014

## Abstract

### Background

Meiotic recombination, one of the central biological processes studied in population genetics, comes in two known forms: crossovers and gene conversions. A number of previous studies have shown that when one of these two events is nonexistent in the genealogical model, the point estimation of the corresponding recombination rate by population genetic methods tends to be inflated. Therefore, it has become necessary to obtain statistical evidence from population genetic data about whether one of the two recombination events is absent.

### Results

In this paper, we formulate this problem in a hypothesis testing framework and devise a testing procedure based on the likelihood ratio test (LRT). However, because the null value (i.e., zero) lies on the boundary of the parameter space, the regularity conditions for the large-sample approximation to the distribution of the LRT statistic do not apply. In turn, the standard chi-squared approximation is inaccurate. To address this critical issue, we propose a parametric bootstrap procedure to obtain an approximate *p*-value for the observed test statistic. Coalescent simulations are conducted to show that our approach yields accurate null *p*-values that closely follow the theoretical prediction while the estimated alternative *p*-values tend to concentrate closer to zero. Finally, the method is demonstrated on a real biological data set from the telomere of the *X* chromosome of African *Drosophila melanogaster*.

### Conclusions

Our methodology provides a necessary complement to the existing procedures of estimating meiotic recombination rates from population genetic data.

## Keywords

## Background

Meiotic recombination is one of the essential evolutionary factors responsible for promoting genetic diversity within species. There are two major types of meiotic recombination events: crossovers and gene conversions. Unlike crossover, which is a reciprocal event, gene conversion is a unidirectional event that involves the transfer of a short segment of one parental chromosome (called a ‘conversion tract’) to the other parental chromosome. Crossovers and gene conversions play different roles in shaping the pattern of linkage disequilibrium (LD) observed in natural populations: “Recombination between pairs of markers that are far apart are almost exclusively crossovers, whereas pairs of markers that are close together are affected by both crossovers and gene conversion events” [1]. Thus, studying these two biological processes and characterizing their basic parameters has direct implications for population genetic studies.

There is a growing body of work on coalescent-based statistical approaches to jointly estimating the crossover rate, the gene conversion rate, and the mean conversion tract length from population genetic data. Building on a popular framework called the “Product of Approximate Conditionals” (PAC) model [2], Gay et al. [3] have proposed a likelihood-based method to incorporate gene conversion events. Yin et al. [4] have extended and improved the model further by explicitly modeling overlapping gene conversion events. On the flip side of these two frequentist approaches, Bayesian Markov chain Monte Carlo (MCMC) techniques have also been developed to estimate recombination rates from population genetic data [5],[6].

Despite the marked progress in the joint estimation of the aforementioned three parameters, these methods are less suitable when one of the two recombination events is absent in the genealogical model. The corresponding population parameter, especially the gene conversion rate when the gene conversion event is nonexistent, tends to be overestimated by the maximum likelihood (or maximum a posteriori) point estimation. This is unfortunately inevitable because the true parameter value (i.e., zero) lies on the boundary of the parameter space. The use of inaccurate parameters may limit the efficacy of these approaches, and can also hinder population genetic analyses based on these estimators. Therefore, it has become necessary to obtain statistical evidence from population genetic data about whether one of the two recombination events is absent.

The goal of this article is to propose a rigorous procedure to perform hypothesis testing for this problem. Our approach is based on the likelihood ratio test (LRT). One of the classical regularity conditions for the asymptotic distribution of the LRT statistic requires the null value to be an interior point in the parameter space. However, because this condition is not satisfied, it is invalid to apply the standard chi-squared approximation in this setting. We thus develop a parametric bootstrap procedure to obtain an approximate *p*-value of the observed test statistic. Coalescent-based simulations are conducted to demonstrate the soundness and effectiveness of our approach. The bootstrap estimates of the null p-values closely follow the theoretical prediction, while the estimated alternative *p*-values tend to concentrate closer to zero. Finally, we apply the method to a real biological data set from the telomere of the *X* chromosome of African *D. melanogaster*. The result suggests that while gene conversion is likely to play a leading role in shaping the observed polymorphism in these regions, crossover may not have been greatly suppressed in a short segment of *s* *u*(*w* ^{
a
}) locus.

## Methods

We begin by reviewing some previous statistical models used for *point estimation* of recombination parameters from population genetic data. In developing our *hypothesis testing* procedure based on the likelihood ratio test (LRT), we adopt the likelihood function of the OVERPAINT model that offers greater flexibility by allowing for overlapping gene conversions [4],[7]. Throughout this paper, *ρ* and *γ* are used to refer to the population-scaled crossover and gene conversion rates (per kb), respectively. The mean length of gene conversion tracts (kb) is denoted by *λ*.

### The PAC model and the GenCo model

*n*haplotypes

*H*={

*h*

_{1},...,

*h*

_{ n }} sampled from a natural population, the estimation of

*ρ*,

*γ*and

*λ*can be obtained by maximizing the likelihood function . However, unless we can examine the true genealogical history of sampled sequences in the population [8], which is rarely available in a population genetic study, we are unable to compute the exact likelihood function in most models of interest. To be precise,

*G*and is modeled by the coalescent process with crossovers and gene conversions [9],[10]. The above likelihood computation is notoriously difficult because the number of genealogies

*G*consistent with the sampled haplotypes

*H*, where the consistency is determined by , grows extremely fast as the length of sampled haplotypes increases [11]. Several approximate-likelihood approaches have therefore been developed to approximate the likelihood surface. The ‘Product of Approximate Conditionals’ (PAC) model, first proposed in [2], makes use of the fact that the joint likelihood of the sampled haplotypes can be decomposed into a product of conditional probabilities:

Instead of maximizing the true but intractable likelihood function *l*, the idea of the PAC model is to use the approximate likelihood *l* _{PAC} as a surrogate function to estimate recombination parameters from the sampled haplotypes. The original PAC model [2] has only considered the estimation of the crossover rate *ρ*, in which case *l* _{PAC} becomes a one dimensional function. Gay et al. [3] have extended the model by incorporating gene conversion events, and their model GenCo can be used to jointly estimate the crossover rate *ρ*, the gene conversion rate *γ*, and the mean conversion tract *λ*.

The choice of the approximate conditional probabilities
in the GenCo model assumes that *h* _{k+1} is an imperfect mosaic copy of *h* _{1},...,*h* _{
k
}. In particular, *h* _{k+1} is considered to consist of a mixture of segments from *h* _{1},...,*h* _{
k
} with a small number of mutations, and its mosaic structure is the result of a joint effort by the crossover and gene conversion events. To capture this imperfect copying process, Gay et al. [3] have designed a factorial hidden Markov model (HMM) [12],[13] with two independent hidden chains. The crossover chain is modeled as a Poisson process with rate *ρ* along the sequence; for the gene conversion chain, both initiation and termination of a conversion tract are modeled as Poisson processes, with rates *γ* and 1/*λ* respectively. The joint configuration of the states in these two chains determines the index of the haplotype from which the copying is performed. See [3] and Figure two(a) in [4] for more details.

### The OVERPAINTmodel

*h*followed by a short internal fragment of another sequence

*h*′, which is then followed by a suffix of the first sequence

*h*. However, the independent assumption of the two hidden chains in the factorial HMM formulation of the GenCo model cannot capture this alternating pattern of the descendant sequence. An improved model called OVERPAINT based on an interleaved HMM (Figure 1) is introduced in [4]. The desired alternating pattern is achieved by coupling the crossover and gene conversion chains as well as by defining their new transition probabilities. In Figure 1, direct edges from the gene conversion chain to the crossover chain constrain the crossover chain to stay in the same state as the previous site whenever the current site is in a gene conversion tract. To be precise, the transition probability of the crossover chain is specified as

If site *j*+1 is not in a conversion tract (*G* _{j+1} is in the null state *∅*), the crossover chain evolves according to the same Poisson process as defined in the GenCo model [3]. Otherwise, if site *j*+1 is in a conversion tract (*G* _{j+1}=*∅*), the crossover chain keeps track of the state in the previous site, i.e., the indicator function sets *X* _{j+1}=*X* _{
j
}.

In addition to constructing coupled hidden chains to capture the alternating pattern of gene conversion, another key feature of the OVERPAINT model is to allow for *overlapping* gene conversion events in the copying process. This is motivated by the observation that it is possible for the coalescent model with gene conversion to generate genealogies in which the gene conversion tracts partially overlap or are completely nested within each other. See [4] and [7] for details of the OVERPAINT model, including the exact form of the initial and transition probabilities of hidden chains as well as the forward-backward algorithm to compute the approximate conditional probabilities
.

*λ*can be imposed:

*N*(

*μ*,

*σ*

^{2}) denotes a standard normal distribution with mean

*μ*and variance

*σ*

^{2}. This prior is deliberately chosen to ensure . A standard derivative-free optimization algorithm, the Nelder-Mead simplex-reflection method [16], is applied to find the best point estimates of

*ρ*,

*γ*,

*λ*by maximizing the posterior

Here, we use *L* _{OVERPAINT}(*l*,*l*,*l*) to refer to the likelihood function of the OVERPAINT model and *f*(*λ*) to denote the density of *λ* that corresponds to (1). The prior can also be interpreted as a regularizer to penalize very small or very large values of *λ*, and hence can yield more stable numerical results [7].

### Motivation examples

In the settings of nonzero crossover and nonzero gene conversion rates, the studies in [4],[7] have shown that the OVERPAINT model provides a substantial improvement over the GenCo model in the accuracy of point estimation. However, as we will show below, the point estimators tend to be inflated and thus become unreliable when one of the recombination rates lies on the boundary of the parameter space, i.e., *ρ*=0 or *γ*=0. In conducting the simulation, 100 data sets with gene conversions only (*ρ*=0) and crossovers only (*γ*=0 and *λ*=0), respectively, are independently generated by the coalescent simulation program MS [17]. In each simulation, we generate a 20 kb region using *θ*=1.0/kb for the mutation rate and *λ*=0.5 kb for the mean tract length if the gene conversion rate *γ*=0, then we obtain the point estimation of all three parameters *ρ*,*γ* and *λ* by maximizing (2).

*ρ*=0). The column labeled displays the mean and standard deviation (shown in parentheses) of the estimates of

*ρ*. It indicates that the estimates of

*ρ*are well behaved over a range of simulated data sets with gene conversion rate

*ρ*=0.5,1.0,2.5,5.0,10.0/kb, though they are slightly biased upward on the data sets simulated with a large gene conversion rate (

*ρ*=10.0/kb). In contrast, as the column labeled of Table 2 shows, the estimates of

*ρ*are significantly inflated when there is actually no gene conversion (i.e.,

*ρ*=0). Gay et al. [3] have made the same observation about an overestimation of the gene conversion rate

*ρ*by their model GenCo, when gene conversion is nonexistent (see their Figure three).

**Summary of parameter estimates on simulated data sets with gene conversions only (**
ρ
**=0)**

**Summary of parameter estimates on simulated data sets with crossovers only (**
γ
**=0)**

In what follows, we will mainly focus on testing the null hypothesis *H* _{0}:*ρ*=0 (no gene conversion), but our testing procedure as outlined in Algorithm 1 can also be easily modified to testing *H* _{0}:*ρ*=0, as we will demonstrate in the section of “Results and discussion”.

### Parametric bootstrap

*ρ*=0 because the true value lies on the boundary of the possible range. We formulate and address this problem in a hypothesis testing framework, and devise a testing procedure based on the likelihood ratio test (LRT). Our null hypothesis is

*H*

_{0}:

*ρ*=0 (no gene conversion), and the test statistic of the sampled haplotypes

*H*is the likelihood ratio statistic:

where *L* _{OVERPAINT}(*ρ*,0,0 *H*) denotes the function in (2) computed with crossover rate *ρ* only (i.e., the original PAC model in [2]).

*ρ*(

*H*) would lead us to favor the alternative hypothesis and possibly to reject the null hypothesis

*H*

_{0}. The key question is: what is the critical value of

*ρ*(

*H*) used to reject

*H*

_{0}? One might conjecture that the LRT statistic in (3) would follow an asymptotic distribution under the null hypothesis. However, as Figure 2 and Additional file 1: Figure S1 show, the null distribution of the LRT statistic

*ρ*(

*H*) is not well approximated by the desired distribution, as least not for a sample size of

*n*=35. Even for larger sample sizes, we believe that the chi-squared approximation is still inaccurate because of two facts: first, the null value lies on the boundary of the parameter space; second, the model is not identifiable, i.e., two distinct parameter settings

*ρ*=0 and

*ρ*=0 give rise to the same likelihood. Therefore, the regularity conditions of the classical large-sample theory are violated, and it becomes invalid to apply the standard large-sample approximation to the distribution of the LRT statistic

*ρ*(

*H*) [18].

As Figure 2 and Additional file 1: Figure S1 show, the null distribution of the LRT statistic *ρ*(*H*) and its critical value (the 95*%* quantile) depends on the crossover rate *ρ*, which is an unknown *nuisance parameter* under the null hypothesis *H* _{0}. This observation motivates us to develop a parametric bootstrap procedure [19] to obtain an approximate *p*-value for the observed test statistic *ρ*(*H*), as outlined in Algorithm 1. Instead of constructing the whole null distribution of the LRT statistic, we draw *B* samples of size *n* from the null hypothesis with a crossover rate of
, which is the parametric estimate of the nuisance parameter *ρ* under *H* _{0}. We then evaluate the test statistic on each bootstrap sample, and count the proportion that exceed the observed statistic.

## Results and discussion

### Simulation study

To evaluate the performance of our testing procedure, we use the same parameter settings as in the section “Motivation examples” to conduct the simulation. All reported *p*-values are based on *B*=200 bootstrap samples.

*p*-values under the null hypothesis

*H*

_{0}:

*ρ*=0, we use the values 0.5,1.0,2.5,5.0 and 10.0/kb for the crossover rate

*ρ*(the nuisance parameter). For each value of

*ρ*, we generate 100 simulated data sets with sample sizes of

*n*=20 and

*n*=35 haplotypes, respectively. We then apply our parametric bootstrap procedure presented in Algorithm 1 to compute an estimate of the

*p*-value for each data set. Figure 3 and Additional file 1: Figure S2 show that the bootstrap estimates of the null

*p*-values closely follow the uniform distribution over the interval (0,1), thereby exhibiting excellent agreement with theoretical prediction. Table 3 summarizes the estimated nuisance parameter

*ρ*under the null hypothesis (line 3 in Algorithm 1) that are used to draw bootstrap replications (line 4 in Algorithm 1). Though the estimates are slightly biased downwards for large values of true

*ρ*, the empirical behavior shown in Figure 3 and Additional file 1: Figure S2 suggests that it suffices to draw bootstrap samples from approximately correct null distributions in our case to obtain good estimates of the null

*p*-values.

**Summary of the estimated nuisance parameter** γ **under the null hypothesis** H _{
0
}: γ **=0**

*p*-values under the alternative hypothesis

*H*

_{1}:

*ρ*0, different combinations of

*ρ*and

*ρ*are chosen in the simulation, and the ratio of gene conversion to crossover rate

*f*=

*ρ*/

*ρ*ranges over 0.5,1.0,2.5,5.0 and 10.0. For each parameter setting, we generate 100 data sets with a mutation rate

*ρ*=1.0/kb, a mean tract length

*ρ*=0.5 kb, and sample sizes

*n*=20 and

*n*=35, respectively. Figure 4 shows the bootstrap estimates of the alternative

*p*-values and the power of the test when setting the

*p*-value threshold to 0.05. As the rate ratio

*f*=

*ρ*/

*ρ*or the sample size

*n*increases, the alternative

*p*-values tend to decrease towards 0, leading to increased power of detecting gene conversion.

### A real biological application

*s*

*u*(

*s*) and

*s*

*u*(

*w*

^{ a }), located near the telomere of the

*X*chromosome of African

*Drosophila melanogaster*[20]. The lengths of

*s*

*u*(

*s*) and

*s*

*u*(

*w*

^{ a }) loci are about 4.1 kb and 2.5 kb, respectively, and they are about 400 kb apart. The

*s*

*u*(

*s*) locus contains 50 haplotypes and 41 SNPs, and the

*s*

*u*(

*w*

^{ a }) locus contains 50 haplotypes and 46 SNPs. The two data sets are further divided into overlapping segments of 20 SNPs each (except for the last segment with 21 SNPs), with 15 SNPs of overlap between two adjacent segments. For each segment, we apply our parametric bootstrap procedure with

*B*=500 bootstrap samples. The estimated

*p*-values for the null hypotheses

*H*

_{0}:

*ρ*=0 and

*H*

_{0}:

*ρ*=0 are shown in Tables 4 and 5.

**Bootstrap**
p
**-values for segments of the**
su(s)
**locus in**
D. melanogaster

Segment | s1 | s2 | s3 | s4 | s5 | All |
---|---|---|---|---|---|---|

Length (kb) | 1.8 | 1.8 | 1.6 | 2.4 | 2.3 | 4.1 |

| 0.32 | 0.66 | 0.01 | 0.15 | 0.54 | 0.03 |

| 0.86 | 0.46 | 0.92 | 0.64 | 0.36 | 0.45 |

**Bootstrap**
p
**-values for segments of the**
s
u
**(**
w
^{
a
}
**) locus in**
D. melanogaster

Segment | s1 | s2 | s3 | s4 | s5 | s6 | All |
---|---|---|---|---|---|---|---|

Length (kb) | 0.4 | 1.0 | 1.1 | 1.8 | 1.2 | 1.5 | 2.5 |

| 0.01 | 0.17 | 0.31 | 0.19 | 0.22 | 0.30 | 0.0 |

| 0.03 | 0.71 | 0.49 | 0.55 | 0.88 | 0.89 | 0.31 |

For the *s* *u*(*s*) locus, the *p*-values against *H* _{0}:*ρ*=0 for all the segments (including the whole locus) show no evidence of detecting crossover. However, a small *p*-value (0.01) against *H* _{0}:*ρ*=0 is observed for the shortest segment s3, and the overall effect is to provide a strong evidence of gene conversion for the whole locus (*p*-value = 0.03). This is consistent with the conclusion that gene conversion is likely to play a leading role in shaping the observed polymorphism in this region [20].

A similar pattern of the *p*-values holds for the *s* *u*(*w* ^{
a
}) locus, except that the *p*-values against *H* _{0}:*ρ*=0 and *H* _{0}:*ρ*=0 for the shortest segment s1 are both significant at the 5% level: 0.01 and 0.03, respectively. This could imply that while gene conversion rate is high in this short segment, crossover may not have been greatly suppressed. It could also suggest a higher proportion of gene conversions that are accompanied by crossover events.

## Conclusion

In this work, we have introduced a hypothesis testing procedure that can provide statistical evidence from population genetic data about whether one of the two recombination events is absent. By extensive coalescent simulation studies, we have shown that our parametric bootstrap approach is able to yield accurate estimates of the null *p*-values that closely follow the theoretical prediction. On the other hand, the bootstrap estimates of the alternative *p*-values tend to concentrate closer to zero. Our results on real SNP data sets from the *s* *u*(*s*) and *s* *u*(*w* ^{
a
}) loci of African *D. melanogaster* indicate a strong evidence of detecting gene conversion in short segments of these regions. Moreover, crossover may also play an important role in a short segment of the *s* *u*(*w* ^{
a
}) locus. We believe that our method provides a necessary complement to the existing procedures of estimating meiotic recombination rates from population genetic data, and expect it to be applied to other data sets.

## Additional files

## Declarations

### Acknowledgements

The author would like to acknowledge BioMed Central for a waiver of the article processing charge. I would also like to thank Prof. Yun S. Song, Prof. Michael I. Jordan, and Dr. Danping Liu for helpful suggestions and discussions. An allocation of computer time from the UA Research Computing High Performance Computing (HPC) and High Throughput Computing (HTC) at the University of Arizona is gratefully acknowledged.

## Authors’ Affiliations

## References

- Wall JD: Close look at gene conversion hot spots . Nat Genet. 2004, 36 (2): 114-115. 10.1038/ng0204-114.PubMedView ArticleGoogle Scholar
- Li N, Stephens M: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data . Genetics. 2003, 165 (4): 2213-2233.PubMedPubMed CentralGoogle Scholar
- Gay JC, Myers S, McVean G: Estimating meiotic gene conversion rates from population genetic data . Genetics. 2007, 177 (2): 881-894. 10.1534/genetics.107.078907.PubMedPubMed CentralView ArticleGoogle Scholar
- Yin J, Jordan MI, Song YS: Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data . Bioinformatics. 2009, 25 (12): 231-239. 10.1093/bioinformatics/btp229.View ArticleGoogle Scholar
- Wang Y, Rannala B: Population genomic inference of recombination rates and hotspots . Proc Natl Acad Sci. 2009, 106 (15): 6215-6219. 10.1073/pnas.0900418106.PubMedPubMed CentralView ArticleGoogle Scholar
- Padhukasahasram B, Rannala B: Bayesian population genomic inference of crossing over and gene conversion . Genetics. 2011, 189 (2): 607-619. 10.1534/genetics.111.130195.PubMedPubMed CentralView ArticleGoogle Scholar
- Yin J: Computational methods for meiotic recombination inference. PhD thesis, University of California, Berkeley, Berkeley, CA, 2010.Google Scholar
- Kingman JFC: The coalescent . Stochastic Processes Appl. 1982, 13 (3): 235-248. 10.1016/0304-4149(82)90011-4.View ArticleGoogle Scholar
- Wiuf C, Hein J: The coalescent with gene conversion . Genetics. 2000, 155 (1): 451-462.PubMedPubMed CentralGoogle Scholar
- Wiuf C: A coalescence approach to gene conversion . Theor Popul Biol. 2000, 57 (4): 357-367. 10.1006/tpbi.2000.1462.PubMedView ArticleGoogle Scholar
- Song YS, Lyngsø R, Hein J: Counting all possible ancestral configurations of sample sequences in population genetics . IEEE/ACM Trans Comput Biol Bioinform. 2006, 3 (3): 239-251. 10.1109/TCBB.2006.31.PubMedView ArticleGoogle Scholar
- Rabiner L: A tutorial on HMM and selected applications in speech recognition . Proc IEEE. 1989, 77 (2): 257-286. 10.1109/5.18626.View ArticleGoogle Scholar
- Ghahramani Z, Jordan MI: Factorial hidden markov models . Mach Learn. 1997, 29: 245-273. 10.1023/A:1007425814087.View ArticleGoogle Scholar
- Hilliker AJ, Harauz G, Reaume AG, Gray M, Clark SH, Chovnick A: Meiotic gene conversion tract length distribution within the rosy locus of drosophila melanogaster . Genetics. 1994, 137 (4): 1019-1026.PubMedPubMed CentralGoogle Scholar
- Jeffreys AJ, May CA: Intense and highly localized gene conversion activity in human meiotic crossover hot spots . Nat Genet. 2004, 36 (2): 151-156. 10.1038/ng1287.PubMedView ArticleGoogle Scholar
- Nocedal J, Wright SJ: Numerical Optimization . 2000, Springer, New YorkGoogle Scholar
- Hudson RR: Generating samples under the Wright-Fisher neutral model of genetic variation . Bioinformatics. 2002, 18 (2): 337-338. 10.1093/bioinformatics/18.2.337.PubMedView ArticleGoogle Scholar
- Ferguson T: A Course in Large Sample Theory. Chapman & Hall/CRC Texts in Statistical Science . 1996, Chapman and Hall/CRC, United KingdomView ArticleGoogle Scholar
- Efron B, Tibshirani RJ: An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics & Applied Probability . 1994, Chapman and Hall/CRC, United KingdomGoogle Scholar
- Langley CH, Lazzaro BP, Phillips W, Heikkinen E, Braverman JM: Linkage disequilibria and the site frequency spectra in the su(s) and su ( w a ) regions of the Drosophila melanogaster X chromosome. Genetics2000, 156:1837-1852.,Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.