- Methodology article
- Open Access
An EM algorithm for mapping segregation distortion loci
BMC Genetics volume 8, Article number: 82 (2007)
Chromosomal region that causes distorted segregation ratios is referred to as segregation distortion locus (SDL). The distortion is caused either by differential representation of SDL genotypes in gametes before fertilization or by viability differences of SDL genotypes after fertilization but before genotype scoring. In both cases, observable phenotypes are distorted for marker loci in the chromosomal region close to the SDL. Under the quantitative genetics model for viability selection by proposing a continuous liability controlling the viability of individual, a simplex algorithm has been used to search for the solution in SDL mapping. However, they did not consider the effects of SDL on the construction of linkage maps.
We proposed a multipoint maximum-likelihood method to estimate the position and the effects of SDL under the liability model together with both selection coefficients of marker genotypes and recombination fractions. The method was implemented via an expectation and maximization (EM) algorithm. The superiority of the method proposed under the liability model over the previous methods was verified by a series of Monte Carlo simulation experiments, together with a working example derived from the MAPMAKER/QTL software.
Our results suggested that the new method can serve as a powerful alternative to existing methods for SDL mapping. Under the liability model, the new method can simultaneously estimate the position and the effects of SDL as well as the recombinant fractions between adjacent markers, and also be used to probe into the genetic mechanism for the bias of uncorrected map distance and to elucidate the relationship between the viability selection and genetic linkage.
In a segregation population derived from a cross between two inbred lines, some molecular markers often show distorted segregation ratios from Mendelian expectations [1–3]. The distortion is frequently related to gamete gene, sterile gene and chromosome translocation . So the detection of the gene or locus, known as segregation distortion locus (SDL) mapping, is warranted. However, the challenge encountered in SDL mapping is mainly caused by the unavailability of phenotypic data for the underlying trait. In fact, molecular markers linked to the SDL frequently show segregation distortion and the degree of distortion depends on the size and the position of SDL. Therefore, it is possible to detect SDL by means of the distortion.
Mapping SDL is usually studied at the population level by examining the change of gene (or genotypic) frequencies of markers . In the past a single marker was often used to detect the linkage between the marker and SDL [6, 7]. Its shortcomings are very similar to those of single-marker approaches in quantitative trait loci (QTL) mapping . Since the introduction of interval mapping of QTL , Hedrick and Muona  developed a flanking-marker analysis to estimate the fitness parameters for a viability locus. The model of Hedrick and Muona  is actually a complete recessive model. Mitchell-Olds  detected one putative viability locus at a time and then scanned the entire genome for every putative position to provide a test statistic profile for the detection of SDL. However, his model only test and estimate the degree of dominance. Luo and Xu  extended the maximum-likelihood (ML) method to estimate degree of dominance and selection coefficients using an outbred full-sib family as an example. Wang et al.  developed a multipoint ML method to estimate the position and the genotypic frequencies of SDL in an F2 population. However, the efficacies of the methods mentioned above have been seldom addressed in simulation studies. Recently, Luo et al.  developed a quantitative genetics model for viability selection. This approach makes it possible to carry out simulation studies, to partition the selection into additive and dominant effects and to remove the effects of non-genetic cofactors from the analysis [14, 15]. However, this approach raises two issues. Firstly, they assumed that segregation distortion didn't affect the construction of genetic linkage map. In fact, marker segregation distortion is known to affect the estimates for both recombination fractions in pair-wise analysis of markers and the order of the markers on a linkage group [16–18]. As for the genetic parameters, then, Luo et al  adopted the Simplex algorithm  to search for the solutions at the cost of computational consuming. Under the liability model proposed by Luo et al , therefore, in this paper it is necessary to extend the multipoint approach by combining the estimations of the genetic parameters of SDL with the reconstruction of genetic linkage maps. The new method for SDL mapping was implemented via an expectation and maximization (EM) algorithm rather than Simplex procedure. The genetic factors that might affect the estimates of recombination fractions between adjacent markers would be discussed in detail. A series of Monte Carlo simulation experiments together with a working example from the Mapmaker/QTL software were carried out to verify our approach.
Considering an SDL in an F2 population derived from a cross between two inbred lines, we assumed three genotypes at this locus, AA, Aa and aa, to have genotypic values a - d, d and -a - d, respectively, with a and d indicating additive and dominant effects, and an imaginary trait, liability, invisible to the investigators but visible to nature, controlled the viabilities of individuals. It should be noted that the genetic variance in an F2 population was a2 + d2 rather than as usual. The phenotypic value of the j th individual was described by the following linear model,
z j = g j + e j (1)
where g j was the genotypic value for the j th individual, and ε j a normally distributed residual variable with mean zero and standard deviation 1.0, which accounted for polygenes that were linked to the markers and for environmental variation [14, 18]. Provided that the liability was subject to natural selection, an individual would survive if z j ≥ 0 and would be eliminated from the population if z j < 0. Since all the sampled individuals had survived from the viability selection, the liability of each genotype followed a truncated distribution with a cumulative probability, G j = h (h = 1, 2, 3), with
where h indexed the genotypes of the SDL, and f h was referred to as the relative fitness of the h th genotype . The expected frequencies of the three genotypes were
Mapping SDL under a liability model
We assumed that there was no crossing-over interference among the markers on the linkage group considered, an SDL caused segregation distortion of some or all markers linked to the SDL, and three genotypes for each marker had different viability coefficients. Let the order of the m markers on a same chromosome be M1, M2,...,M m ; x k be a dummy variable defined as x k = 1, 0, -1 for a homozygote of P1, a heterozygote and a homozygote of P2 at the k th marker, respectively; z k be indicator for phenotype of the k th marker (M k ); r k (or rk,k+1) be the recombination fraction between the k th and (k+1)th markers; and sk,1and sk,2(0 ≤ sk,1< +∞ and 0 ≤ sk,2< +∞ for k = 1, 2,...,m) be the viability coefficients of M k m k and m k m k relative to M k M k at the k th marker.
Now let an SDL locate between the k th and (k+1)th markers, and φ jh be the indicator function, taking the value of 1, if the j th individual belonged to the h th possible genotype in the F2 population, otherwise taking the value of zero. The parameters were Ω = (p AA , p Aa , p aa , δ) or Ω = (a, d, δ), with δ indicating the SDL location. The distribution of φ jh was described as
where n was sample size. The likelihood defined with matrix notation was
where = [Pr(x1 = 1), Pr(x1 = 0), Pr(x1 = -1)], c' = [1, 1, 1], ' denoted transpose of a matrix or vector, and the transition probability matrix H j (rk,k+1) from marker M k to Mk+1for the j th individual was
There were several ways to find the ML estimates (MLEs) of model parameters. We here adopted an EM algorithm  and treated φ jh as missing data. We regarded δ as constant for the moment, now the parameter set was θ = (a, d)'. For the EM algorithm, we needed to obtain the expectation of the complete data log-likelihood function,
where the constant C didn't depend on the parameters of interest, and but did depend on the viability coefficients and map distance between adjacent markers, which could be determined by Zhu et al . The EM algorithm was described as follows.
Provided the initial values for the model parameters, i.e., a(0) = 0.0 and d(0) = 0.0. The posterior probabilities of φ jh = 1 were
where (h = 1, 2, 3) was calculated from equation (3), and Pr(φ jh = 1|zj 1,...,z jM ) (j = 1, ...,n; h = 1, 2, 3) the prior probability of the h th genotype of SDL for the j th individual conditional on marker information () by means of the multipoint method .
The MLEs of parameters were obtained by the Fisher-scoring algorithm as it was impossible to get their explicit solutions . The θ could be updated by
θ(1) = θ(0) + I-1S(θ(0)) (8)
where S(θ(0)) was the score function, and I was the Fisher information matrix (more details were given in Appendix). And θ(1) would replace θ(0) in all subsequent estimating steps, and the procedure was iterated until the convergence occurred. The converged θ (1) was the MLEs of θ in this M-step.
The E and M steps were iterated until the convergence occurred.
The MLE for the SDL position could be obtained by examining the likelihood-ratio profile along the chromosome as was commonly done in interval mapping of QTL .
Following parameter estimation, we tested an overall null hypothesis that was no effect of SDL at the locus of interest (δ). The null hypothesis was formulated as H0: a = d = 0.0, which was tested using the likelihood-ratio (LR) test statistic:
LR = -2[lnL(0, 0, δ) - lnL(a, d, δ)]
Under the null hypothesis, the statistic LR approximately followed chi-square distribution with two degrees of freedom.
The critical value for power calculation was determined by computing 1,000 permutations , the experiment-wise type I error was set at 5%, and the confidence interval of an SDL location was determined by the bootstrapping method .
We simulated one chromosome of 100 cM (or 50 cM) long covered by m evenly spaced codominant markers (m = 6, 11 or 21) and put a single SDL at position 25 cM (another SDL was put at position 75 cM if necessary). The dominance ratio of the SDL was denoted by dr = d/a. Given the broad heritability (h2) and dr, the additive and dominant effects could be obtained using numerical algorithm . Based on the method described in Luo et al., all genotypes of both distorted markers and SDL for each individual in an F2 population were simulated. All simulations were replicated 100 or 1000 times depending on the purpose of the analyses. Empirical power was calculated by counting the number of runs in which test statistics were greater than the critical values .
Effects of various factors on SDL mapping
In this simulated experiments, the effects of sample size, SDL heritability and marker interval length on SDL mapping were studied, respectively. The performance of the proposed method was evaluated by statistical power, average and standard deviation of estimates with 100 replicates. All parameters and results were listed in Table 1. The results showed the general behavior of QTL mapping, i.e., the estimate for each parameter was very close to its corresponding true value, the power and the precision for SDL mapping increased with the increase in sample size and SDL heritability, respectively. However, marker interval length had slight effect on the power under the three levels studied.
Mapping multiple SDL
Similar to the interval mapping procedure of Lander and Botstein , the single-locus model for SDL mapping was used to search for multiple loci. Eleven markers were evenly placed on a simulated chromosome of length 100 cM. Two SDL each with a 0.5 dominance-ratio and a 0.15 heritability were respectively located at positions 25 cM and 75 cM on the simulated chromosome. One hundred independent simulation runs were performed for a sample size of 200. The results were listed in Table 2. Both loci were identified at almost 100% power. The results from simulation experiments demonstrated that the new method based on single-SDL model may be considered as an approximate approach to search for multiple loci if the SDL are sufficiently separated by markers.
A working example
As a demonstration of the proposed method in this paper, we re-analyzed a sample dataset (the source filename: sample.raw) in the MAPMAKER/QTL software . It consisted of 333 F2 individuals from a cross between two inbred lines in tomato. Each plant was genotyped for 12 marker loci that were divided into two linkage groups. Single-marker chi-square test showed that 5 and 2 markers on the first and second linkage groups deviated from Mendelian segregation ratios, respectively (data not shown). Given the reconstructed linkage maps using the method of Zhu et al., 1000 simulated datasets without segregation distortion were simulated and used to determine the critical value . The confidence interval of a SDL location was determined by the Bootstrap method .
The map distances between consecutive markers were calculated twice with and without considering SDL. The former was corrected map distance obtained from the method of Zhu et al. (2007) ; and the latter was uncorrected one using the Mapmaker/EXE 3.0 software . The results were listed in Table 3. The results showed that the corrected map distances differed from the uncorrected ones when there were distorted markers. The genetic reason of these inconsistencies would be discussed in the following section. Using the proposed method here, a total of four SDL were mapped (Table 4, Fig 1). Two SDL were on the first linkage group and the others on the second one. The genetic parameters for the four SDL were listed in Table 4. The results showed that the distortion was stronger for the first linkage group than for the second one (Fig 1). It resulted in a maximum difference between the corrected and uncorrected map distances for the first marker interval on the first linkage group. Moreover, two linked SDL on the second linkage group also gave rise to two big differences (Table 3). As compared to a single SDL, therefore, linked SDL had a larger effect on the estimate of map distance.
Effect of genetic model of SDL on the estimation of map distance
In this section, our purpose was to make clear the genetic reason for the inconsistencies between corrected and uncorrected map distances when there were distorted markers. Six evenly spaced codominant markers were simulated on a single-chromosome segment of length 50 cM. Two linked SDL with locations at positions 10 and 20 cM (exactly the 2nd and 3rd marker loci) were simulated on the simulated chromosome. One hundred simulation runs were performed for a sample size of 300. Each of datasets was analyzed twice by the method of Zhu et al. (2007)  and the Mapmaker/EXE 3.0 software . The former was corrected map distance and the latter uncorrected one. For an additive-dominant model, all genetic parameters and the results were listed in Table 5. Results showed that uncorrected map distance was underestimated for most cases, overestimated for opposite dominant effects, and unbiased for all negative additive effects. The results from the real dataset analysis above partly confirmed the result that opposite dominant effects of the two linked SDL on the second linkage group (Table 4) gave rise to the overestimation (Table 3). For an epistatic model, all genetic parameters and the results were listed in Table 6. Results showed that uncorrected map distance was underestimated for most situations, overestimated for negative additive-by-additive or negative dominant-by-dominant effects, and unbiased for additive-by-dominant effect. As we expected, corrected genetic distances were unbiased when considering SDL (Table 5 and 6). Hence, corrected linkage maps were recommended to be used for further QTL or SDL analysis unless there was strong evidence to believe that all markers presented typical Mendelian segregation.
For SDL mapping, most researchers concentrate their attention upon detecting and testing either the selection coefficients or the degree of dominance under the fitness model [7, 10, 11]. Luo et al. pioneered in the development of SDL mapping under a liability model. Zhu et al.  proposed a new method for the reconstruction of linkage maps with distorted, dominant and missing markers. Under the liability model, we developed a method to simultaneously estimate the position and the effects of SDL as well as the recombination fractions between adjacent markers. This approach remains the merits of Luo et al. but differs from others in several aspects. Firstly, it combines the detection of SDL with the reconstruction of marker linkage map. The position and the effect of SDL can be estimated along with the selection coefficient and the degree of dominance. Then, the proposed method may be used to elucidate the relationship between the viability selection and genetic linkage. Thirdly, the likelihood function is involved in the distribution of genotypes of SDL rather than that of marker genotypes in the previous studies [11, 28]. Finally, we adopted an EM algorithm rather than the Simplex procedure to estimate the genetic parameters. Of course, we should notice one common assumption of the mentioned-above approaches that marker segregation distortion is caused by some genetic or viability reasons. For genetic reason, there are two different mechanisms for segregation distortion, one at the gametic level and the other at the zygotic level. In both cases, observable phenotypes are distorted for marker loci in the chromosomal region close to the SDL. Thus the two mechanisms are included in our proposed method. Although we have no way to distinguish them in SDL mapping, the results from the genotype and allele tests  for the marker closest to the SDL can be used to infer the presence of zygotic or gametic viability selection in an F2 population but not in backcross, double haploid and recombinant inbred line populations. Moreover, it should be noted that genetic linkage between distorted markers has been carefully discussed in Wu et al. (2007) .
There are two primary routes by which selection can affect the extent of linkage disequilibrium . The first is a hitchhiking effect, in which an entire haplotype that flanks a favored variant can be rapidly swept to high frequency or even fixation . The second way in which selection can affect linkage disequilibrium is through epistatic selection for combination of alleles at two or more loci on the same chromosome . This selection form leads to the association of the particular alleles at different loci. The major difficulty in linkage disequilibrium-based mapping is to quantify the relationship between recombination fraction and linkage disequilibrium measurement. Our analyses are confined to exclude the factors that influence linkage disequilibrium except linkage and selection. We first combine the viability selection with quantitative genetics model, and then explore the relationship between genetic modes of the viability genes and the estimates of the recombination fraction. The simulation studies indicated that most of the genetic modes of the viability genes at the two linked SDL may result in underestimation of genetic distance. We hope that the tentative attempt will make for elucidating the genetic relationship between viability selection and genetic linkage.
In addition, it will be interesting and challenging to combine the SDL analysis with QTL mapping to see what the effects of distorted markers has on the results of QTL mapping. While doing this, one may take a risk of detecting false QTL not due to their genetic effects on the quantitative traits but due to violation of the Mendelian segregation law. It will be a great breakthrough in quantitative genetics area if we can develop a method to separate the effects of viability loci from the effects of QTL . By reason of the complexity of the combined analysis, the related investigations will be discussed separately elsewhere.
Our results suggested that the proposed method can serve as a powerful alternative to existing methods. Under the liability model, the new method can simultaneously estimate the position and the effects of SDL as well as the recombination fractions between adjacent markers, and also be used to probe into the genetic mechanism for the bias of uncorrected map distance and to elucidate the relationship between the viability selection and genetic linkage.
Appendix: Fisher-scoring algorithms for obtaining MLEs of parameters
The Fisher-scoring algorithm can be used to estimate parameters in the M-step of EM algorithm. Let θ = (a, d)T. The newly estimated θ can be expressed by the score-function vector S and the Fisher information matrix I,
where S = ∂lnL/∂θ = (∂lnL/∂a, ∂lnL/∂d)Tis score function, and
is Fisher information matrix.
More specifically, the score function and the Fisher information index of the expected complete data log-likelihood can be derived using
Let μ h = (2 - h)a + (-1)hd for h = 1, 2, 3
with or (-1)hwhen θ = a or d correspondingly. Hence
The second partial derivatives are more messy but a general form was found as
Lyttle TW: Segregation distortion. Annual Review of Genetics. 1991, 25: 511-557. 10.1146/annurev.ge.25.120191.002455.
Carr DE, Dudash MR: Recent approaches into the genetic basis of inbreeding depression in plants. Philos Trans R Soc London B. 2003, 358: 1071-1084. 10.1098/rstb.2003.1295.
Falconer DS, Mackay TFC: Introduction to Quantitative Genetics. 1996, London: Longman, Fourth
Harushima Y, Nakagahra M, Yano M, Sasaki N: Diverse variation of reproductive barriers in three intraspecific rice crosses. Genetics. 2002, 160: 313-322.
Hartl DL, Clark AG: Principles of population genetics. 1997, Sunderland (MA): Sinauer Associates, 3
Xu Y, Zhu L, Xiao J, Huang N, McCouch SR: Chromosomal regions associated with segregation distortion of molecular markers in F2, backcross, doubled haploid, and recombinant inbred populations in rice (Oryza sativa L.). Molecular General Genetics. 1997, 253: 535-545. 10.1007/s004380050355.
Fu YB, Ritland K: Evidence for the partial dominance of viability genes contributing to inbreeding depression in Mimulus guttatus. Genetics. 1994, 136: 323-331.
Ritland K: Inferring the genetic basis of inbreeding depression in plants. Genome. 1996, 39: 1-8.
Lander E, Botstein D: Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989, 121: 185-199.
Hedrick PW, Muona O: Linkage of viability genes to marker loci in selfing organisms. Heredity. 1990, 64: 67-72.
Mitchell-Olds T: Interval mapping of viability loci causing heterosis in Arabidopsis. Genetics. 1995, 140 (3): 1105-1109.
Luo L, Xu SZ: Mapping viability loci using molecular markers. Heredity. 2003, 90: 459-467. 10.1038/sj.hdy.6800264.
Wang CM, Zhu CS, Zhai HQ, Wan JM: Mapping segregation distortion loci (SDL) and quantitative trait loci (QTL) for spikelet sterility in rice (Oryza sativa L.). Genet Res. 2005, 86: 97-106. 10.1017/S0016672305007779.
Luo L, Zhang YM, Xu SZ: A quantitative genetics model for viability selection. Heredity. 2005, 94: 347-355. 10.1038/sj.hdy.6800615.
Nichols RA: Quantitative genetics focus issue. Heredity. 2005, 94: 273-274. 10.1038/sj.hdy.6800646.
Lorieux MB, Perrier GX, Gonzalez de Leon , Lanaud C: Maximum likelihood models for mapping genetic markers showing segregation distortion. 1. Backcross population. Theor Appl Genet. 1995, 90: 73-80.
Lorieux M, Perrier X, Goffinet B, Lanaud C, Gonzalez de Leon D: Maximum likelihood models for mapping genetic markers showing segregation distortion. 2. F2 population. Theor Appl Genet. 1995, 90: 81-89.
Zhu C, Wang C, Zhang YM: Modeling segregation distortion for viability selection I. Reconstruction of linkage maps with distorted markers. Theor Appl Genet. 2007, 114: 295-305. 10.1007/s00122-006-0432-x.
Nelder JA, Mead R: A simplex method for function minimization. The Computational Journal. 1965, 7: 308-313.
Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via EM algorithm. J Royal Stat Soc B. 1977, 39: 1-38.
Rao SQ, Xu SZ: Mapping quantitative trait loci for ordered categorical traits in four-way crosses. Heredity. 1998, 81: 214-224. 10.1038/sj.hdy.6883780.
Bailey NTJ: Introduction to the mathematical theory of genetic linkage. 1961, Great Britain: Oxford University Press
Churchill GA, Doerge RK: Empirical threshold values for quantitative trait mapping. Genetics. 1994, 138: 963-971.
Visscher PM, Thompson P, Haley CS: Confidence intervals in QTL mapping by bootstrapping. Genetics. 1996, 143: 1013-1020.
Press WH, Flanner BP, Teukolsky SA, Vellerting WT: Numerical Recipes in C++: The Art of Scientific Computing. 2nd version. 2001, Cambridge University Press, New York
Carbonell EA, Gerig TME, Balansard E, Asins MJ: Interval mapping in the analysis of non-additive quantitative trait loci. Biometrics. 1992, 48: 305-315. 10.2307/2532757.
Lander E, Green P, Abrahamson J, Barlow A, Daly MJ, Lincoln SE, Newburg L: MAPMAKER: An interactive computer package for construction primary genetic linkage maps of experimental and natural populations. Genomics. 1987, 1: 174-181. 10.1016/0888-7543(87)90010-3.
Huang H, Richardson TE, Carson SD, Bongarten BC: Genetic analysis of inbreeding depression in plus tree 850.55 of Pinus radiate D. Don. II Genetics of viability genes. Theor Appl Genet. 1999, 99: 140-146. 10.1007/s001220051218.
Pham JL, Glaszmann JC, Sano R, Barbier P, Ghesquiere A, Second G: Isozyme markers in rice: genetic analysis and linkage relationships. Genome. 1990, 33: 348-359.
Wu R, Ma C, Casella G: Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. 2007, Springer, New York, 123-134.
Ardlie KG, Kruglyak L, Seielstad M: Patterns of linkage disequilibrium in the human genome. Nat Rev Genet. 2002, 3: 299-309. 10.1038/nrg777.
Lewontin RC: The interaction of selection and linkage. I. General considerations: heterotic models. Genetics. 1964, 49: 49-67.
Cannon GB: The effects of heterozygosity and recombination on the relative fitness of experimental populations of Drosophila melanogaster. Genetics. 1963, 48: 919-942.
We are grateful to the Associate Editor and two anonymous reviewers for their constructive comments and suggestions that significantly improved the presentation of the manuscript. The research was supported in part by 973 program (2006CB101708), the National Natural Science Foundation of China (No.30470998; No.30671333), NCET (NCET-05-0489), Specialized Research Fund for the Doctoral Program of Higher Education (20060307008), the Talent Foundation of Nanjing Agricultural University to YMZ; China (No.2005038246) and Jiangsu province (No.0502012C) Postdoctoral Science Foundation to CSZ; and the Program for Changjiang Scholars and Innovative Research Team in University, the Ministry of Education (IRT0432).
CZ designed and carried out the simulation study, and drafted the manuscript. YMZ conceived of the study, participated in the design, coordinated it and revised the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Zhu, C., Zhang, Y. An EM algorithm for mapping segregation distortion loci. BMC Genet 8, 82 (2007) doi:10.1186/1471-2156-8-82
- Quantitative Trait Locus
- Linkage Group
- Quantitative Trait Locus Mapping
- Segregation Distortion
- Fisher Information Matrix