Generalized linear mixed model for segregation distortion analysis
© Zhan and Xu; licensee BioMed Central Ltd. 2011
Received: 23 September 2011
Accepted: 11 November 2011
Published: 11 November 2011
Skip to main content
© Zhan and Xu; licensee BioMed Central Ltd. 2011
Received: 23 September 2011
Accepted: 11 November 2011
Published: 11 November 2011
Segregation distortion is a phenomenon that the observed genotypic frequencies of a locus fall outside the expected Mendelian segregation ratio. The main cause of segregation distortion is viability selection on linked marker loci. These viability selection loci can be mapped using genome-wide marker information.
We developed a generalized linear mixed model (GLMM) under the liability model to jointly map all viability selection loci of the genome. Using a hierarchical generalized linear mixed model, we can handle the number of loci several times larger than the sample size. We used a dataset from an F2 mouse family derived from the cross of two inbred lines to test the model and detected a major segregation distortion locus contributing 75% of the variance of the underlying liability. Replicated simulation experiments confirm that the power of viability locus detection is high and the false positive rate is low.
Not only can the method be used to detect segregation distortion loci, but also used for mapping quantitative trait loci of disease traits using case only data in humans and selected populations in plants and animals.
Segregation distortion refers to a phenomenon that the observed genotypic frequencies deviate significantly from the expected Mendelian frequencies . Different populations have different Mendelian ratios, e.g., the typical Mendelian ratio for an F2 population is 1:2:1 for the three genotypes A 1 A 1: A 1 A 2: A 2 A 2. Many reasons can explain the observed distortion [2–7]. The most promising explanation is viability selection on the distorted markers or loci linked to the markers . In genetic mapping for quantitative traits, the basic assumption is Mendelian segregation . Therefore, distorted markers are usually discarded prior to QTL mapping because people usually fear unexpected consequences of distorted markers on the results. In a recent study , we found that segregation distortion is not necessarily harmful to QTL mapping; rather, it can help in some circumstances. Consequently, we can incorporate segregation distortion into existing QTL mapping programs .
It appears that segregation distortion is common rather than rare. If segregation distortion is indeed caused by viability selection loci, these loci themselves are of interest because they may help to understand the mechanism of natural selection and evolution. Chi-square tests are commonly used to test segregation distortion. Fu and Ritland  and Lorieux et al.  developed maximum likelihood methods to map segregation distortion loci. The methods are interval mapping approaches in which one distortion locus is tested at a time. Vogl and Xu  used an MCMC implemented Bayesian algorithm to detect multiple segregation loci simultaneously. These methods are quite different from the usual QTL mapping procedures in quantitative trait genetic mapping. Luo and Xu  first developed an expectation and maximization (EM) algorithm for mapping viability selection loci. This method takes advantage of the well known EM algorithm in interval mapping. Recently, Luo et al.  developed a quantitative genetic model to map viability loci. The authors postulated a hidden underlying liability for each individual. The liability is an unobserved quantitative trait and natural selection acts on the liability. The method of Luo et al.  actually maps loci controlling the hidden liability (a quantitative trait). Therefore, methods of QTL mapping and viability locus mapping have been unified into the same framework of interval mapping. Both methods are called QTL mapping, but the traits mapped are different, the former maps observed quantitative traits and the latter maps unobserved liability.
The quantitative genetic model of Luo et al.  is an interval mapping approach. The state-of-the-art QTL mapping procedure is the Bayesian shrinkage method [17–19] because it simultaneously evaluates the entire genome. It is natural to extend the Bayeisan shrinkage method to map multiple viability loci. The Markov chain Monte Carlo (MCMC) algorithm is commonly used to implement the Bayesian method. Such a sampling based method is time consuming. A fast version of the Bayesian method is the empirical Bayesian method  where the variance components in the prior distributions of QTL effects are first estimated from the data and then used as the priors to estimate the QTL effects under the general Bayesian framework. This method is essentially the linear mixed model approach. When applied to discrete traits, the method is called the generalized linear mixed model [21, 22].
Numerous algorithms have been developed to implement the generalized linear mixed model. The pseudo likelihood algorithm [23–25] appears to be the most popular one. The method requires a normal transformation of the original data point using the first step Newton-Raphson update. Once the data points are normally transformed, they are treated as normal quantitative phenotypes. The usual linear mixed model applies to the transformed data points. The difference between the Newton-Raphson transformation and the data transformation commonly seen in data analysis is that the Newton-Raphson transformation is a function of the data point and parameters while the usual data transformation is a function of the data point only. Therefore, the Newton-Raphson transformation is required for each cycle of the iteration process.
It is not clear how to use the pseudo likelihood approach to mapping viability loci because there is no phenotypic data point to transform. However, the method of McGilchrist  for generalized linear mixed model can be applied here. This method only requires a linear predictor, a likelihood and a prior distribution for each effect in the linear predictor. In this study, we used the McGilchrist's  method to perform parameter estimation.
is the expected Mendelian ratio. In an F2 population, the expected Mendelian ratio is . Note that if γ k = 0, vector π j = [π j(11) π j(12) π j(22)] will be equivalent to the expected Mendelian ratio for every individual at the locus.
is the linear predictor excluding the gender effect.
Proof of this equation (16) is straightforward and thus given in the next paragraph.
This concludes the derivation of equation (16) presented in the previous paragraph.
where a constant has been ignored.
Prior distribution for the non-genetic effect is assumed to be uniform (uninformative prior) and thus only the likelihood is needed to find the posterior mode estimate of β.
Due to the possible large number of parameters, we take a sequential approach to estimating the posterior mode parameters with one locus at a time. This approach is also called the coordinate descent algorithm. Once the parameters of all loci are updated, the sequence is repeated until a certain criterion of convergence is reached.
where τ + 1 is the degree of freedom for the inverse Wishart posterior and the number 2 represents the dimension of vector γ k .
The iteration process of the posterior mode estimation is summarized as follows.
Step 0: Initialize all parameters.
Step 1: Update the non-genetic effect using equation (29).
Step 2: Update effect of marker k for k = 1, ⋯, p using equation (26).
Step 3: Update ∑ k for k = 1, ⋯, p using equation (28).
Step 4: Repeat step 1 to step 3 until the iteration process converges.
The liability model has unified QTL mapping and SDL mapping in the same framework of quantitative genetics.
We used a published dataset of an F2 mouse experiment to demonstrate the application of the method. The dataset was published by Lan et al.  and is freely available from the internet. The mouse genome has 19 chromosomes (excluding the sex chromosome). The data contains 110 F2 ob/ob mice derived from the cross of two inbred lines (BT×BTBR) and 193 markers covering 1,800 cM of the entire mouse genome. The average marker distance was 9.35 cM per marker interval. We inserted one or more pseudo markers in intervals larger than 5 cM to make sure that the entire genome is evenly covered by (pseudo or true) markers with no intervals larger than 5 cM. The number of pseudo markers inserted was 273, resulting in a total of 466 markers (193 true and 273 pseudo markers). For the pseudo markers, the genotype indicator variable, w j = [w j(11) W j(12) w j(22)], is missing for every individual. In the data analysis, the missing variable was replaced by the conditional probability calculated using the multipoint method .
respectively, for A 1 A 1, A 1 A 2 and A 2 A 2.
Estimated parameters of the QTL identified by GLMM compared to true values in the simulation.
Average estimates of effects and powers of simulated QTL and co-factors from 100 replicated simulations.
Genome-wide segregation distortion is a common phenomenon in genetic mapping, but it is usually ignored. The main reason is the difficulty in joint estimation and tests of the segregation distortion loci. We formulated the problem as a typical quantitative genetics problem using a hypothetical liability to describe the fitness of each individual. Using a generalized linear mixed model, we were able to estimate and test genome-wide quantitative trait loci controlling the hidden liability. We used a mouse dataset to demonstrate the method and detected a major QTL for the liability that explains 93% of the liability variance. The simulated data experiment showed that the method can detect a QTL (e.g., the second QTL simulated) explaining 7.71% of the liability variation with 71% power. The method was implemented in a SAS/IML program. The code is posted on our website (http://www.statgen.ucr.edu) for general application. With this method and the program, genome-wide segregation distortion can be investigated routinely in future genetic data analysis.
As a Bayesian method, there are a rich array of prior distributions can be explored. In this study, we used the inverse Wishart as the prior distribution for the prior variance matrix of QTL effects. For the additive genetic model (one effect per locus), the inverse Wishart distribution becomes a scaled inverse Chi-square distribution. It is possible to use the exponential distribution (the Lasso prior) as an alternative prior . Because the method uses multiple levels of prior choice, the model can also be called hierarchical generalized linear mixed model [24, 33]. This study opens a new area in statistical genetics and further studies are expected to arise. For example, how to use the adaptive Lasso  to address this problem is entirely unknown and can be explored in the future.
A caveat of this method is the requirement of Mendelian segregation ratio (before the selection). For populations generated through line crossing experiments, Mendelian ratios are known. However, for uncontrolled populations, the theoretical Mendelian frequencies are not available. In this case, one needs to survey the unselected population to obtain the genotypic frequencies as the controlled "Mendelian segregation". If one can genotype both the selected and unselected individuals, one may simply use the case-control study and there is little reason to use this case-only study approach. In reality, genotyping individuals is much more costly than pooling the DNA of a sample of individuals. The cost effective approach is to genotype each individual in the surviving sample and genotype the pooled DNA sample for the unselected population because we only need the frequencies of genotypes (not the genotypes of individuals) in the unselected population. For the co-factors, we also need the expected frequencies of the co-factors in the unselected population. We examined the sex effect (discrete co-factor) and a normally distributed co-factor. The expected 1:1 sex ratio was used as the expected frequency. For the normal co-factor, we used the mean and variance of the co-factor used in the simulation (the true values) to construct the expected distribution. In reality, one needs to survey the entire population to obtain the expected distribution. For continuous variables deviating from normality, one may discretize a variable to a few groups. For example, age is a quantitative variable but one can arbitrarily divide individuals into a few age groups. This discretization will eliminate the restriction of normal distribution.
The method developed here can be applied to more broad situations beyond genetics without much modification. For example, if we know the joint distribution of k variables in a base (unselected) population and the joint distribution of the variables in a selected sample. We can simply test the difference between the two distributions to see which variables influence more on the selection. However, the pair-wise covariance may not allow us to make a precise decision on the importance of each variable. If two variables both influence the selection and they are highly correlated, the influence of one variable may be simply caused by the high correlation with the true causal variable. The proposed method here can help separate the true causality from the influence due to correlation.
QTL mapping is usually conducted in unselected populations. Individuals with undesired phenotypes must also be evaluated to obtain unbiased estimates of QTL effects. This is not a cost effective approach in breeding companies. Breeders wish to use only selected individuals to breed and keep no records for the unselected individuals. If we only evaluate the selected individuals, markers associated with the traits of interest will show distorted segregation. If the selection criterion is not well defined, for example, drought resistance, it is hard to map QTL. The segregation distortion loci are actually the QTL for drought resistance if one knows that there is no segregation distortion in the unselected population. The method developed here can be directly applied to mapping drought resistance QTL. Because we can perform QTL mapping using selected population, this approach may be called "mapping while selecting". For example, breeders may want to evaluate drought resistance of a family of recombinant inbred lines (RIL) by planting all seeds in a harsh drought environment. Eventually all plants die except the ones with strong resistance of drought. Breeders may have no records of the plants eliminated, but they can still perform QTL mapping for this trait (drought resistance) using all plants that have survived the selection. Other stress related traits can also be mapped using this approach, e.g., pest and salinity resistances.
In human genetics, case-control study is a common approach for mapping disease loci. In situations where there are no records for the control but the case, this case-only study may benefit from the new method. For example, one may easily get patient data from hospitals but hardly has individual records for the entire population. QTL mapping for the disease trait is still possible if we have the population records (frequencies) of genotypes in the entire population.
In summary, we developed a hierarchical generalized linear mixed model to map QTL for liability. This is a new approach to genetic mapping. It incorporates a seemingly different problem (segregation distortion) into the same QTL mapping framework for quantitative traits. Statistically, it shows that the generalized linear mixed model can be applied to situations where there are no phenotypic records; one only needs a likelihood function, a linear predictor and a prior distribution to infer the posterior mode estimation of the model effects.
We greatly appreciate two anonymous reviewers and the associated editor for their comments on an early version of the manuscript and their suggestions in revision of the manuscript. The project was supported by the USDA National Institute of Food and Agriculture Grant 2007-02784 to SX.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.