Generalized linear mixed model for segregation distortion analysis

Background Segregation distortion is a phenomenon that the observed genotypic frequencies of a locus fall outside the expected Mendelian segregation ratio. The main cause of segregation distortion is viability selection on linked marker loci. These viability selection loci can be mapped using genome-wide marker information. Results We developed a generalized linear mixed model (GLMM) under the liability model to jointly map all viability selection loci of the genome. Using a hierarchical generalized linear mixed model, we can handle the number of loci several times larger than the sample size. We used a dataset from an F2 mouse family derived from the cross of two inbred lines to test the model and detected a major segregation distortion locus contributing 75% of the variance of the underlying liability. Replicated simulation experiments confirm that the power of viability locus detection is high and the false positive rate is low. Conclusions Not only can the method be used to detect segregation distortion loci, but also used for mapping quantitative trait loci of disease traits using case only data in humans and selected populations in plants and animals.


Background
Segregation distortion refers to a phenomenon that the observed genotypic frequencies deviate significantly from the expected Mendelian frequencies [1]. Different populations have different Mendelian ratios, e.g., the typical Mendelian ratio for an F 2 population is 1:2:1 for the three genotypes A 1 A 1 : A 1 A 2 : A 2 A 2 . Many reasons can explain the observed distortion [2][3][4][5][6][7]. The most promising explanation is viability selection on the distorted markers or loci linked to the markers [8]. In genetic mapping for quantitative traits, the basic assumption is Mendelian segregation [9]. Therefore, distorted markers are usually discarded prior to QTL mapping because people usually fear unexpected consequences of distorted markers on the results. In a recent study [10], we found that segregation distortion is not necessarily harmful to QTL mapping; rather, it can help in some circumstances. Consequently, we can incorporate segregation distortion into existing QTL mapping programs [11].
It appears that segregation distortion is common rather than rare. If segregation distortion is indeed caused by viability selection loci, these loci themselves are of interest because they may help to understand the mechanism of natural selection and evolution. Chisquare tests are commonly used to test segregation distortion. Fu and Ritland [12] and Lorieux et al. [13] developed maximum likelihood methods to map segregation distortion loci. The methods are interval mapping approaches in which one distortion locus is tested at a time. Vogl and Xu [14] used an MCMC implemented Bayesian algorithm to detect multiple segregation loci simultaneously. These methods are quite different from the usual QTL mapping procedures in quantitative trait genetic mapping. Luo and Xu [15] first developed an expectation and maximization (EM) algorithm for mapping viability selection loci. This method takes advantage of the well known EM algorithm in interval mapping. Recently, Luo et al. [16] developed a quantitative genetic model to map viability loci. The authors postulated a hidden underlying liability for each individual. The liability is an unobserved quantitative trait and natural selection acts on the liability. The method of Luo et al. [16] actually maps loci controlling the hidden liability (a quantitative trait). Therefore, methods of QTL mapping and viability locus mapping have been unified into the same framework of interval mapping. Both methods are called QTL mapping, but the traits mapped are different, the former maps observed quantitative traits and the latter maps unobserved liability.
The quantitative genetic model of Luo et al. [16] is an interval mapping approach. The state-of-the-art QTL mapping procedure is the Bayesian shrinkage method [17][18][19] because it simultaneously evaluates the entire genome. It is natural to extend the Bayeisan shrinkage method to map multiple viability loci. The Markov chain Monte Carlo (MCMC) algorithm is commonly used to implement the Bayesian method. Such a sampling based method is time consuming. A fast version of the Bayesian method is the empirical Bayesian method [20] where the variance components in the prior distributions of QTL effects are first estimated from the data and then used as the priors to estimate the QTL effects under the general Bayesian framework. This method is essentially the linear mixed model approach. When applied to discrete traits, the method is called the generalized linear mixed model [21,22].
Numerous algorithms have been developed to implement the generalized linear mixed model. The pseudo likelihood algorithm [23][24][25] appears to be the most popular one. The method requires a normal transformation of the original data point using the first step Newton-Raphson update. Once the data points are normally transformed, they are treated as normal quantitative phenotypes. The usual linear mixed model applies to the transformed data points. The difference between the Newton-Raphson transformation and the data transformation commonly seen in data analysis is that the Newton-Raphson transformation is a function of the data point and parameters while the usual data transformation is a function of the data point only. Therefore, the Newton-Raphson transformation is required for each cycle of the iteration process.
It is not clear how to use the pseudo likelihood approach to mapping viability loci because there is no phenotypic data point to transform. However, the method of McGilchrist [26] for generalized linear mixed model can be applied here. This method only requires a linear predictor, a likelihood and a prior distribution for each effect in the linear predictor. In this study, we used the McGilchrist's [26] method to perform parameter estimation.

Liability model and viability selection
Let us define a continuous variable y j as the liability for individual j, where ε j~N (0,1) is a residual error with a standardized normal distribution. Other model effects are defined as follows. There may be some effects not related to genetics, such as age, location and other systematic effects, and these effects are captured by b and the design matrix X. There are p genetic loci each with an effect g k for k = 1, ..., p. The value of Z jk is determined by the genotype of individual j at locus k. For example, an F 2 individual derived from the cross of two inbred lines can take one of three genotypes, Under the additive genetic model, Z jk is defined as and g k = a k is the additive genetic effect for locus k.
Under the dominance effect model, the genetic effect for locus k is a 2 × 1 vector g k = [a k d k ] T , where d k is called the dominance effect. The corresponding Z variable is also a vector and defined as where H i is the i-th row of matrix H, as shown below, The liability y j is not observed but it determines the viability of individual j. It is assumed that individual j will survive if y j > 0 and die otherwise. Since we can only observe the surviving individuals, all individuals in the sample have liabilities greater than zero. This will cause the selected population to deviate from the expected Mendelian segregation ratio for loci responsible for viability selection and all loci linked to the viability loci. Although all individuals have survived, some may have a high liability and some may have a low liability, but all have a liability greater than zero. We now use the concept of penetrance to describe the survivability of an individual. Let be the expectation of the unobserved liability (a linear predictor). We use the normal or the logistic function to model the probability of survival for individual j, i.e., Φ(h j ) or logistic(h j ) = exp(h j )/[1 + exp(h j )]. Conditional on the genotypes of all other loci, the penetrances for the three genotypes of locus k are defined as for for for where is the linear predictor excluding locus k. This model was first introduced by Luo et al. (2005) for single locus analysis, which does not include h j(-k) in equation (6). The data that allow us to estimate g k is the genotype array for all individuals at locus k. Define w j = w j (11) w j (12) w j (22) as a multivariate Bernoulli variable with three categories (i.e., a multinomial variable with sample size one). If individual j has a genotype A 1 A 1 , then w j(11) = 1 and w j(12) = w j(22) = 0. The probabilities of individual j taking the three genotypes are derived from the Bayes' theorem, wherē π j = φ 11 (H 1 γ k + η j(−k) ) is the mean of the three penetrances and is the expected Mendelian ratio. In an F 2 population, the expected Mendelian ratio is φ = 1 4 2 4 1 4 . Note that if g k = 0, vector π j = [π j(11) π j(12) π j (22) ] will be equivalent to the expected Mendelian ratio for every individual at the locus.
If there is no factor to be considered other than the markers, the term X j b should disappear here. This is different from the usual linear regression analysis where an intercept should always appear in the model. With the liability selection model, there is no intercept. We now assume only one co-factor to consider. The X j variable can be discrete or continuous, but the distribution in the unselected population must be known. In this study, we first assume that X j is discrete, say gender, a variable indicating the gender of individual j with X j = 1 representing male and X j = -1 representing female. In the unselected population, the sex ratio should be 1:1. If the population evaluated has a biased sex ratio, this means that the gender has an effect on the liability. We can estimate the gender effect b on the liability. Let ϕ = ϕ 1 ϕ 2 = 1 2 1 2 be the expected sex ratio (prior to the selection). Define ξ j(1) or ξ j(2) as the posterior probability that individual j is male or female, respectively. These posteriors are calculated using wherē is the mean penetrance of the two genders and is the linear predictor excluding the gender effect. We now assume that X j is a continuous non-genetic effect, e.g., age. Let us assume that X j follows a normal distribution in the unselected population, i.e., p(X j ) = N (X j |μ, s 2 ), where μ and s 2 are known. Let b be the effect of X j on the liability. Define Φ(X j b + h j(-b) ) as the probability that individual j has survived the selection. The posterior probability is defined as wherē Proof of this equation (16) is straightforward and thus given in the next paragraph.
Let f(X j ) = N(X j |μ, s 2 ) be the normal density for variable X j with known μ and s 2 . The following Lemma [27] is used to derive equation (16).
Let us rewrite equation (16) as Comparing equation (18) with equation (17), we can see that ξ = -h j(-b) /b and l 2 = 1/b 2 . Substituting these into equation (17), we get This concludes the derivation of equation (16) presented in the previous paragraph.

Likelihood, prior and posterior
It is difficult (if not impossible) to construct the joint likelihood for all loci, but conditional on the effects and the genotypes of other loci, the likelihood for locus k can be derived based on the multivariate Bernoulli distribution, that is L(γ k ) = n j=1 w j (11) ln(π j(11) ) + w j (12) ln(π j (12) ) + w j (22) ln(π j (22) ) (20) The exact notation for this log likelihood should be L (g k |h (-k) ) because it is conditioned on the gender effect and effects of other loci. We use the simplified notation to improve the readability. Let us assign a normal prior to g k , i.e., Furthermore, we assign a hierarchical prior to ∑ k , where τ is the prior degree of freedom and ω is the prior scale matrix with the same dimension as ∑ k . The reason for assigning these prior distributions is to handle a possible large number of loci involved in the model. Uniform prior for the gender effect is assumed. The log posterior (denoted by LogPost) is where a constant has been ignored.
For the sex effect (discrete co-factor), the likelihood For the continuous co-factor, the log likelihood for parameter b can be written as Prior distribution for the non-genetic effect is assumed to be uniform (uninformative prior) and thus only the likelihood is needed to find the posterior mode estimate of b.

Posterior mode estimation
Due to the possible large number of parameters, we take a sequential approach to estimating the posterior mode parameters with one locus at a time. This approach is also called the coordinate descent algorithm. Once the parameters of all loci are updated, the sequence is repeated until a certain criterion of convergence is reached.
Let us define the first step of the Newton-Raphson iteration as and denote the variance of this updated parameter by where the first and second partial derivatives are evaluated at γ k = γ (t) k . The posterior mean and posterior variance matrix for g k at iteration t are denoted by and var(g k ) = V k , respectively. Since the posterior distribution of g k is approximately multivariate normal (asymptotical theory), the posterior mean is identical to the posterior mode. The posterior of ∑ k remains scaled inverse Wishart due to the conjugate property of the prior. Therefore, the posterior mode of ∑ k is where τ + 1 is the degree of freedom for the inverse Wishart posterior and the number 2 represents the dimension of vector g k .
The posterior mode estimation of b conditional on the effects of all loci is ∂β (29) with an estimation error variance approximated by The iteration process of the posterior mode estimation is summarized as follows.
Step 4: Repeat step 1 to step 3 until the iteration process converges.

Genetic contribution from an individual locus
An obvious advantage of the liability model is that we are able to calculate the proportion of the liability variance contributed by each SDL, similar to the proportion of quantitative trait variance contributed by each QTL. Suppose that we have detected one SDL with both additive and dominance effects. The theoretical variances of the Z variables in an F 2 population are 0.5 for the additive part and 1.0 for the dominance part. The reason is that the three genotypes are coded as +1, 0 and -1 for the additive Z and -1, 1 and -1 for the dominance Z [28]. Let a k and d k be the additive and dominance effects of this SDL. The genetic variance explained by this locus is The residual variance of the liability is set at unity and thus the variance of the liability is The broad sense heritability is defined as This is the proportion of the liability variance contributed by the kth SDL. Assuming that the multiple SDL are not closely linked, the overall contribution from all SDL is approximated by The liability model has unified QTL mapping and SDL mapping in the same framework of quantitative genetics.

Mouse experiment
We used a published dataset of an F 2 mouse experiment to demonstrate the application of the method. The dataset was published by Lan et al. [29] and is freely available from the internet. The mouse genome has 19 chromosomes (excluding the sex chromosome). The data contains 110 F 2 ob/ob mice derived from the cross of two inbred lines (BT×BTBR) and 193 markers covering 1,800 cM of the entire mouse genome. The average marker distance was 9.35 cM per marker interval. We inserted one or more pseudo markers in intervals larger than 5 cM to make sure that the entire genome is evenly covered by (pseudo or true) markers with no intervals larger than 5 cM. The number of pseudo markers inserted was 273, resulting in a total of 466 markers (193 true and 273 pseudo markers). For the pseudo markers, the genotype indicator variable, w j = [w j(11) W j (12) w j (22) ], is missing for every individual. In the data analysis, the missing variable was replaced by the conditional probability calculated using the multipoint method [30].
The top panel of Figure 1 shows the frequencies of the three genotypes, A 1 A 1 , A 1 A 2 and A 2 A 2 , plotted against the mouse genome. It is obvious that there is a severe distortion in the beginning of chromosome 6 where the population contains almost exclusively the A 2 A 2 genotypes with A 1 A 1 and A 1 A 2 almost eliminated from the population. Chromosomes 14 and 18 also show mild segregation distortion. Interval mapping for segregation distortion using the QTL procedure in SAS [31] showed that the LOD score for chromosome 6 is 43.25 (see the bottom panel of Figure 1 for the LOD score profile obtained from the interval mapping analysis). The interval mapping procedure [31] is a separate analysis for each marker. With the interval mapping, the position with the highest LOD score (43.25) occurred at a pseudo marker (at position 15.69 cM) between the first true marker (D6Mit86, 0 cM) and the second true marker (D6Mit224, 30.4 cM) on chromosome 6. The estimated frequencies of this pseudo marker are 0.0000, 0.0001 and 0.9999 for the three genotypes (A 1 A 1 , A 1 A 2 and A 2 A 2 ), respectively.
We used the generalized linear mixed model to analyze all the 466 markers (193 true and 273 pseudo) jointly. In the mouse data, among the 110 mice, 52 were male and 58 were female. Apparently, the sex ratio is not biased and thus sex appears to have no effect on the survivorship. However, we included the sex factor as a fixed effect in the model to test the robustness of our model. We expected that our model would detect no sex effect on the survivorship. The generalized linear mixed model had 466 × 2 + 1 = 933 model effects, including 466 additive effects, 466 dominance effects Figure 1 Frequencies of the three genotypes and LOD score profiles of the mouse genome. This is an F2 population derived from the cross of two inbred lines (BT×BTBR). (a) The top panel shows the frequencies of three genotypes with the blue, red and green patterns representing the A 1 A 1 , A 1 A 2 and A 2 A 2 genotypes, respectively. (b) The bottom panel shows the LOD score profiles for the mouse genome obtained from the interval mapping of segregation distortion. The profile in red (LOD SDL) represents the LOD score for segregation distortion. The curves in blue and black are the LOD scores for QTL of the 10 week body weight and joint testing of the QTL and segregation distortion. and one sex effect. This GLMM with 110 individuals was indeed able to handle such a large model (933 model effects). The hyper parameters used in the analysis was (τ, ω) = (0,0), equivalent to the Jeffrey's prior for the variance components. The estimated additive and dominance effects along with the corresponding LOD scores are depicted in Figure 2. One segregation distortion locus was detected on chromosome 6 (same as that of the interval mapping). The location of this distortion locus is right at the first marker of chromosome 6 (D6Mit86, 0 cM). The interval mapping approach described in the previous paragraph also detected a segregation distortion locus. However, the SDL detected by interval mapping was located halfway (15.69 cM) between markers D6Mit86 (0 cM) and D6Mit224 (30.4 cM) (see Figure 1 for the result of interval mapping). The GLMM analysis also showed some distortion for the second marker (D6Mit224, 30.4 cM), but the LOD score is only 3, barely significant. Therefore, we can safely ignore this locus due to linkage with the first marker. Let us go back to the first marker D6Mit86, the major SDL detected by the GLMM method. This segregation distortion locus is caused by both the additive and dominance effects. The estimated additive effect (± standard error) isâ = 4.6230 ± 0.4248 while the estimated dominance effect (± standard error) iŝ d = −1.6656 ± 0.1833 . The LOD scores are 25.69 and 17.92, respectively, for the additive and dominance effects. Simulation experiment under the null hypothesis (Mendelian segregation) showed that the 95% value of the null distribution of the LOD scores is 3.8, much smaller than the actual LOD score of 25.69. Therefore, we are very confident for this detected segregation distortion locus. As expected, the estimated sex effect iŝ β = 0.1969 ± 0.3002 with a LOD score of 0.0934, smaller than 1.0255, the 95% value of the LOD score generated under the null model. Therefore, we can safely claim that the gender effect is insignificant.
In the GLMM analysis, the QTL effect has been interpreted as an effect on a hypothetical liability. The total variance of the liability is (see the Method section) The expected frequencies for the three genotypes are respectively, for A 1 A 1 , A 1 A 2 and A 2 A 2 .

Simulation experiment
We simulated a single chromosome with 2400 cM in length covered by 481 markers evenly placed on the genome with 5 cM per marker interval. The additive QTL effects of six markers were simulated with the true positions and true effects as presented in Figure 3 (bottom panel). Dominance effects were not simulated (zero values) although they were included in the data analysis. Frequencies of the three genotypes of a simulated F 2 family with 500 individuals are also presented in Figure  3 (top panel). We also simulated two co-factors that influence the liability. The first co-factor was the sex effect coded as 1 for male and -1 for female with an effect value of b 1 = 1.0. The second co-factor was a continuous variable with μ = 0 and s 2 = 0.025. The effect of this co-factor on the liability was b 2 = 1.0. The liability of each individual was generated using the linear model containing the two cofactors and the six QTL. An individual with a liability greater than 0 survived the selection, otherwise, it was eliminated. All the 500 individuals in the sample survived the selection. The simulated data were analyzed using the generalized linear mixed model with (τ, ω) = (0,0) as the hyper-parameter values. The estimated additive effects and the LOD scores are given in Figure 4. The estimated dominance effects and LOD scores were all near zero and thus not presented in the figure. Critical value of the LOD score generated from the null model was 2.99, which is smaller than the LOD score of each identified QTL. Therefore, all the six QTL have been identified by the method with no false positive identification. Figure 5 gives the estimated QTL effects and LOD scores for a dataset simulated under the null model. We can see that both the effects and the LOD scores are extremely small. The estimated QTL effects from simulation experiment (not the null model) are also presented in Table 1 along with the true values. Except QTL 5, all other QTL have been identified at the positions where they were simulated. QTL 5 was missed at the simulated position (1750 cM) but the effect was picked up at position 1735 cM, 15 cM away from the true position. The six QTL plus the two co-factors contributed 84.55% of the total variation of the liability and the estimated proportion was 82.74%, very close to the true proportion. The simulated data analysis demonstrates that the generalized linear mixed model successfully identified the simulated QTL and the two cofactors.
This paragraph describes the result of 100 repeated simulations generated from the same set of parameters. This experiment allowed us to evaluate the power and false positive rate of QTL identification. The critical value for the LOD score was 2.99, which was generated  the replicated simulation experiments are given in Table  2. The average estimated effects (QTL and co-factors) are consistently smaller than the true values due to the shrinkage nature of the estimation. The biases, however, are not too strong to affect the powers because all effects have been detected with very high powers (ranging from 71% to 100%). For the entire 100 replications, we only detected five false positives (positive markers that are 15 cM away from a true effect). The overall false positive rate is 5/[100 ×(481-5 × 6) = 0.0001111, extremely low. The number 481 in the denominator is the total number of markers, the

Discussion and conclusions
Genome-wide segregation distortion is a common phenomenon in genetic mapping, but it is usually ignored. The main reason is the difficulty in joint estimation and tests of the segregation distortion loci. We formulated the problem as a typical quantitative genetics problem using a hypothetical liability to describe the fitness of each individual. Using a generalized linear mixed model, we were able to estimate and test genome-wide quantitative trait loci controlling the hidden liability. We used a mouse dataset to demonstrate the method and detected a major QTL for the liability that explains 93% of the liability variance. The simulated data experiment showed that the method can detect a QTL (e.g., the second QTL simulated) explaining 7.71% of the liability variation with 71% power. The method was implemented in a SAS/IML program. The code is posted on our website (http://www.statgen.ucr.edu) for general application. With this method and the program, genome-wide segregation distortion can be investigated routinely in future genetic data analysis.
As a Bayesian method, there are a rich array of prior distributions can be explored. In this study, we used the inverse Wishart as the prior distribution for the prior variance matrix of QTL effects. For the additive genetic model (one effect per locus), the inverse Wishart distribution becomes a scaled inverse Chi-square distribution. It is possible to use the exponential distribution (the Lasso prior) as an alternative prior [32]. Because the method uses multiple levels of prior choice, the model can also be called hierarchical generalized linear mixed model [24,33]. This study opens a new area in statistical genetics and further studies are expected to arise. For example, how to use the adaptive Lasso [34] to address this problem is entirely unknown and can be explored in the future.
A caveat of this method is the requirement of Mendelian segregation ratio (before the selection). For populations generated through line crossing experiments, Mendelian ratios are known. However, for uncontrolled populations, the theoretical Mendelian frequencies are not available. In this case, one needs to survey the unselected population to obtain the genotypic frequencies as the controlled "Mendelian segregation". If one can genotype both the selected and unselected individuals, one may simply use the case-control study and there is little reason to use this case-only study approach. In reality, genotyping individuals is much more costly than pooling the DNA of a sample of individuals. The cost effective approach is to genotype each individual in the surviving sample and genotype the pooled DNA sample for the unselected population because we only need the frequencies of genotypes (not the genotypes of individuals)  in the unselected population. For the co-factors, we also need the expected frequencies of the co-factors in the unselected population. We examined the sex effect (discrete co-factor) and a normally distributed co-factor. The expected 1:1 sex ratio was used as the expected frequency. For the normal co-factor, we used the mean and variance of the co-factor used in the simulation (the true values) to construct the expected distribution. In reality, one needs to survey the entire population to obtain the expected distribution. For continuous variables deviating from normality, one may discretize a variable to a few groups. For example, age is a quantitative variable but one can arbitrarily divide individuals into a few age groups. This discretization will eliminate the restriction of normal distribution. The method developed here can be applied to more broad situations beyond genetics without much modification. For example, if we know the joint distribution of k variables in a base (unselected) population and the joint distribution of the variables in a selected sample. We can simply test the difference between the two distributions to see which variables influence more on the selection. However, the pair-wise covariance may not allow us to make a precise decision on the importance of each variable. If two variables both influence the selection and they are highly correlated, the influence of one variable may be simply caused by the high correlation with the true causal variable. The proposed method here can help separate the true causality from the influence due to correlation.
QTL mapping is usually conducted in unselected populations. Individuals with undesired phenotypes must also be evaluated to obtain unbiased estimates of QTL effects. This is not a cost effective approach in breeding companies. Breeders wish to use only selected individuals to breed and keep no records for the unselected individuals. If we only evaluate the selected individuals, markers associated with the traits of interest will show distorted segregation. If the selection criterion is not well defined, for example, drought resistance, it is hard to map QTL. The segregation distortion loci are actually the QTL for drought resistance if one knows that there is no segregation distortion in the unselected population. The method developed here can be directly applied to mapping drought resistance QTL. Because we can perform QTL mapping using selected population, this approach may be called "mapping while selecting". For example, breeders may want to evaluate drought resistance of a family of recombinant inbred lines (RIL) by planting all seeds in a harsh drought environment. Eventually all plants die except the ones with strong resistance of drought. Breeders may have no records of the plants eliminated, but they can still perform QTL mapping for this trait (drought resistance) using all plants that have survived the selection. Other stress related traits can also be mapped using this approach, e. g., pest and salinity resistances.
In human genetics, case-control study is a common approach for mapping disease loci. In situations where there are no records for the control but the case, this case-only study may benefit from the new method. For example, one may easily get patient data from hospitals but hardly has individual records for the entire population. QTL mapping for the disease trait is still possible if we have the population records (frequencies) of genotypes in the entire population.
In summary, we developed a hierarchical generalized linear mixed model to map QTL for liability. This is a new approach to genetic mapping. It incorporates a seemingly different problem (segregation distortion) into the same QTL mapping framework for quantitative traits. Statistically, it shows that the generalized linear mixed model can be applied to situations where there are no phenotypic records; one only needs a likelihood function, a linear predictor and a prior distribution to infer the posterior mode estimation of the model effects.