A hierarchical statistical model for estimating population properties of quantitative genes
- Samuel S Wu^{1},
- Chang-Xing Ma^{1, 2},
- Rongling Wu^{1} (corresponding author) and
- George Casella^{1}
DOI: 10.1186/1471-2156-3-36
© 2002 Wu et al. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
Received: 15 March 2002
Accepted: 12 June 2002
Published: 12 June 2002
Abstract
Background
Earlier methods for detecting major genes responsible for a quantitative trait rely critically upon a well-structured pedigree in which the segregation pattern of genes exactly follows Mendelian inheritance laws. However, for many outcrossing species, such pedigrees are not available, and genes also display population properties.
Results
In this paper, a hierarchical statistical model is proposed to monitor the existence of a major gene based on its segregation and transmission across two successive generations. The model is implemented with an EM algorithm to provide maximum likelihood estimates for genetic parameters of the major locus. This new method is successfully applied to identify an additive gene having a large effect on stem height growth of aspen trees. The estimates of population genetic parameters for this major gene can be generalized to the original breeding population from which the parents were sampled. A simulation study is presented to evaluate finite sample properties of the model.
Conclusions
A hierarchical model was derived for detecting major genes affecting a quantitative trait based on progeny tests of outcrossing species. The new model takes into account the population genetic properties of genes and is expected to enhance the accuracy, precision and power of gene detection.
Background
The identification of individual genes governing phenotypic variation is crucial for understanding the mechanistic basis of quantitative inheritance and, ultimately, for designing optimal breeding strategies in both plants and animals. Quantitative genetics based on new statistical and computational technologies has advanced to the point at which individual genes can be detected and mapped on chromosomes [1, 2]. The detection and mapping of genes rest on their particular segregation pattern. For a pedigree derived from inbred lines, the segregation pattern of a gene can be exactly predicted. In this situation, quantitative genetic theory lays a sufficient foundation for gene detection and, thus, the properties of a detected gene can be adequately described by its effect and genomic location. However, for outcrossing populations, such as forest trees, in which inbred lines are not available, genes also display population genetic properties [3]. Hence, gene detection in these populations should build on a combination of quantitative genetics and population genetics. Estimating the population genetic properties of genes, such as allele frequencies and the degree of disequilibrium, not only can enhance the precision and power of gene detection but is also fundamentally important to our understanding of population differentiation and evolution [4, 5].
We present a strategy for investigating the existence of genes in an arbitrarily complex progeny test based on phenotype data, and for estimating the effects of these genes on a quantitative trait and their population properties under the maximum likelihood framework. Only a limited number of statistical studies have attempted to detect genes from phenotypic data [6–13], and none of them has considered the population and evolutionary genetic properties of genes. Our study population can be derived from a natural population rather than from inbred lines as used for general agronomic crops. A finite number of unrelated parents are randomly selected from a natural population and are mated to produce progeny tests under a mating design such as a factorial or diallel. In a recent paper, we used such progeny tests to derive a model for detecting major genes affecting a quantitative trait [14]. That model has the power to estimate the effect of drift errors on genetic variation during hybridization, thus providing information about dynamic changes in the genetic architecture of a population under artificial hybridization. However, the estimates of population genetic parameters obtained from the progeny test with that model cannot be generalized to the original population from which the mating design is derived. In this paper, we develop a hierarchical model for analyzing the pattern of gene segregation at both the population and family levels, which can therefore provide estimates of genetic parameters in two successive generations of the population. Estimates of population genetic parameters from this hierarchical model can reasonably be used to infer the genetic structure of the original population. A forest tree example is used to demonstrate the application of the new statistical model for gene detection.
Model
Population genetic structure
Suppose there is a segregating major gene responsible for a quantitative trait in a natural or experimental diploid population. This gene has s different codominant alleles, A _{1},...,A _{ s }, whose frequencies in the population are denoted by p _{1},...,p _{ s }, with ∑_{κ} p _{κ} = 1. These s alleles randomly unite to form s(s + 1)/2 distinguishable genotypes, including s homozygotes and s(s - 1)/2 heterozygotes. The population frequencies of the genotypes are denoted by p _{κι} (κ, ι = 1,...,s). If the population is at Hardy-Weinberg disequilibrium (HWD), the genotypic frequency is the product of the two corresponding allelic frequencies plus the coefficient of HWD (d _{κι}), i.e., p _{κι} = p _{κ} p _{ι} + d _{κι}, where d _{κι} has the restrictions max{-p _{κ} (1 - p _{ι}), -p _{ι} (1 - p _{κ})} ≤ d _{κι} ≤ p _{κ} p _{ι} and ∑_{υ≠κ} d _{κυ} - ∑_{ v≠κ} d _{υυ} ≥ - [15].
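As a concrete illustration for the diallelic case (s = 2), the sketch below converts an allele frequency and a disequilibrium coefficient into genotype frequencies. It assumes the common two-allele parameterization P11 = p1^2 + D, P12 = 2 p1 p2 - 2D, P22 = p2^2 + D, which may differ in convention from the general d _{κι} notation above. The bounds on D are the s = 2 case of the first restriction stated above. The function name is hypothetical.

```python
def genotype_frequencies(p1, d=0.0):
    """Genotype frequencies (P11, P12, P22) at a diallelic locus.

    Assumed parameterization (two-allele convention, not taken from the paper):
        P11 = p1^2 + d,  P12 = 2*p1*p2 - 2*d,  P22 = p2^2 + d.
    d = 0 corresponds to Hardy-Weinberg equilibrium.
    """
    if not 0.0 <= p1 <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p2 = 1.0 - p1
    # Bounds that keep all three genotype frequencies non-negative.
    lower, upper = -min(p1 * p1, p2 * p2), p1 * p2
    if not lower <= d <= upper:
        raise ValueError(f"d must lie in [{lower}, {upper}]")
    return p1 * p1 + d, 2.0 * p1 * p2 - 2.0 * d, p2 * p2 + d
```

With p1 = 0.5 and D = 1/12, for example, the three genotype frequencies all come out at approximately 1/3, whereas D = 0 gives the familiar 1/4, 1/2, 1/4.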
Mating design
Factorial mating design using female and male parents sampled from a natural population.
Rows are female genotypes and columns are male genotypes. Population frequencies of the parental genotypes and segregation ratios of the offspring genotypes are given in parentheses.
Female \ Male | A _{1} A _{1} (p _{1}^{2}) | A _{1} A _{2} (2p _{1} p _{2}) | ... | A _{ s } A _{ s } (p _{ s }^{2})
---|---|---|---|---
A _{1} A _{1} (p _{1}^{2}) | A _{1} A _{1} (1) | A _{1} A _{1} : A _{1} A _{2} (1/2 : 1/2) | ... | A _{1} A _{ s } (1)
A _{1} A _{2} (2p _{1} p _{2}) | A _{1} A _{1} : A _{1} A _{2} (1/2 : 1/2) | A _{1} A _{1} : A _{1} A _{2} : A _{2} A _{2} (1/4 : 1/2 : 1/4) | ... | A _{1} A _{ s } : A _{2} A _{ s } (1/2 : 1/2)
... | ... | ... | ... | ...
A _{κ} A _{ι} (2p _{κ} p _{ι}) | A _{κ} A _{1} : A _{ι} A _{1} (1/2 : 1/2) | A _{κ} A _{1} : A _{κ} A _{2} : A _{ι} A _{1} : A _{ι} A _{2} (1/4 : 1/4 : 1/4 : 1/4) | ... | A _{κ} A _{ s } : A _{ι} A _{ s } (1/2 : 1/2)
... | ... | ... | ... | ...
A _{ s } A _{ s } (p _{ s }^{2}) | A _{1} A _{ s } (1) | A _{1} A _{ s } : A _{2} A _{ s } (1/2 : 1/2) | ... | A _{ s } A _{ s } (1)
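The segregation ratios in the table follow from each parent transmitting one of its two alleles with probability 1/2. A minimal sketch of that rule for any number of alleles is given below; representing a genotype as a pair of allele indices, and the function name, are choices made here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_distribution(mother, father):
    """Mendelian segregation probabilities for the offspring of two parents.

    Each parent is a pair of allele indices, e.g. (1, 2) for A1A2.
    Returns {offspring genotype (sorted pair): probability}.
    """
    # Every combination of one maternal and one paternal allele has probability 1/4.
    counts = Counter(tuple(sorted(pair)) for pair in product(mother, father))
    return {genotype: Fraction(n, 4) for genotype, n in counts.items()}
```

For instance, `offspring_distribution((1, 2), (1, 2))` gives A1A1, A1A2 and A2A2 with probabilities 1/4, 1/2 and 1/4, matching the corresponding cell of the table.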
Estimation method
Let y _{ ijk } denote the phenotypic measurement for the kth offspring of the ith female parent and the jth male parent, and let g _{ ijk } denote its genotype at the major gene (i = 1, 2, ..., I; j = 1, 2, ..., J; k = 1, 2, ..., n _{ ij }). The statistical model describing the phenotype-genotype relationship can be expressed as

$$y_{ijk} = \mu_{\kappa\iota} + e_{ijk} \quad \text{for } g_{ijk} = A_{\kappa} A_{\iota},$$

where μ_{κι} is the genotypic value of genotype A _{κ} A _{ι} and e _{ ijk } is the error term with the normal distribution N(0, σ^{2}).
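Let g _{ f_i } be the genotype of the ith female parent and g _{ m_j } that of the jth male parent. We denote by y the collection of the observed data and by z = (g, g _{ f }, g _{ m }) the collection of the major-locus genotypes for offspring, female parents and male parents. We assume the following hierarchical model:

$$\begin{aligned}
y_{ijk} \mid g_{ijk} = A_{\kappa} A_{\iota} &\sim f(y_{ijk} \mid g_{ijk}) = N(\mu_{\kappa\iota}, \sigma^{2}), \\
g_{ijk} \mid g_{f_i}, g_{m_j} &\sim h(g_{ijk} \mid g_{f_i}, g_{m_j}), \\
g_{f_i},\; g_{m_j} &\sim \psi(\cdot \mid p),
\end{aligned} \tag{1}$$

where f denotes the normal distribution of the phenotype N(μ_{κι}, σ^{2}), h is distributed according to the Mendelian segregation pattern and ψ is multinomially distributed with probability vector p. Note that the genotypes of siblings are dependent and that, given the genotypes, the traits are independent of each other. Under the above model, (y, z) has the following joint distribution:

$$f(y, z \mid \theta) = \prod_{i=1}^{I} \psi(g_{f_i}) \prod_{j=1}^{J} \psi(g_{m_j}) \prod_{i=1}^{I} \prod_{j=1}^{J} \prod_{k=1}^{n_{ij}} h(g_{ijk} \mid g_{f_i}, g_{m_j})\, f(y_{ijk} \mid g_{ijk}). \tag{2}$$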
Statistically, equation (2) is a mixture model in which each component is represented by a genotype within a full-sib family [16–18]. The maximum likelihood estimates for θ = (μ, σ^{2}, p) can be obtained using the EM algorithm [19, 20]. At the t ^{ th } step of the EM sequence, the expected log-likelihood is proportional to
where Σ_{ G } sums over all possible genotypes, and the weights are the posterior distributions of the offspring genotypes and of the parental genotypes (indexed by i for female and by j for male parents). Conditional on the current parameter estimate θ^{(t)}, these posterior probabilities can be evaluated as follows:
The EM sequence is defined by
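For orientation, one standard way to write an EM iteration for a model of this form is sketched below. The notation (τ, π, φ, and g_{f_i}, g_{m_j} for individual parental genotypes) is introduced here for illustration, and ψ is assumed to be a multinomial over the genotype frequencies p_{κι}; the authors' own displays may be arranged differently.

```latex
% Posterior weights given the data and the current estimate \theta^{(t)}:
\tau_{ijk}^{\kappa\iota} = P\bigl(g_{ijk} = A_\kappa A_\iota \mid y, \theta^{(t)}\bigr), \qquad
\pi_{f_i}^{\kappa\iota} = P\bigl(g_{f_i} = A_\kappa A_\iota \mid y, \theta^{(t)}\bigr), \qquad
\pi_{m_j}^{\kappa\iota} = P\bigl(g_{m_j} = A_\kappa A_\iota \mid y, \theta^{(t)}\bigr).

% They follow from the posterior of the parental configuration G = (g_f, g_m):
P\bigl(G \mid y, \theta^{(t)}\bigr) \propto
  \prod_i \psi(g_{f_i}) \prod_j \psi(g_{m_j})
  \prod_{i,j,k} \sum_{g} h(g \mid g_{f_i}, g_{m_j})\, \phi\bigl(y_{ijk}; \mu_g, \sigma^2\bigr).

% Expected complete-data log-likelihood (terms free of \theta collected in "const"):
Q\bigl(\theta \mid \theta^{(t)}\bigr) =
  \sum_{i,j,k} \sum_{\kappa \le \iota} \tau_{ijk}^{\kappa\iota}
     \log \phi\bigl(y_{ijk}; \mu_{\kappa\iota}, \sigma^2\bigr)
  + \sum_{\kappa \le \iota} \Bigl( \sum_i \pi_{f_i}^{\kappa\iota} + \sum_j \pi_{m_j}^{\kappa\iota} \Bigr)
     \log p_{\kappa\iota} + \text{const}.

% Maximizing Q gives the updates (N = \sum_{i,j} n_{ij}):
\mu_{\kappa\iota}^{(t+1)} =
  \frac{\sum_{i,j,k} \tau_{ijk}^{\kappa\iota}\, y_{ijk}}{\sum_{i,j,k} \tau_{ijk}^{\kappa\iota}}, \qquad
\sigma^{2\,(t+1)} =
  \frac{1}{N} \sum_{i,j,k} \sum_{\kappa \le \iota}
  \tau_{ijk}^{\kappa\iota} \bigl(y_{ijk} - \mu_{\kappa\iota}^{(t+1)}\bigr)^2, \qquad
p_{\kappa\iota}^{(t+1)} =
  \frac{\sum_i \pi_{f_i}^{\kappa\iota} + \sum_j \pi_{m_j}^{\kappa\iota}}{I + J}.
```

The mean and variance updates are the usual weighted estimates for a normal mixture, and the genotype frequencies are updated as the average posterior genotype probability over the I + J parents.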
If the major gene has two codominant alleles, the posterior probabilities in equation (4) can be calculated exactly for a usual mating design. However, for large designs with many parents, or when the major gene has many alleles, we have to rely on a Monte Carlo version of the EM algorithm. For example, we can sample from the conditional distribution of z = (g, g _{ f }, g _{ m }) given y using the following Gibbs sampler [20]:
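A minimal sketch of such a sampler for the two-allele case is given below. It is not necessarily the authors' scheme: it is a collapsed variant in which each parental genotype is updated given the other parents with the offspring genotypes summed out, and the offspring genotypes are then drawn given the parents. Genotypes are coded 0, 1, 2 by their number of A2 alleles; `geno_freq` stands for the probability vector of ψ; all function and variable names are hypothetical.

```python
import numpy as np

# Genotype codes: 0 = A1A1, 1 = A1A2, 2 = A2A2 (number of A2 alleles carried).
T = np.array([0.0, 0.5, 1.0])  # probability that each genotype transmits A2
# SEG[a, b, g] = P(offspring genotype g | mother a, father b), Mendelian segregation.
SEG = np.stack([np.outer(1 - T, 1 - T),
                np.outer(T, 1 - T) + np.outer(1 - T, T),
                np.outer(T, T)], axis=-1)
TINY = 1e-300  # floor that keeps logarithms and normalizations finite


def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def family_loglik(y, mu, sigma):
    """log p(y | mother = a, father = b) for one family, as a 3 x 3 array."""
    dens = normal_pdf(np.asarray(y)[:, None], np.asarray(mu)[None, :], sigma)  # (n, 3)
    mix = np.einsum("abg,ng->abn", SEG, dens)  # offspring genotypes summed out
    return np.log(mix + TINY).sum(axis=-1)


def draw(logp, rng):
    """Sample an index from unnormalized log-probabilities."""
    p = np.exp(logp - logp.max())
    return rng.choice(len(p), p=p / p.sum())


def sample_offspring(y, a, b, mu, sigma, rng):
    """Draw offspring genotypes given parental genotypes a, b and phenotypes y."""
    y = np.asarray(y)
    w = SEG[a, b][None, :] * (normal_pdf(y[:, None], np.asarray(mu)[None, :], sigma) + TINY)
    cum = np.cumsum(w / w.sum(axis=1, keepdims=True), axis=1)
    return (rng.random(len(y))[:, None] > cum[:, :2]).sum(axis=1)


def gibbs(y, mu, sigma, geno_freq, n_sweeps, rng):
    """y[i][j] holds the phenotypes of the family of female i and male j."""
    n_f, n_m = len(y), len(y[0])
    loglik = np.array([[family_loglik(y[i][j], mu, sigma) for j in range(n_m)]
                       for i in range(n_f)])  # shape (n_f, n_m, 3, 3)
    log_prior = np.log(np.asarray(geno_freq) + TINY)
    gf, gm = rng.integers(0, 3, n_f), rng.integers(0, 3, n_m)  # random start
    parents = np.empty((n_sweeps, n_f + n_m), dtype=int)
    for s in range(n_sweeps):
        for i in range(n_f):  # female i given all male genotypes
            gf[i] = draw(log_prior + sum(loglik[i, j][:, gm[j]] for j in range(n_m)), rng)
        for j in range(n_m):  # male j given all female genotypes
            gm[j] = draw(log_prior + sum(loglik[i, j][gf[i], :] for i in range(n_f)), rng)
        parents[s] = np.concatenate([gf, gm])
    # Offspring genotypes drawn given the parental genotypes of the final sweep.
    offspring = [[sample_offspring(y[i][j], gf[i], gm[j], mu, sigma, rng)
                  for j in range(n_m)] for i in range(n_f)]
    return parents, offspring
```

In a Monte Carlo EM iteration, the sampled genotypes would replace the exact posterior probabilities when the parameters are updated; the number of sweeps and any burn-in are tuning choices left open here.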
Example
We use data from aspen trees to illustrate the application of our statistical model for detecting a major gene. The example is derived from a 6 × 6 factorial mating design of Populus tremuloides and P. tremula. The parents for the crosses were randomly selected from a mixed breeding population of the two species established at the University of Minnesota. Originally, the P. tremuloides trees used as the female parents came from the Upper Peninsula of Michigan and northern Wisconsin, and the P. tremula male parents came from northeastern Poland and Germany [21]. The seedlings from the progeny population were planted at two different locations, one on a cutover forest site near Grand Rapids, Minnesota, and the other on farmland near Rhinelander, Wisconsin. Both trials were laid out in a randomized complete block design with 10 replicates and six seedlings per family in each replicate at 2.5 × 2.5 spacing. At the end of the second year in the field, all trees were measured for stem height.
The effect and genotypic frequencies of the major gene are estimated using the hierarchical model assuming two alleles. Since no significant difference is detected between the two locations, we report only the results based on the pooled data. With probability of almost one, parents TA-1-91, TA-2-68, TA-4-91 and XTA-3-6 have genotype A _{1} A _{1}, parents TA-1-68, TA-1-75, CLONE-5 and TA-5-61 have genotype A _{1} A _{2}, and the remaining parents have genotype A _{2} A _{2}. The estimates of the genotypic values of A _{1} A _{1}, A _{1} A _{2} and A _{2} A _{2} are μ̂ _{11} = 56.7 ± 0.80, μ̂ _{12} = 40.6 ± 0.32 and μ̂ _{22} = 30.0 ± 1.02, respectively. The estimate of the residual variance is σ̂^{2} = 218.676 ± 6.09. Here, the standard errors of the parameter estimators are obtained from the Fisher information matrix [22]. We also fitted the model under the null hypothesis that there is no major gene, i.e. with a single normal distribution. The likelihood ratio test is highly significant (p < 0.0001), suggesting the existence of a major gene for height growth. The ratio of the dominance effect to the additive effect for this major gene is about 0.2 in absolute value (from the rounded estimates, a = (μ̂ _{11} - μ̂ _{22})/2 = 13.35 and d = μ̂ _{12} - (μ̂ _{11} + μ̂ _{22})/2 = -2.75, so |d/a| ≈ 0.21), which implies that the gene affects stem height growth in a largely additive manner.
The population genetic parameters of this major gene are estimated from the inferred genotypes of the parents. The estimated genotype frequencies are identical for the three genotypes, i.e. p̂ _{11} = p̂ _{12} = p̂ _{22} = 1/3. Consequently, the allele frequencies are also the same (p̂ _{1} = p̂ _{2} = 1/2). Thus, the mixed breeding population of P. tremuloides and P. tremula is not at Hardy-Weinberg equilibrium, because the genotype frequencies are not the products of the corresponding allele frequencies. From p _{κι} = p _{κ} p _{ι} + d _{κι}, the coefficient of Hardy-Weinberg disequilibrium for the homozygotes is estimated as d̂ _{11} = p̂ _{11} - p̂ _{1}^{2} = 1/3 - 1/4 = 1/12. The estimates of allele frequencies and Hardy-Weinberg disequilibrium from our hierarchical model can be generalized to the original breeding population from which the parents were sampled. However, the small sample size (12 parents) limits the accuracy of these estimators.
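The arithmetic behind these population parameters can be checked with a few lines of code. The sketch below takes genotype counts among the sampled parents (four of each genotype in this example) and returns the frequency of A1 together with a disequilibrium coefficient, assuming the two-allele definition D = P11 - p1^2; the function name is hypothetical.

```python
from fractions import Fraction


def hwd_from_counts(n11, n12, n22):
    """Allele frequency of A1 and HWD coefficient from parental genotype counts.

    Assumes the two-allele definition D = P11 - p1^2 (an assumption made here).
    Exact fractions are used so that small-sample results are not blurred by rounding.
    """
    n = n11 + n12 + n22
    p11, p12 = Fraction(n11, n), Fraction(n12, n)
    p1 = p11 + p12 / 2          # each heterozygote carries one copy of A1
    return p1, p11 - p1 * p1


p1, d = hwd_from_counts(4, 4, 4)   # 12 parents, four per genotype
```

With four parents of each genotype this gives p1 = 1/2 and D = 1/12; a population at Hardy-Weinberg equilibrium with the same allele frequency would instead show genotype frequencies of 1/4, 1/2 and 1/4.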
Simulation
The behavior of the EM estimators is evaluated by two simulation experiments. In the first simulation, a 6 by 6 mating design, the same as in our real example, is assumed. The major gene is assumed to have two codominant alleles, A _{1} and A _{2}, whose frequencies in the population are p _{1} = 0.6 and p _{2} = 0.4. The genotypic values are set to 65, 40 and 25, with a standard deviation of 15. Based on the hierarchical model defined in equation (1), we first simulated a set of parental genotypes (A _{1} A _{1}, A _{1} A _{1}, A _{1} A _{2}, A _{1} A _{1}, A _{1} A _{2}, A _{1} A _{1} for females and A _{1} A _{2}, A _{2} A _{2}, A _{2} A _{2}, A _{1} A _{1}, A _{1} A _{2}, A _{1} A _{2} for males). We then randomly generated 50 offspring genotypes and observations for each family.
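A data set of this kind can be generated along the following lines. This is a sketch rather than the authors' simulation code: genotypes are coded 0, 1, 2 by their number of A2 alleles, the parental genotypes are the ones listed above, and the function name and random seed are arbitrary choices.

```python
import numpy as np

# Genotype codes: 0 = A1A1, 1 = A1A2, 2 = A2A2 (number of A2 alleles carried).
TRANSMIT_A2 = np.array([0.0, 0.5, 1.0])  # probability of passing on A2


def simulate_design(females, males, mu, sigma, n_per_family, rng):
    """Simulate offspring genotypes and phenotypes for a full factorial design."""
    n_f, n_m = len(females), len(males)
    geno = np.empty((n_f, n_m, n_per_family), dtype=int)
    for i, gf in enumerate(females):
        for j, gm in enumerate(males):
            # Each parent transmits A2 independently to each offspring.
            from_mother = rng.binomial(1, TRANSMIT_A2[gf], n_per_family)
            from_father = rng.binomial(1, TRANSMIT_A2[gm], n_per_family)
            geno[i, j] = from_mother + from_father
    # Phenotype = genotypic value + normal error.
    y = rng.normal(np.asarray(mu)[geno], sigma)
    return geno, y


females = [0, 0, 1, 0, 1, 0]  # A1A1, A1A1, A1A2, A1A1, A1A2, A1A1
males = [1, 2, 2, 0, 1, 1]    # A1A2, A2A2, A2A2, A1A1, A1A2, A1A2
rng = np.random.default_rng(2002)
geno, y = simulate_design(females, males, mu=[65.0, 40.0, 25.0],
                          sigma=15.0, n_per_family=50, rng=rng)
```

The arrays `geno` and `y` are indexed by female parent, male parent and offspring; only `y` would be passed to the estimation procedure, with `geno` kept aside to check the inferred genotypes.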
Our EM estimating procedure converged to the correct parental genotypes after only a few iterations (normally 2 or 3). A likelihood ratio test is highly significant, with a log-likelihood ratio of 125 (df = 4). The estimates of the genotypic values from a single simulation replicate are 64.1 ± 0.74 for A _{1} A _{1}, 39.5 ± 0.59 for A _{1} A _{2} and 26.8 ± 1.64 for A _{2} A _{2}. The estimate of the residual variance is 241.6 ± 10.30.
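To give a sense of scale for such a test, the sketch below evaluates the upper tail of a chi-square distribution in closed form for even degrees of freedom. Two assumptions are made here and are not taken from the text: the reported value of 125 is treated as the likelihood ratio statistic itself, and the chi-square reference distribution is taken at face value, although likelihood ratio statistics for mixture models need not follow it exactly [17]. The function name is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0          # k = 0 term of the sum
    for k in range(1, df // 2):
        term *= half / k            # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(125.0, 4)  # statistic and df as assumed above
```

Under these assumptions the tail probability is far below any conventional significance level, in line with the description of the test as highly significant.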
In our second simulation, we studied a 5 by 5 mating design and used 25 replications to investigate the empirical accuracy of the estimators. We assumed that the major gene has two alleles and generated the parental genotypes at random based on an allele frequency of p _{1} = 0.5. This resulted in female genotypes (A _{1} A _{1}, A _{1} A _{2}, A _{2} A _{2}, A _{1} A _{2}, A _{2} A _{2}) and male genotypes (A _{1} A _{2}, A _{1} A _{1}, A _{2} A _{2}, A _{1} A _{1}, A _{1} A _{2}). For each family, we simulated the genotypes of 50 offspring and then simulated their trait measurements using a normal and a log-normal distribution.
Discussion
We have derived a hierarchical model for detecting major genes affecting a quantitative trait based on progeny tests of outcrossing species. The new model takes into account the population genetic properties of genes and is expected to enhance the accuracy, precision and power of gene detection. The information about the population behavior of genes estimated from this model is fundamentally important to our understanding of the genetic architecture of a natural or experimental population and is also useful for the design of sound breeding strategies for agricultural crops and forest trees.
We used an example from interspecific aspen hybrids to illustrate the application of our gene-detecting model. The model has successfully identified a major gene with additive effects on second year stem growth in the hybrid aspens. The additive effect of genes on height growth in young poplars was also observed in a molecular marker experiment [23]. In this marker experiment, an additive quantitative trait locus was found to explain almost a quarter of the total phenotypic variation in second year height growth. The result from the aspen hybrids was consistent with that from a simulation experiment based on a mating scheme analogous to that used in our real example. The consistency of the results from the simulation study suggests that the model proposed here can be adequately used to detect genes of large effect on the phenotypes.
The model will have immediate applications for many species in which progeny tests have been established for a number of years [24]. Most of these tests are maintained at different locations and measured annually, thus offering desirable opportunities to address two important questions: (1) How do genes interact with the environment? (2) What is the developmental plasticity of gene expression? In our real example, no evidence has been obtained for a change of gene expression at the detected major locus across the two field trials, despite their dramatic differences in climate, soil properties and silvicultural measures [21]. If the stability of the expression of this gene over a range of environments is confirmed by more accurate genotype determination methods, e.g. those based on genetic markers, it will have a tremendous application in breeding for stable cultivars of forest tree species.
Our analysis and simulations are based on diallelic inheritance. Multiallelic inheritance can be treated similarly, although more parameters must be introduced. Also, our model assumes the segregation of a single gene in progeny tests. The principle behind the model can be extended to consider two or more genes. Modelling multiple genes may be closer to biological reality, because linkage, epistatic interaction and genetic association among genes can then be considered. However, accounting for these relationships requires estimating a larger number of unknown genetic parameters. Finally, our model is based on a balanced factorial design. For some populations, such as mammals, mating designs are frequently unbalanced because of limited resources or reproductive difficulties. It is therefore important to extend our analysis and model to an unbalanced mating design of any complexity. For all of the extensions mentioned above, maximum likelihood approaches may not be sufficient owing to the increasing dimension of the parameter space. A more powerful computational tool, e.g., Markov chain Monte Carlo methods, may be needed [20] to make our model more tractable in real applications.
Declarations
Acknowledgements
We thank Dr. Jim Hobert, Dr. Dudley Huber, Dr. Bailian Li and Dr. Tim White for helpful discussions and reviews regarding this study. This manuscript was approved as Journal Series No. R-08032 by the Florida Agricultural Experiment Station.
Authors’ Affiliations
References
1. Hoeschele I, Uimari P, Grignola FE, Zhang Q, Gage KM: Advances in statistical methods to map quantitative trait loci in outbred populations. Genetics 1997, 147:1445–1457.
2. Balding DJ, Bishop M, Cannings C: Handbook of statistical genetics. John Wiley & Sons, New York 2001.
3. Wu RL: Mapping quantitative trait loci by genotyping haploid tissues. Genetics 1999, 152:1741–1752.
4. Wright S: Evolution and the genetics of populations. II. The theory of gene frequencies. Univ. of Chicago Press, Chicago 1969.
5. Elston RC: Segregation analysis. Adv Hum Genet 1981, 11:372–3.
6. Hoeschele I: Genetic evaluation with data presenting evidence of mixed major gene and polygenic inheritance. Theoretical and Applied Genetics 1988, 76:81–92.
7. Hoeschele I: Statistical techniques for detection of major genes in animal breeding data. Theoretical and Applied Genetics 1988, 76:311–319.
8. Le Roy P, Naveau J, Elsen JM, Sellier P: Evidence for a new major gene influencing meat quality in pigs. Genetical Research 1990, 55:33–40.
9. Knott SA, Haley CS, Thompson R: Methods of segregation analysis for animal breeding data: a comparison of power. Heredity 1992, 68:299–311.
10. Le Roy P, Elsen JM: Simple test statistics for major gene detection: a numerical comparison. Theoretical and Applied Genetics 1992, 83:635–644.
11. Loisel R, Goffinet B, Monod H, De Oca GM: Detecting a major gene in an F2 population. Biometrics 1994, 50:512–516.
12. Janss LL, Thompson R, Van Arendonk JAM: Application of Gibbs sampling for inference in a mixed major gene-polygenic inheritance model in animal population. Theoretical and Applied Genetics 1995, 91:1137–1147.
13. Janss LL, Van Arendonk JAM, Brascamp EW: Bayesian statistical analysis for presence of single genes affecting meat quality traits in a crossed pig population. Genetics 1997, 145:395–408.
14. Wu RL, Li BL, Wu SS, Casella G: A maximum likelihood-based method for mining major genes affecting a quantitative character. Biometrics 2001, 57:764–768.
15. Weir BS: Genetic Data Analysis II. Sinauer Associates, Sunderland, MA 1996.
16. Titterington DM, Smith AFM, Makov UE: Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, NY 1985.
17. Feng ZD, McCulloch CE: On the likelihood ratio test statistic for the number of components in a normal mixture with unequal variances. Biometrics 1994, 50:1158–1162.
18. Fernando RL, Stricker C, Elston RC: The finite polygenic mixed model: an alternative formulation for the mixed model of inheritance. Theoretical and Applied Genetics 1994, 88:573–580.
19. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977, 39:1–38.
20. Robert C, Casella G: Monte Carlo Statistical Methods. Springer, NY 1999.
21. Li BL, Wu RL: Heterosis and genotype x environment interactions of juvenile aspens in two contrasting sites. Can. J. Forest Res. 1997, 27:1525–1537.
22. Oakes D: Direct calculation of the information matrix via the EM algorithm. J. R. Statist. Soc. B 1999, 61:479–482.
23. Bradshaw HD, Stettler RF: Molecular-genetics of growth and development in Populus. 4 Mapping QTLs with large effects on growth, form, and phenology traits in a forest tree. Genetics 1995, 139:963–973.
24. McKeand SE, Bridgwater FE: A strategy for the third breeding cycle of loblolly pine in the southeastern US. Silvae Genetica 1998, 47:223–234.