- Proceedings
- Open access
- Published:
Potts model for haplotype associations
BMC Genetics volume 6, Article number: S64 (2005)
Abstract
Bayesian spatial modeling has become important in disease mapping and has also been suggested as a useful tool in genetic fine mapping. We have implemented the Potts model and applied it to the Genetic Analysis Workshop 14 (GAW14) simulated data. Because the "answers" were known we have analyzed latent phenotype P 1-related observed phenotypes affection status (genetically determined) and i (random) in the Danacaa population replicate 2. Analysis of the microsatellite/single-nucleotide polymorphism-based haplotypes at chromosomes 1 and 3 failed to identify multiple clusters of haplotype effects. However, the analysis of separately simulated data with postulated differences in the effects of the two clusters has yielded clear estimated division into the two clusters, demonstrating the correctness of the algorithm. Although we could not clearly identify the disease-related and the non-associated groups of haplotypes, results of both GAW14 and our own simulation encourage us to improve the efficiency and sensitivity of the estimation algorithm and to further compare the proposed method with more traditional methods.
Background
Bayesian smoothing methods, which during the recent decade have been widely used in the field of spatial epidemiology [1–3], have recently been proposed as a tool for haplotype effect estimation in fine mapping [4, 5]. Such spatial modeling of haplotype effects is based upon some measure of "similarity" between haplotypes and upon the belief that similar haplotypes would affect the phenotype in the same way. Cluster models allow us to go a step further and to group haplotypes according to the magnitude of such an effect. In this paper we report the results of implementing the Potts model [3] using the reversible jump Markov chain Monte Carlo (rjMCMC) technique. The model was applied to the Genetic Analysis Workshop 14 (GAW14) simulated microsatellite and single-nucleotide polymorphism (SNP) data on the Danacaa population replicate 2 on chromosomes 1 and 3 for the affection status (genetically determined) and phenotype i (randomly selected). The results of the Potts model are compared to the results of a more traditional conditional auto-regressive (CAR) model [1].
Methods
Let Y i denote the observed dichotomous phenotypes of a sample of i = 1, ..., I subjects, where Y i = 1 if the subject i is a case and Y i = 0 otherwise. Suppose also that for every subject a haplotype H i = h1i, h2iis determined with certainty and that some further information on the subjects such as age and sex is available in the form of a covariate matrix X. Assuming a standard logistic regression penetrance model we have:
where δ h is the effect of the haplotype h on the probability of exhibiting the studied phenotype and γ is the vector of effects of the individual covariates.
Here
and the corresponding inverse function . The CAR model of the form
has been considered (e.g., Thomas et al. [4, 5]) in the context of genetics. Here w hk are the elements of a symmetrical weight matrix with a null diagonal. In genetic modeling of haplotype effects weights can be defined as the number of alleles shared by a pair of haplotypes or some other more complex genetically based measure. This Bayesian spatial model (BYM), described in detail in Besag et al. [1] has been widely used, especially in epidemiology. More recently several clustering models have been proposed, among them the Potts model [3] of the form
where z h denotes the "cluster" to which the haplotype h belongs and is a relative risk parameter common to all the haplotypes assigned to a particular cluster z. In the absence of additional covariates X the likelihood may be written down as:
For the identification purposes we have set: δ1 <δ2 < ... <δ k , where k is the number of clusters. In this setup the number of clusters k as well as the allocation vector z also have to be estimated. In the Potts model formulation the elements of z are modelled jointly conditional on the number of clusters, k:
where
and are the number of like-labeled neighbor pairs in the configuration z and an additive normalizing constant, respectively.
For large k the normalizing constant cannot be evaluated analytically and has to be precomputed. Because of this we need to take ψ to have a discrete distribution uniform on the values {0, 0.1,..., ψ max }. Also, here we assume the prior on the number of components to be uniform on the values {1, 2, ..., k max }, but a more informative prior such as Poisson may be employed instead (e.g., to indicate preference for the smaller number of clusters).
In order to set up a full Bayesian model, we also need to assign the prior to the parameters δ and τ. If we assign to each component of the vector δ a vague normal distribution with mean 0 and precision (i.e., inverse variance) 0.012, the joint prior for δ may be written as
Commonly MCMC methods are used in fitting Bayesian models. However because the number of clusters k is unknown here, a special dimension-switching move is required along with the usual fixed dimension moves. Therefore a rjMCMC algorithm has been used here [6].
Results
Danacaa, D03S0126–D03S0127
Due to the computational complexity of the model we restricted our analysis to Danacaa population using replicate 2. Because the "answers" were known, we chose to use phenotypes affection status and i.
Affection status is determined through disease loci D1 and D2 in complex manner. However, D1 determines phenotypes b and a and D2 e, f, g. Because D1 is located at chromosome 1 and D2 at chromosome 3. For the analysis of D1-associated haplotypes we chose microsatellite markers D01S023–D01S024 and the corresponding SNP was B01TT0561. Correspondingly, for the D2 we chose D03S126 and D03S127 and SNP B03T3067. Haplotypes were constructed using neighboring markers for both microsatellite and SNP data using PEDPHASE program [7]. As a comparison we analyzed trait i, which has no genetic determinants involved using the same haplotypes. The rjMCMC algorithm has been implemented in R [8], the CAR model used for comparison was run on BUGS [9]. In each case 100,000 iterations were run, of which 50,000 were discarded as a burn-in stage. The convergence was assessed visually by graphical examination of the envelopes and traces of the chains. The haplotypes having a common allele at either of the markers were regarded as neighbors. As an example we present the results of the analysis of the microsatellite markers of the Danacaa population replicate 2, area D03S0126–D03S0127. There were a total of 7 * 8 = 56 different possible haplotypes present in the sample. The average prevalence of the affected-trait (Kofendrerd Personality Disorder) in the Danacaa population was 0. The results of both the CAR model and the Potts model, which had suggested k = 1 as the most likely number of clusters (p(k = 1) = 0.9999), are shown in Figure 1.
Additionally simulated data
In order to ensure the proper functioning of the algorithm, we have also used additionally simulated data. It was modeled on the haplotypes observed in the Danacaa population, replicate 2 D01S023–D01S024. We took the simulated haplotypes of the Danacaa population and have simulated phenotypes for the subjects based on the haplotypes they possessed. This was done by dividing haplotypes into two clusters corresponding to the pre-defined phenotype effects (δ = (-2, 0)). The haplotypes 11, 12, 21, and 22 were set to belong to the cluster with effect δ1 = -2 and the rest to the cluster with δ2 = 0. The results of the estimation using rjMCMC and the comparison with CAR model has proved satisfactory and are illustrated in Figure 2 and in Table 1. The posterior probability was estimated p(k = 2) = 0.98 and the grouping of haplotypes into the two clusters was identified correctly.
Discussion
Fine mapping is gaining importance as a tool in the search of the genetic basis of complex traits, while knowledge of the patterns of the human linkage disequilibrium is increasing. We have implemented the Bayesian spatial approach proposed in [5]. Our results for the GAW14 Danacaa population replicate 2 with microsatellite marker haplotypes in the neighborhood of disease locus D1 failed to identify any haplotype grouping. However, when applied to the data simulated for the same haplotypes with effects set up into two groups with the effects δ = (-2, 0) the model has correctly sorted haplotypes into the two clusters and estimated the effects. It can be concluded therefore that in the case of Danacaa population the model has not proved sensitive enough to detect the effect on the provided sample size. The BYM model [1], which has been widely used for example in the field of spatial epidemiology for over a decade, has been used for comparison. It has also failed to produce evidence of clustering, since the resulting 95% confidence intervals for all the haplotypes are overlapping.
There are certain technical difficulties in estimating the Potts model, one of which is the evaluation of the normalizing constant. We have used the thermodynamic integration approach as proposed by Green and Richardson [3] in conjunction with the Simpson's Rule. Other difficulties lie in constructing the efficient sampling algorithm. The Poisson model used by Green and Richardson [3] in conjunction with gamma priors on the effects leads to certain 'nice' results, but the high incidence of some phenotypes (e.g., e and f) does not allow the natural binomial distribution to be approximated by Poisson. Therefore, we had to deal with a rather unwieldy logit transformation. We plan to improve the algorithm further by searching for a better sampling distribution so as to provide better mixing and faster convergence. We found both the sampling schema and the complexity of the phenotype very challenging and because of the complex model used we have ignored ascertainment. Therefore the estimated haplotype effects reflect only the sample at hand and not the prevalence in the base population.
The similarity matrix of the haplotypes in this study is based on the number of alleles identical by state, but from the genetics point of view it would be more informative to use identity by descent information that can be obtained from other genetic computer software programs such as PEDPHASE [7]. In the future we plan to use more simulations in order to gain better understanding of the statistical properties of the Potts model in its applications to genetic fine mapping of complex diseases. Some comparison between SNPs and microsatellite markers will also be considered, provided the time required to estimate model parameters can be reduced.
Conclusion
The aim of this article was to test the usefulness of the Potts approach in the genetic analysis. Unfortunately, the results of were not encouraging because neither the Potts nor the comparable BYM model found any haplotype grouping. However, as noted in the discussion, we believe that the approach may work in certain situations. More investigation is needed to determine the conditions under which the proposed approach may prove useful.
Abbreviations
- BYM:
-
Bayesian spatial model
- CAR:
-
Conditional auto-regressive
- GAW14:
-
Genetic Analysis Workshop 14
- rjMCMC:
-
Reversible jump Markov chain Monte Carlo
- SNP:
-
Single-nucleotide polymorphism
References
Besag J, York J, Mollie A: Bayesian image restoration with two applications in spatial statistics. Ann Inst Stat Math. 1991, 43: 1-59. 10.1007/BF00116466.
Elliot P, Wakefield JC, Best NG, Briggs DJ: Spatial Epidemiology: Methods and Applications. 2001, United Kingdom: Oxford University Press
Green PJ, Richardson S: Hidden Markov models and disease mapping. J Am Stat Assoc. 2002, 97: 1055-1070. 10.1198/016214502388618870.
Thomas DC, Morrison J, Clayton DG: Bayes estimates of haplotype effects. Genet Epidemiol. 21 (Suppl 1): S712-S717.
Thomas DC, Stram DO, Conti D, Molitor J, Marjoram P: Bayesian spatial modeling of haplotype associations. Hum Hered. 2003, 56: 32-40. 10.1159/000073730.
Green PJ: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995, 82: 711-732. 10.2307/2337340.
Li J, Jiang T: Efficient inference of haplotypes from genotype on pedigree. J Bioinform Comput Biol. 2003, 1: 41-69. 10.1142/S0219720003000204.
The R project for Statistical Computing. [http://www.r-project.org]
The BUGS project – WinBUGS. [http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml]
Acknowledgements
This work is partly supported by the postgraduate school of COMAS (University of Jyväskylä, EVM), the Juvenile Diabetes Research Foundation International, the Academy of Finland (51224, 51225, 46558), and the Sigrid Juselius Foundation (LH and JP).
Author information
Authors and Affiliations
Corresponding author
Additional information
Elena V Moltchanova, Janne Pitkäniemi contributed equally to this work.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Moltchanova, E.V., Pitkäniemi, J. & Haapala, L. Potts model for haplotype associations. BMC Genet 6 (Suppl 1), S64 (2005). https://doi.org/10.1186/1471-2156-6-S1-S64
Published:
DOI: https://doi.org/10.1186/1471-2156-6-S1-S64