Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Potts model for haplotype associations
 Elena V Moltchanova†^{1},
 Janne Pitkäniemi†^{2} and
 Laura Haapala^{3}
DOI: 10.1186/1471-2156-6-S1-S64
© Moltchanova et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
Bayesian spatial modeling has become important in disease mapping and has also been suggested as a useful tool in genetic fine mapping. We have implemented the Potts model and applied it to the Genetic Analysis Workshop 14 (GAW14) simulated data. Because the "answers" were known, we analyzed the latent phenotype P1-related observed phenotypes, affection status (genetically determined) and i (random), in the Danacaa population, replicate 2. Analysis of the microsatellite/single-nucleotide polymorphism-based haplotypes at chromosomes 1 and 3 failed to identify multiple clusters of haplotype effects. However, the analysis of separately simulated data with postulated differences in the effects of the two clusters yielded a clear estimated division into the two clusters, demonstrating the correctness of the algorithm. Although we could not clearly identify the disease-related and the non-associated groups of haplotypes, the results of both GAW14 and our own simulation encourage us to improve the efficiency and sensitivity of the estimation algorithm and to further compare the proposed method with more traditional methods.
Background
Bayesian smoothing methods, which over the past decade have been widely used in the field of spatial epidemiology [1–3], have recently been proposed as a tool for haplotype effect estimation in fine mapping [4, 5]. Such spatial modeling of haplotype effects is based upon some measure of "similarity" between haplotypes and upon the belief that similar haplotypes affect the phenotype in the same way. Cluster models allow us to go a step further and to group haplotypes according to the magnitude of such an effect. In this paper we report the results of implementing the Potts model [3] using the reversible jump Markov chain Monte Carlo (rjMCMC) technique. The model was applied to the Genetic Analysis Workshop 14 (GAW14) simulated microsatellite and single-nucleotide polymorphism (SNP) data on the Danacaa population replicate 2 on chromosomes 1 and 3 for the affection status (genetically determined) and phenotype i (randomly selected). The results of the Potts model are compared to the results of a more traditional conditional autoregressive (CAR) model [1].
Methods
Let Y_{i} denote the observed dichotomous phenotypes of a sample of i = 1, ..., I subjects, where Y_{i} = 1 if subject i is a case and Y_{i} = 0 otherwise. Suppose also that for every subject a haplotype pair H_{i} = (h_{1i}, h_{2i}) is determined with certainty and that further information on the subjects, such as age and sex, is available in the form of a covariate matrix X. Assuming a standard logistic regression penetrance model we have:
where δ_{h} is the effect of the haplotype h on the probability of exhibiting the studied phenotype and γ is the vector of effects of the individual covariates.
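A penetrance model consistent with these definitions can be sketched in code. The sketch below assumes that the two haplotype effects add on the logit scale and that there is no separate intercept; neither assumption is confirmed by the text here. The function name `penetrance`, the haplotype labels, and the effect values are hypothetical and chosen only for illustration.

```python
import math


def penetrance(hap_pair, delta, x=(), gamma=()):
    """P(Y_i = 1) under an assumed additive logistic model.

    hap_pair : the two haplotype labels (h_1i, h_2i) of subject i
    delta    : dict mapping haplotype label -> effect delta_h (logit scale)
    x, gamma : covariate values of subject i and their effects (optional)
    """
    h1, h2 = hap_pair
    # Linear predictor: sum of both haplotype effects plus covariate terms.
    eta = delta[h1] + delta[h2] + sum(xj * gj for xj, gj in zip(x, gamma))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical effects: haplotype "A" lowers the log-odds, "B" is neutral.
delta = {"A": -2.0, "B": 0.0}
p_bb = penetrance(("B", "B"), delta)
p_ab = penetrance(("A", "B"), delta)
p_aa = penetrance(("A", "A"), delta)
```

With these illustrative values a subject carrying two neutral haplotypes has probability 0.5 of being a case, and each copy of "A" lowers that probability.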
For the haplotype effects δ_{h}, a conditional autoregressive (CAR) prior, in which each effect is smoothed toward a weighted average of the effects of the other haplotypes, has been considered (e.g., Thomas et al. [4, 5]) in the context of genetics. Here w_{hk} are the elements of a symmetric weight matrix with a null diagonal. In genetic modeling of haplotype effects, the weights can be defined as the number of alleles shared by a pair of haplotypes or as some other, more complex, genetically based measure. This Bayesian spatial model (BYM), described in detail in Besag et al. [1], has been widely used, especially in epidemiology. More recently several clustering models have been proposed, among them the Potts model [3] of the form
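As an illustration of the simplest weight definition mentioned above, the following sketch builds a symmetric weight matrix with a null diagonal from the number of alleles two haplotypes share identically by state. Representing a haplotype as a tuple of allele labels, and the names `shared_alleles` and `weight_matrix`, are assumptions made for this example only.

```python
def shared_alleles(hap_a, hap_b):
    """Count marker positions at which two haplotypes carry the same allele."""
    return sum(a == b for a, b in zip(hap_a, hap_b))


def weight_matrix(haplotypes):
    """Symmetric matrix of pairwise allele sharing with zeros on the diagonal."""
    n = len(haplotypes)
    w = [[0] * n for _ in range(n)]
    for h in range(n):
        for k in range(h + 1, n):
            # Fill both triangles at once so the matrix stays symmetric.
            w[h][k] = w[k][h] = shared_alleles(haplotypes[h], haplotypes[k])
    return w


# Hypothetical two-marker haplotypes, each marker with alleles 1 and 2.
haplotypes = [(1, 1), (1, 2), (2, 1), (2, 2)]
W = weight_matrix(haplotypes)
```

Haplotypes that differ at every marker, such as (1, 1) and (2, 2), receive weight 0 and therefore do not borrow strength from each other directly.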
where z_{h} denotes the "cluster" to which the haplotype h belongs and δ_{z} is a relative risk parameter common to all the haplotypes assigned to a particular cluster z. In the absence of additional covariates X the likelihood may be written down as:
For identification purposes we have set δ_{1} < δ_{2} < ... < δ_{k}, where k is the number of clusters. In this setup the number of clusters k as well as the allocation vector z also have to be estimated. In the Potts model formulation the elements of z are modelled jointly conditional on the number of clusters, k:
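One common way to write such a joint prior follows the Potts formulation of Green and Richardson [3]. The form below assumes that the similarity weights w_{hj} enter the interaction term, which is not confirmed by the text here. The symbols U(z), θ_{k}(ψ), and n (the number of distinct haplotypes) are notation introduced for this sketch.

```latex
p(z \mid \psi, k) = \exp\bigl\{\psi\, U(z) - \theta_k(\psi)\bigr\},
\qquad
U(z) = \sum_{h<j} w_{hj}\, \mathbb{1}[z_h = z_j],
\qquad
\theta_k(\psi) = \log \sum_{z' \in \{1,\dots,k\}^{n}} \exp\bigl\{\psi\, U(z')\bigr\}
```

Under this form ψ = 0 gives independent, uniformly distributed allocations, while larger ψ favors assigning similar haplotypes to the same cluster. The term θ_{k}(ψ) is the logarithm of the normalizing constant, a sum over k^{n} allocation vectors.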
For large k the normalizing constant cannot be evaluated analytically and has to be precomputed. Because of this we need to take ψ to have a discrete distribution, uniform on the values {0, 0.1, ..., ψ_{max}}. Also, here we assume the prior on the number of components to be uniform on the values {1, 2, ..., k_{max}}, but a more informative prior such as the Poisson may be employed instead (e.g., to indicate a preference for a smaller number of clusters).
In order to set up a full Bayesian model, we also need to assign priors to the parameters δ and τ. If we assign to each component of the vector δ a vague normal distribution with mean 0 and precision (i.e., inverse variance) 0.01^{2}, the joint prior for δ may be written as
MCMC methods are commonly used in fitting Bayesian models. However, because the number of clusters k is unknown here, a special dimension-switching move is required along with the usual fixed-dimension moves. Therefore, an rjMCMC algorithm has been used here [6].
Results
Danacaa, D03S0126–D03S0127
Due to the computational complexity of the model we restricted our analysis to the Danacaa population, using replicate 2. Because the "answers" were known, we chose to use the phenotypes affection status and i.
Additionally simulated data
Estimation results for the simulated data. The table presents the result of Bayesian estimation for the simulated data for the Potts and CAR models.
Haplotype h  BYM δ_{h} mean  BYM δ_{h} 95% CI  'true' z_{h}  Potts z_{h} mean  Potts δ_{z} mean  Potts δ_{z} 95% CI 
11  1.3835  (2.6663, 0.3718)  1.0000  1.0358  1.8666  (2.4328, 1.3697) 
12  1.1221  (1.9223, 0.4116)  1.0000  1.0341  1.8666  (2.4328, 1.3697) 
13  0.1517  (0.9910, 0.7257)  2.0000  1.9842  0.0313  (0.1198, 0.0554) 
14  0.0130  (0.5484, 0.5487)  2.0000  1.9962  0.0313  (0.1198, 0.0554) 
21  1.9963  (3.3693, 0.9992)  1.0000  1.0008  1.8666  (2.4328, 1.3697) 
22  1.4933  (2.1043, 0.9405)  1.0000  1.0068  1.8666  (2.4328, 1.3697) 
23  0.1148  (0.7801, 0.6017)  2.0000  1.9962  0.0313  (0.1198, 0.0554) 
24  0.2165  (0.5794, 0.1610)  2.0000  1.9945  0.0313  (0.1198, 0.0554) 
31  0.9589  (2.0073, 0.0414)  2.0000  1.3716  0.0313  (0.1198, 0.0554) 
32  0.1150  (0.3489, 0.6017)  2.0000  1.9968  0.0313  (0.1198, 0.0554) 
33  0.1863  (1.1841, 0.8477)  2.0000  1.9532  0.0313  (0.1198, 0.0554) 
34  0.4010  (0.7919, 0.0027)  2.0000  1.9929  0.0313  (0.1198, 0.0554) 
41  0.1790  (0.7660, 0.4302)  2.0000  1.9958  0.0313  (0.1198, 0.0554) 
42  0.0357  (0.3165, 0.4102)  2.0000  1.9973  0.0313  (0.1198, 0.0554) 
43  0.2117  (0.2653, 0.7057)  2.0000  1.9972  0.0313  (0.1198, 0.0554) 
44  0.0071  (0.2966, 0.3009)  2.0000  1.9969  0.0313  (0.1198, 0.0554) 
Discussion
Fine mapping is gaining importance as a tool in the search for the genetic basis of complex traits, while knowledge of the patterns of human linkage disequilibrium is increasing. We have implemented the Bayesian spatial approach proposed in [5]. Our results for the GAW14 Danacaa population replicate 2 with microsatellite marker haplotypes in the neighborhood of disease locus D1 failed to identify any haplotype grouping. However, when applied to data simulated for the same haplotypes, with the effects set up in two groups with δ = (2, 0), the model correctly sorted the haplotypes into the two clusters and estimated the effects. It can therefore be concluded that, in the case of the Danacaa population, the model was not sensitive enough to detect the effect at the provided sample size. The BYM model [1], which has been widely used for over a decade, for example in the field of spatial epidemiology, was used for comparison. It also failed to produce evidence of clustering, since the resulting 95% confidence intervals for all the haplotypes overlap.
There are certain technical difficulties in estimating the Potts model, one of which is the evaluation of the normalizing constant. We have used the thermodynamic integration approach proposed by Green and Richardson [3] in conjunction with Simpson's rule. Other difficulties lie in constructing an efficient sampling algorithm. The Poisson model used by Green and Richardson [3] in conjunction with gamma priors on the effects leads to certain 'nice' results, but the high incidence of some phenotypes (e.g., e and f) does not allow the natural binomial distribution to be approximated by the Poisson. Therefore, we had to deal with a rather unwieldy logit transformation. We plan to improve the algorithm further by searching for a better sampling distribution so as to provide better mixing and faster convergence. We found both the sampling scheme and the complexity of the phenotype very challenging, and because of the complex model used we have ignored ascertainment. Therefore, the estimated haplotype effects reflect only the sample at hand and not the prevalence in the base population.
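The thermodynamic integration idea can be illustrated on a toy problem. It rests on the identity that the derivative of the log normalizing constant with respect to ψ equals the expected value of the interaction statistic U(z) under the Potts prior, so the constant at ψ is its known value at ψ = 0 plus an integral of that expectation. The sketch below evaluates the integral with Simpson's rule. It is a minimal illustration under stated assumptions, not the authors' implementation: the expectation is computed by exhaustive enumeration, which is feasible only for a handful of haplotypes, whereas in practice it would have to be estimated by simulation. All function names and the weight matrix are hypothetical.

```python
import itertools
import math


def energy(z, w):
    """U(z): total weight of haplotype pairs assigned to the same cluster."""
    n = len(z)
    return sum(w[h][j] for h in range(n) for j in range(h + 1, n) if z[h] == z[j])


def exact_log_const(psi, w, k):
    """Log normalizing constant by brute-force enumeration (toy sizes only)."""
    n = len(w)
    return math.log(sum(math.exp(psi * energy(z, w))
                        for z in itertools.product(range(k), repeat=n)))


def expected_energy(psi, w, k):
    """E[U(z)] under the Potts prior; here exact, in practice simulated."""
    num = den = 0.0
    for z in itertools.product(range(k), repeat=len(w)):
        u = energy(z, w)
        p = math.exp(psi * u)
        num += u * p
        den += p
    return num / den


def simpson(f, a, b, n_intervals):
    """Composite Simpson's rule; n_intervals must be even."""
    if n_intervals % 2:
        raise ValueError("n_intervals must be even")
    step = (b - a) / n_intervals
    total = f(a) + f(b)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * f(a + i * step)
    return total * step / 3.0


def thermo_log_const(psi, w, k, n_intervals=20):
    """Log constant at psi: n*log(k) at psi = 0 plus the integral of E[U]."""
    return len(w) * math.log(k) + simpson(
        lambda t: expected_energy(t, w, k), 0.0, psi, n_intervals)


# Hypothetical weights for four haplotypes (alleles shared identically by state).
W = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
approx = thermo_log_const(0.8, W, k=2)
exact = exact_log_const(0.8, W, k=2)
```

On this toy example the two values can be compared directly, which is a convenient check before replacing the exact expectation with a simulated one on a grid of ψ values.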
The similarity matrix of the haplotypes in this study is based on the number of alleles identical by state, but from the genetics point of view it would be more informative to use identity-by-descent information, which can be obtained from other genetic software such as PEDPHASE [7]. In the future we plan to use more simulations in order to gain a better understanding of the statistical properties of the Potts model in its applications to genetic fine mapping of complex diseases. Some comparison between SNPs and microsatellite markers will also be considered, provided the time required to estimate the model parameters can be reduced.
Conclusion
The aim of this article was to test the usefulness of the Potts approach in genetic analysis. Unfortunately, the results were not encouraging because neither the Potts nor the comparable BYM model found any haplotype grouping. However, as noted in the discussion, we believe that the approach may work in certain situations. More investigation is needed to determine the conditions under which the proposed approach may prove useful.
Notes
Abbreviations
 BYM: Bayesian spatial model
 CAR: Conditional autoregressive
 GAW14: Genetic Analysis Workshop 14
 rjMCMC: Reversible jump Markov chain Monte Carlo
 SNP: Single-nucleotide polymorphism
Declarations
Acknowledgements
This work is partly supported by the postgraduate school of COMAS (University of Jyväskylä, EVM), the Juvenile Diabetes Research Foundation International, the Academy of Finland (51224, 51225, 46558), and the Sigrid Juselius Foundation (LH and JP).
Authors’ Affiliations
References
 1. Besag J, York J, Mollie A: Bayesian image restoration with two applications in spatial statistics. Ann Inst Stat Math. 1991, 43: 1-59. 10.1007/BF00116466.
 2. Elliot P, Wakefield JC, Best NG, Briggs DJ: Spatial Epidemiology: Methods and Applications. 2001, United Kingdom: Oxford University Press.
 3. Green PJ, Richardson S: Hidden Markov models and disease mapping. J Am Stat Assoc. 2002, 97: 1055-1070. 10.1198/016214502388618870.
 4. Thomas DC, Morrison J, Clayton DG: Bayes estimates of haplotype effects. Genet Epidemiol. 2001, 21 (Suppl 1): S712-S717.
 5. Thomas DC, Stram DO, Conti D, Molitor J, Marjoram P: Bayesian spatial modeling of haplotype associations. Hum Hered. 2003, 56: 32-40. 10.1159/000073730.
 6. Green PJ: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995, 82: 711-732. 10.2307/2337340.
 7. Li J, Jiang T: Efficient inference of haplotypes from genotype on pedigree. J Bioinform Comput Biol. 2003, 1: 41-69. 10.1142/S0219720003000204.
 8. The R project for Statistical Computing. [http://www.r-project.org]
 9. The BUGS project – WinBUGS. [http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.