Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Fine-scale mapping in case-control samples using locus scoring and haplotype-sharing methods
 Keith Humphreys^{1} and
 Mark M Iles^{1}
DOI: 10.1186/1471-2156-6-S1-S74
© Humphreys and Iles; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
Both haplotype-based and locus-based methods have been proposed as the most powerful methods to employ when fine mapping by association. Although haplotype-based methods utilize more information, they may lose power as a result of overparameterization, given the large number of haplotypes possible over even a few loci. Recently, methods have been developed that cluster haplotypes with similar structure in the hope that this reflects shared genealogical ancestry. The aim is to reduce the number of parameters while retaining the genotype information relating to disease susceptibility. We have compared several haplotype-based methods with locus-based methods. We utilized 2 regions (D2 and D4) simulated to be in linkage disequilibrium and to be associated with disease susceptibility, combining 5 replicates at a time to produce 4 datasets that were analyzed. We found little difference in the performance of the haplotype-based methods and the locus-based methods in this dataset.
Background
It is widely accepted that for the fine-scale mapping of disease susceptibility loci, association-based approaches are more appropriate than linkage methods. Although genome-wide association studies are often forecast, association studies currently focus predominantly on relatively small candidate regions. Such regions are suggested either by strong evidence from linkage studies or from functional arguments and are typically densely genotyped. More information about a candidate region is retained by incorporating phase in the analysis. However, even in the presence of substantial linkage disequilibrium (LD) many haplotypes may exist, and it has been suggested that for this reason haplotype-based studies lack power compared to 'locus-scoring' approaches [1]. Recent approaches [2, 3] have sought to circumvent this problem by grouping similar haplotypes in the hope that such similarity will reflect a shared ancestry. Thus the parameter space can be reduced while, it is hoped, retaining that phase information relevant to disease susceptibility. Locus-based methods have been similarly refined to incorporate information on the estimated genealogical structure prior to formal testing [4]. The power of such approaches has been examined by Clayton et al. [5].
Using the Genetic Analysis Workshop 14 (GAW14) data we have compared the performances of several 'locus-scoring' and 'haplotype-sharing' approaches for detecting and localizing gene-trait association using case-control studies. Such methods should be extendable to nuclear family data but as yet appropriate methodology has not been developed.
Methods
The first 3 methods we investigated are what we term 'locus-scoring' methods and do not utilize haplotype information. The last 3 are 'haplotype-sharing' methods and analyze the data by clustering similar haplotypes to reduce the dimensionality of the data. The 6 methods that we have implemented are denoted (i)–(vi) and are described below.
(i), (ii), (iii) Logistic regression is widely employed for the modelling of association between genes and binary traits [1, 6]. Here the binary trait Y is regressed on the genotypes at the M diallelic loci under consideration, some of which may be etiological. The m^{th} genotype is coded X_{m} according to the number of rare alleles at the m^{th} locus. The model is
logit P(Y = 1) = \beta_{0} + \sum_{m=1}^{M} \beta_{m} X_{m}    (1)
We implicitly assume multiplicative penetrance. We consider models with M = 1 (i), M = 6 (ii) and M = 6 with 5 pairwise (adjacent loci) interactions (iii).
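The covariate coding for the three locus-scoring models can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `design_row`, `n_genetic_parameters` and `case_probability` are hypothetical, and the coefficient vector is assumed to be supplied by whatever fitting routine is used.

```python
import math
from typing import List, Sequence


def design_row(genotypes: Sequence[int], interactions: bool = False) -> List[float]:
    """Covariates for one individual over a window of diallelic loci.

    genotypes[m] is the number of rare alleles (0, 1 or 2) at locus m.
    Returns [intercept, main effects..., adjacent-pair products...].
    """
    row = [1.0] + [float(g) for g in genotypes]
    if interactions:
        # Products of adjacent loci only: (1,2), (2,3), ..., (M-1,M).
        row += [float(a * b) for a, b in zip(genotypes, genotypes[1:])]
    return row


def n_genetic_parameters(window: int, interactions: bool) -> int:
    """Number of genetic parameters, excluding the intercept."""
    return window + (window - 1 if interactions else 0)


def case_probability(beta: Sequence[float], row: Sequence[float]) -> float:
    """P(Y = 1) under the logistic model, for a given coefficient vector."""
    eta = sum(b * x for b, x in zip(beta, row))
    return 1.0 / (1.0 + math.exp(-eta))


# Method (i): one locus; (ii): six loci; (iii): six loci plus adjacent pairs.
row_i = design_row([2])
row_ii = design_row([0, 1, 2, 1, 0, 2])
row_iii = design_row([0, 1, 2, 1, 0, 2], interactions=True)
```

With a window of 6 loci, `n_genetic_parameters(6, True)` gives 11, matching the 6 main effects plus 5 adjacent-pair interactions of model (iii).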
(iv) For sliding windows of M single-nucleotide polymorphisms (SNPs) Durrant et al. [3] suggest grouping similar haplotypes using hierarchical clustering. This requires calculation of a distance measure. For the case of no missing data and denoting alleles as 0 or 1, the distance between haplotypes i and j is measured as
where p_{m} is the (observed) frequency of allele 1 at locus m and , where I(.) represents an indicator variable and denotes the allele at locus m of the haplotype. Durrant et al. [3] recommend performing the hierarchical clustering then fitting logistic regression models using haplotype cluster membership as covariates. They search for the optimal association across different numbers of clusters and SNP window sizes and apply a Bonferroni correction.
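The clustering step can be illustrated with a small sketch. Several choices here are assumptions rather than details taken from the text: the distance is an unweighted proportion of mismatching loci, used as a stand-in for the allele-frequency-weighted measure; average linkage is assumed; and the names `mismatch_distance`, `cluster_haplotypes` and `membership` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Haplotype = Tuple[int, ...]


def mismatch_distance(h1: Haplotype, h2: Haplotype) -> float:
    """Stand-in distance: proportion of loci at which two haplotypes differ."""
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)


def cluster_haplotypes(
    haps: Sequence[Haplotype],
    k: int,
    dist: Callable[[Haplotype, Haplotype], float] = mismatch_distance,
) -> List[List[Haplotype]]:
    """Agglomerative clustering (average linkage), stopped at k clusters."""
    # Start with one cluster per distinct haplotype, keeping first-seen order.
    clusters = [[h] for h in dict.fromkeys(haps)]

    def linkage(c1: List[Haplotype], c2: List[Haplotype]) -> float:
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the closest pair of clusters (i < j, so deleting j is safe).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def membership(haps: Sequence[Haplotype], clusters: List[List[Haplotype]]) -> List[int]:
    """Cluster index of each haplotype, usable as a categorical covariate."""
    index = {h: c for c, members in enumerate(clusters) for h in members}
    return [index[h] for h in haps]


haps = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0), (0, 0, 0, 0)]
clusters = cluster_haplotypes(haps, k=2)
labels = membership(haps, clusters)
```

The resulting labels would then enter a logistic regression as cluster-membership covariates, with the fit repeated over different values of `k` and window sizes.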
(v), (vi) We have modified approach (iv) by considering 2 alternative measures of similarity. The similarity between a pair of haplotypes is now measured without restriction to a window of markers. The distance measure used in (v) is based upon the length of the segment shared identically by state (IBS) around a putative locus in the studied region. Distance is measured simply as 1 - L_{1}/L_{2}, where L_{1} is the number of consecutive alleles shared either side of the putative locus, and L_{2} is the total number of markers in the region being studied. The putative locus is assumed to be located between a pair of adjacent markers. Each marker interval in the region is tested in turn as the putative locus. This is approach (v). Method (vi) modifies the distance measure used in (v) by incorporating allele frequency weights in a similar manner to (2). If k markers a, ..., a+k-1 are shared IBS then the distance is measured as
The clustering proceeds as in method (iv). Note that we do not incorporate physical distances between markers into our measures of haplotype distances. One approach for measuring haplotype distances that incorporates marker distance information is described by Molitor et al. [7].
Other, more flexible, approaches to haplotype clustering have been proposed but are computationally demanding and have not been included here for that reason. Thomas et al. [2], for example, propose assigning haplotypes to clusters probabilistically, using the Potts model and using reversible jump Markov chain Monte Carlo (MCMC) methods to update the number of clusters and the location of the variant. This is more flexible because it allows partitions other than those formed by cutting at various points on the dendrogram/genealogical tree; it instead attaches higher prior weight to more likely partitions of haplotypes.
Results
We focused on susceptibility regions D2 and D4, simulated to be in LD. To ensure sufficient power we created four datasets of 500 cases (one affected individual from each family) and 250 controls by combining 5 replicates at a time from the Danacaa population. These datasets are referred to as Study 1 to Study 4. We used 3 locus-scoring methods. Firstly, we tested a single locus at a time, referred to as method (i). (For region D2 SNPs 1–27 refer to B03T3041 to C04R0282; and for D4 SNPs 1–38 refer to B09T8321 to B09T8360). Then, using a window covering 6 markers, we applied two variants: the first, (ii), incorporated only single-locus main effects from each of the 6 markers, and the second, (iii), additionally included pairwise interactions between adjacent loci (11 parameters).
Haplotypes were reconstructed from the available genotype data using maximum likelihood methods. Haplotypes were then grouped by hierarchical clustering, defining similarity either by the number of loci shared in common within the window (iv), or the maximum continuous length shared in common without weighting for allele frequency (v), and (inversely) weighting for allele frequency (vi). We tested different numbers of partitions and chose the optimal partition.
In order to better understand our results and, ultimately, the factors that determine the relative performances of the different methods, we investigated the structure of the LD in the studied regions. It is not clear how LD across the two studied regions can be formally compared. The extent of haplotype diversity can be informally judged by the proportion of theoretically possible haplotypes that are actually observed. We noticed that this proportion was markedly higher in the D4 region than it was in the D2 region. If we infer from this that the LD structure is more complex in D4 than D2, we might expect the haplotype methods to perform slightly better in D4 than in D2. We saw some evidence of this, but we recognize that the LD is confounded with the disease model, which differs between the two regions. Below we discuss possible approaches for more formally assessing the performance of LD mapping methods in relation to the LD structure of regions being studied.
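The informal diversity measure described above can be computed as in the following sketch. Restricting the calculation to a window of markers is an assumption made here to keep the number of possible haplotypes (2 to the power of the window width, for diallelic markers) manageable; the function name and its parameters are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def observed_haplotype_proportion(
    haps: Sequence[Tuple[int, ...]],
    start: int = 0,
    width: Optional[int] = None,
) -> float:
    """Proportion of theoretically possible haplotypes that are observed.

    Considers markers start, ..., start + width - 1 of each haplotype;
    with diallelic markers there are 2 ** width possible haplotypes.
    """
    if width is None:
        width = len(haps[0]) - start
    observed = {h[start:start + width] for h in haps}
    return len(observed) / 2 ** width


haps = [(0, 0, 1), (0, 0, 1), (1, 1, 0), (0, 1, 0), (1, 1, 1)]
whole_region = observed_haplotype_proportion(haps)            # 4 of 8
first_two = observed_haplotype_proportion(haps, 0, width=2)   # 3 of 4
```

Note that the number of distinct haplotypes can never exceed the number of sampled chromosomes, so the proportion is only informative for windows that are short relative to the sample size.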
Conclusion
Although our results are not encouraging for researchers developing haplotype-sharing methods, it is important to be aware that such methods are dependent on haplotypes with similar risks having a shared ancestry, and the data considered here were not generated in such a way. Even in the D2 region, where the disease susceptibility haplotypes were chosen to be similar, a shared population history was not explicitly modelled and so it is difficult to know how well our results would generalize to real-world problems.
More thorough comparisons of the different strategies for fine mapping are needed to understand which tests are most appropriate and powerful in which situations. The types of comparisons that we have in mind are not possible using only the GAW14 simulated datasets. One approach would be to compare the performances of different methods for LD mapping when phenotypic data is simulated on the basis of different LD structures, in terms of conditional independence structures of markers in genomic regions. We are currently considering the use of log-linear models [10] for the simulation of LD structure. Log-linear models express logarithms of expected cell counts (in this case, cell counts are haplotype counts) in terms of a linear predictor including main effects and interaction terms (up to an order equal to the number of loci). We vary the highest order of interaction included in the model, as well as the strength of the interaction terms, and study the performance of the various LD mapping methods under different scenarios. The idea is not entirely new. Clayton et al. [5] have recently examined the extent to which phase is relevant to association, comparing haplotype-based and locus-based tests more formally, with the use of (linear) graphical models. They have considered some simple scenarios, such as under complete LD (where the value of Lewontin's D', but not R^{2}, is equal to 1 between every pair of markers). The use of graphs in connection with log-linear models is described by Edwards [10]. The choice of test may ultimately be best guided by LD structure within a region, and it is hoped that the types of studies which we have described can shed some light on how to do this in practice.
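A log-linear model for haplotype frequencies of the kind described above can be sketched as follows. This is an illustrative simplification, not the authors' simulation code: it includes only main effects and pairwise interactions (the text allows interactions up to an order equal to the number of loci), and the names `loglinear_haplotype_probs` and `simulate_haplotypes`, together with their parameters, are hypothetical.

```python
import math
import random
from itertools import product
from typing import Dict, List, Sequence, Tuple

Haplotype = Tuple[int, ...]


def loglinear_haplotype_probs(
    main: Sequence[float],
    pair: Dict[Tuple[int, int], float],
) -> Dict[Haplotype, float]:
    """Haplotype probabilities from a log-linear model.

    main[m] is the main effect of allele 1 at locus m; pair[(m, n)] is the
    interaction added when loci m and n both carry allele 1.
    """
    n_loci = len(main)
    weights = {}
    for h in product((0, 1), repeat=n_loci):
        eta = sum(main[m] * h[m] for m in range(n_loci))
        eta += sum(lam * h[m] * h[n] for (m, n), lam in pair.items())
        weights[h] = math.exp(eta)  # log expected count is the linear predictor
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def simulate_haplotypes(
    probs: Dict[Haplotype, float], n: int, seed: int = 0
) -> List[Haplotype]:
    """Draw n haplotypes from the model (multinomial sampling)."""
    rng = random.Random(seed)
    haps = list(probs)
    return rng.choices(haps, weights=[probs[h] for h in haps], k=n)


# No interactions: loci are independent, so all 8 haplotypes are equally likely.
independent = loglinear_haplotype_probs([0.0, 0.0, 0.0], {})
# A positive interaction between loci 0 and 1 induces positive LD between them.
linked = loglinear_haplotype_probs([0.0, 0.0, 0.0], {(0, 1): 1.5})
sample = simulate_haplotypes(linked, n=100)
```

Setting an interaction term to zero corresponds to removing an edge from the associated graph, which is how varying the order and strength of the interactions translates into different conditional independence structures among markers.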
Abbreviations
 GAW14: Genetic Analysis Workshop 14
 IBS: Identical by state
 LD: Linkage disequilibrium
 MCMC: Markov chain Monte Carlo
 SNP: Single-nucleotide polymorphism
References
 1. Chapman JM, Cooper JD, Todd JA, Clayton DG: Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered. 2003, 56: 18-31. 10.1159/000073729.
 2. Thomas DC, Stram DO, Conti D, Molitor J, Marjoram P: Bayesian spatial modeling of haplotype associations. Hum Hered. 2003, 56: 32-40. 10.1159/000073730.
 3. Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP: Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. Am J Hum Genet. 2004, 75: 35-43. 10.1086/422174.
 4. Seltman H, Roeder K, Devlin B: Transmission/disequilibrium test meets measured haplotype analysis: family-based association analysis guided by evolution of haplotypes. Am J Hum Genet. 2001, 68: 1250-1263. 10.1086/320110.
 5. Clayton D, Chapman J, Cooper J: The use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol. 2004, 27: 415-428. 10.1002/gepi.20032.
 6. Cordell HJ, Clayton DG: A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet. 2002, 70: 124-141. 10.1086/338007.
 7. Molitor J, Marjoram P, Thomas D: Fine-scale mapping of disease genes with multiple mutations via spatial clustering techniques. Am J Hum Genet. 2003, 73: 1368-1384. 10.1086/380415.
 8. Nyholt DR: A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet. 2004, 74: 765-769. 10.1086/383251.
 9. Westfall P, Young S: Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment. 1993, New York: John Wiley & Sons
 10. Edwards D: Introduction to Graphical Modelling. 1995, New York: Springer-Verlag
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.