Fine-scale mapping in case-control samples using locus scoring and haplotype-sharing methods
© Humphreys and Iles 2005
Published: 30 December 2005
Both haplotype-based and locus-based methods have been proposed as the most powerful methods to employ when fine mapping by association. Although haplotype-based methods utilize more information, they may lose power as a result of overparameterization, given the large number of haplotypes possible over even a few loci. Recently methods have been developed that cluster haplotypes with similar structure in the hope that this reflects shared genealogical ancestry. The aim is to reduce the number of parameters while retaining the genotype information relating to disease susceptibility. We have compared several haplotype-based methods with locus-based methods. We utilized 2 regions (D2 and D4) simulated to be in linkage disequilibrium and to be associated with disease susceptibility, combining 5 replicates at a time to produce 4 datasets that were analyzed. We found little difference in the performance of the haplotype-based methods and the locus-based methods in this dataset.
It is widely accepted that for the fine-scale mapping of disease susceptibility loci, association-based approaches are more appropriate than linkage methods. Although genome-wide association studies are often forecast, association studies currently focus predominantly on relatively small candidate regions. Such regions are suggested either by strong evidence from linkage studies or by functional arguments, and are typically densely genotyped. More information about a candidate region is retained by incorporating phase in the analysis. However, even in the presence of substantial linkage disequilibrium (LD) many haplotypes may exist, and it has been suggested that for this reason haplotype-based studies lack power compared to 'locus-scoring' approaches. Recent approaches [2, 3] have sought to circumvent this problem by grouping similar haplotypes in the hope that such similarity will reflect a shared ancestry. Thus the parameter space can be reduced while, it is hoped, retaining the phase information relevant to disease susceptibility. Locus-based methods have been similarly refined to incorporate information on the estimated genealogical structure prior to formal testing. The power of such approaches has been examined by Clayton et al.
Using the Genetic Analysis Workshop 14 (GAW14) data we have compared the performances of several 'locus-scoring' and 'haplotype-sharing' approaches for detecting and localizing gene-trait association using case-control studies. Such methods should be extendable to nuclear family data but as yet appropriate methodology has not been developed.
The first 3 methods we investigated are what we term 'locus-scoring' methods and do not utilize haplotype information. The last 3 are 'haplotype-sharing' methods and analyze the data by clustering similar haplotypes to reduce the dimensionality of the data. The 6 methods that we have implemented are denoted (i)–(vi) and are described below.
(i), (ii), (iii) Logistic regression is widely employed for the modelling of association between genes and binary traits [1, 6]. Here the binary trait Y is regressed on the genotypes at the M diallelic loci under consideration, some of which may be etiological. The genotype at the m-th locus is coded $X_m$, the number of rare alleles at that locus. The model is

$$\operatorname{logit} P(Y = 1) = \beta_0 + \sum_{m=1}^{M} \beta_m X_m \qquad (1)$$
We implicitly assume multiplicative penetrance. We consider models with M = 1 (i), M = 6 (ii) and M = 6 with 5 pair-wise (adjacent loci) interactions (iii).
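The covariate coding for models (i) to (iii) can be sketched as below. This is only an illustration under stated assumptions: the function names `design_row` and `disease_probability` and the example genotype vector are hypothetical and do not come from the original analysis. The sketch shows that a 6-marker window with adjacent pair-wise interactions yields 6 + 5 = 11 covariates, and that on the logit scale each additional rare allele multiplies the odds by a constant factor, which is what multiplicative penetrance means here.

```python
import math


def design_row(genotypes, adjacent_interactions=False):
    """Covariates for one individual (hypothetical helper).

    genotypes: rare-allele counts (0, 1 or 2) at the M loci in the window.
    With adjacent_interactions=True, the M-1 products X_m * X_(m+1)
    of neighbouring loci are appended to the main effects.
    """
    row = list(genotypes)
    if adjacent_interactions:
        row += [genotypes[m] * genotypes[m + 1]
                for m in range(len(genotypes) - 1)]
    return row


def disease_probability(intercept, coefficients, row):
    """Inverse logit of the linear predictor intercept + sum(beta * x)."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, row))
    return 1.0 / (1.0 + math.exp(-eta))


# Made-up genotypes at a window of M = 6 markers.
genotypes = [0, 1, 2, 1, 0, 1]
main_effects = design_row(genotypes)                                  # model (ii)
with_interactions = design_row(genotypes, adjacent_interactions=True)  # model (iii)
print(len(main_effects), len(with_interactions))
```

Fitting the coefficients themselves would normally be left to a statistics package; the sketch covers only the coding of the covariates.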
(iv) For sliding windows of M single-nucleotide polymorphisms (SNPs) Durrant et al.  suggest grouping similar haplotypes using hierarchical clustering. This requires calculation of a distance measure. For the case of no missing data and denoting alleles as 0 or 1, the distance between haplotypes i and j is measured as
where p m is the (observed) frequency of allele 1 at locus m and , where I(.) represents an indicator variable and denotes the allele at locus m of the haplotype. Durrant et al.  recommend performing the hierarchical clustering then fitting logistic regression models using haplotype cluster membership as covariates. They search for the optimal association across different numbers of clusters and SNP window sizes and apply a Bonferroni correction.
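The clustering step in (iv) can be illustrated with a small, self-contained sketch. Several things here are assumptions rather than the published procedure: the distance is a plain unweighted mismatch proportion (not the allele-frequency-weighted measure above), the linkage rule is average linkage, indicator covariates are built per haplotype rather than per individual, and all function names are hypothetical. The Bonferroni helper simply multiplies the smallest p-value by the number of tests performed.

```python
def mismatch_distance(h1, h2):
    """Proportion of loci in the window at which two haplotypes differ."""
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)


def cluster_haplotypes(haplotypes, n_clusters, distance=mismatch_distance):
    """Average-linkage agglomerative clustering.

    Returns a list of clusters, each a list of haplotype indices.
    """
    clusters = [[i] for i in range(len(haplotypes))]

    def linkage(c1, c2):
        total = sum(distance(haplotypes[i], haplotypes[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def membership_covariates(clusters, n_haplotypes):
    """0/1 indicator columns per haplotype, with the first cluster as baseline."""
    rows = [[0] * (len(clusters) - 1) for _ in range(n_haplotypes)]
    for c, members in enumerate(clusters[1:]):
        for i in members:
            rows[i][c] = 1
    return rows


def bonferroni(p_values):
    """Smallest p-value adjusted for the number of tests, capped at 1."""
    return min(1.0, min(p_values) * len(p_values))


# Four made-up haplotypes over a window of 4 SNPs.
haplotypes = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
clusters = cluster_haplotypes(haplotypes, n_clusters=2)
covariates = membership_covariates(clusters, len(haplotypes))
print(clusters, covariates)
```

In a full analysis the indicator columns would enter a logistic regression, and the search over numbers of clusters and window sizes would supply the p-values passed to `bonferroni`.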
(v), (vi) We have modified approach (iv) by considering 2 alternative measures of similarity. The similarity between a pair of haplotypes is now measured without restriction to a window of markers. The distance measure used in (v) is based upon the length of the segment shared identically by state (IBS) around a putative locus in the studied region. Distance is measured simply as 1-L1/L2, where L1 is the number of consecutive alleles shared either side of the putative locus, and L2 is the total number of markers in the region being studied. The putative locus is assumed to be located between a pair of adjacent markers. Each marker interval in the region is tested in turn as the putative locus. This is approach (v). Method (vi) modifies the distance measure used in (v) by incorporating allele frequency weights in a similar manner to (2). If k markers a, .., a+k-1 are shared IBS then the distance is measured as
The clustering proceeds as in method (iv). Note that we do not incorporate physical distances between markers into our measures of haplotype distances. One approach for measuring haplotype distances, incorporating marker distance information is described by Molitor et al. .
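The unweighted distance used in (v), 1-L1/L2, can be sketched as follows. The sketch assumes that L1 counts the consecutive markers shared IBS on the left of the putative locus plus those shared on the right, stopping at the first mismatch on each side; this is one reading of the description above, and the function names and example haplotypes are hypothetical. The weighted variant (vi) is not attempted here.

```python
def shared_ibs_length(h1, h2, interval):
    """Count consecutive markers shared IBS on both sides of a putative locus.

    interval = t places the putative locus between markers t-1 and t
    (0-based), so valid values are 1 .. len(h1) - 1.
    """
    left = 0
    for m in range(interval - 1, -1, -1):  # walk leftwards from the locus
        if h1[m] != h2[m]:
            break
        left += 1
    right = 0
    for m in range(interval, len(h1)):     # walk rightwards from the locus
        if h1[m] != h2[m]:
            break
        right += 1
    return left + right


def ibs_distance(h1, h2, interval):
    """Distance 1 - L1/L2, where L2 is the number of markers in the region."""
    return 1.0 - shared_ibs_length(h1, h2, interval) / len(h1)


# Two made-up haplotypes over 8 markers, differing at markers 0 and 6.
h1 = [0, 1, 1, 0, 1, 0, 0, 1]
h2 = [1, 1, 1, 0, 1, 0, 1, 1]

# Test each marker interval in turn as the putative locus.
profile = [ibs_distance(h1, h2, t) for t in range(1, len(h1))]
print(profile)
```

Because the distance depends on the putative locus, the clustering has to be repeated for every marker interval, unlike the window-based measure in (iv).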
Other, more flexible, approaches to haplotype clustering have been proposed but are computationally demanding and have not been included here for that reason. Thomas et al., for example, propose assigning haplotypes to clusters probabilistically, using the Potts model and using reversible jump Markov chain Monte Carlo (MCMC) methods to update the number of clusters and the location of the variant. This is more flexible because it allows partitions other than those formed by cutting at various points on the dendrogram/genealogical tree; it instead attaches higher prior weight to more likely partitions of haplotypes.
We focused on susceptibility regions D2 and D4, simulated to be in LD. To ensure sufficient power we created 4 datasets of 500 cases (one affected individual from each family) and 250 controls by combining 5 replicates at a time from the Danacaa population. These datasets are referred to as Study 1 to Study 4. We used 3 locus-scoring methods. First, we tested a single locus at a time, referred to as method (i). (For region D2, SNPs 1–27 refer to B03T3041 to C04R0282; for D4, SNPs 1–38 refer to B09T8321 to B09T8360.) Then, using a window covering 6 markers, we applied two variants: the first, (ii), incorporated only single-locus main effects from each of the 6 markers, and the second, (iii), additionally included pair-wise interactions between adjacent loci (11 parameters).
Haplotypes were reconstructed from the available genotype data using maximum likelihood methods. Haplotypes were then grouped by hierarchical clustering, defining similarity by the number of loci shared in common within the window (iv), by the maximum continuous length shared in common without weighting for allele frequency (v), or by the same length (inversely) weighted for allele frequency (vi). We tested different numbers of partitions and chose the optimal partition.
In order to better understand our results and, ultimately, the factors that determine the relative performances of the different methods, we investigated the structure of the LD in the studied regions. It is not clear how LD across the two studied regions can be formally compared. The extent of haplotype diversity can be informally judged by the proportion of theoretically possible haplotypes that are actually observed. We noticed that this proportion was markedly higher in the D4 region than it was in the D2 region. If we infer from this that the LD structure is more complex in D4 than D2, we might expect the haplotype methods to perform slightly better in D4 than in D2. We saw some evidence of this, but we recognize that the LD is confounded with the disease model, which differs between the two regions. Below we discuss possible approaches for more formally assessing the performance of LD mapping methods in relation to the LD structure of regions being studied.
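The informal diversity measure described above, the proportion of theoretically possible haplotypes that are actually observed, can be computed as in the sketch below. It assumes diallelic markers, so that M loci allow 2^M haplotypes; the function name and the toy data are hypothetical. For long regions 2^M quickly exceeds any sample size, so in practice the proportion would probably be more informative over short windows of markers.

```python
def haplotype_diversity(haplotypes):
    """Distinct observed haplotypes divided by the 2**M possible ones."""
    n_loci = len(haplotypes[0])
    n_distinct = len(set(tuple(h) for h in haplotypes))
    return n_distinct / 2 ** n_loci


# Six made-up haplotypes over 3 diallelic loci; 3 of the 8 possible are seen.
sample = [
    (0, 0, 0), (0, 0, 0),
    (0, 1, 1), (0, 1, 1),
    (1, 1, 1), (1, 1, 1),
]
print(haplotype_diversity(sample))
```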
Although our results are not encouraging for researchers developing haplotype-sharing methods, it is important to be aware that such methods are dependent on haplotypes with similar risks having a shared ancestry, and the data considered here were not generated in such a way. Even in the D2 region, where the disease susceptibility haplotypes were chosen to be similar, a shared population history was not explicitly modelled and so it is difficult to know how well our results would generalize to real-world problems.
More thorough comparisons of the different strategies for fine mapping are needed to understand which tests are most appropriate and powerful in which situations. The types of comparisons that we have in mind are not possible using only the GAW14 simulated datasets. One approach would be to compare the performances of different methods for LD mapping when phenotypic data are simulated on the basis of different LD structures, in terms of conditional independence structures of markers in genomic regions. We are currently considering the use of log-linear models for the simulation of LD structure. Log-linear models express logarithms of expected cell counts (in this case, cell counts are haplotype counts) in terms of a linear predictor including main effects and interaction terms (up to an order equal to the number of loci). We vary the highest order of interaction included in the model, as well as the strength of the interaction terms, and study the performance of the various LD mapping methods under different scenarios. The idea is not entirely new. Clayton et al. have recently examined the extent to which phase is relevant to association, comparing haplotype-based and locus-based tests more formally, with the use of (linear) graphical models. They have considered some simple scenarios, such as complete LD (where the value of Lewontin's D', but not $R^2$, is equal to 1 between every pair of markers). The use of graphs in connection with log-linear models is described by Edwards. The choice of test may ultimately be best guided by LD structure within a region, and it is hoped that the types of studies which we have described can shed some light on how to do this in practice.
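The distinction between Lewontin's D' and the squared correlation under complete LD can be made concrete with a short sketch using the standard two-locus definitions. The function name and the haplotype counts are hypothetical, and both loci are assumed to be polymorphic. When one of the four possible two-locus haplotypes is absent, D' equals 1 even though the squared correlation can be well below 1.

```python
def pairwise_ld(counts):
    """Return (D', r^2) for two diallelic loci.

    counts maps two-locus haplotypes (a, b), with a and b in {0, 1},
    to their observed counts. Both loci are assumed polymorphic.
    """
    n = sum(counts.values())
    p11 = counts.get((1, 1), 0) / n
    pa = (counts.get((1, 0), 0) + counts.get((1, 1), 0)) / n  # allele 1, locus A
    pb = (counts.get((0, 1), 0) + counts.get((1, 1), 0)) / n  # allele 1, locus B
    d = p11 - pa * pb
    # D is scaled by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(pa * (1 - pb), (1 - pa) * pb)
    else:
        d_max = min(pa * pb, (1 - pa) * (1 - pb))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (pa * (1 - pa) * pb * (1 - pb))
    return d_prime, r2


# Made-up counts in which haplotype (0, 1) is never observed.
counts = {(1, 1): 50, (0, 0): 30, (1, 0): 20}
d_prime, r2 = pairwise_ld(counts)
print(d_prime, r2)
```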
GAW14: Genetic Analysis Workshop 14
IBS: Identical by state
MCMC: Markov chain Monte Carlo
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.