Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
How to quantify information loss due to phase ambiguity in haplotype case-control studies
 Hae-Won Uh^{1},
 Jeanine J Houwing-Duistermaat^{1},
 Hein Putter^{1} and
 Hans C van Houwelingen^{1}
DOI: 10.1186/1471-2156-6-S1-S108
© Uh et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
Assigning haplotypes in a case-control study is a challenging problem. We propose a method to quantify the information loss due to missing phase information. We determined which individuals were responsible for the information loss, and calculated how much information could be gained if the ambiguous individuals were resolved by adding parental information.
Background
Currently the majority of association studies using single-nucleotide polymorphism (SNP) markers for complex diseases are case-control disease-marker studies. In this paper, we consider a limited number of SNPs within a candidate region, with the aim of estimating haplotype frequencies and haplotype effects on disease status. This approach requires information about how to assign haplotypes from the observed genotypes. This phase information can be inferred using statistical procedures such as the expectation-maximization (EM) algorithm.
As Hodge et al. [1] showed, in general the probability of not being able to assign haplotypes with certainty increases with the number of loci, and as the allele frequencies approach 0.5. Accepting the "best" configuration of haplotypes as the "real" haplotype without critically examining the data might lead to misleading results. It might therefore be useful to screen the data beforehand using some measure of uncertainty.
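The dependence on the number of loci and on the allele frequencies can be illustrated with a small calculation. The sketch below is not taken from Hodge et al.; it assumes HWE at each locus and, purely for illustration, independence between loci (linkage equilibrium). Under those assumptions an individual is phase-ambiguous exactly when heterozygous at two or more loci. The function name `prob_phase_ambiguous` is hypothetical.

```python
def prob_phase_ambiguous(allele_freqs):
    """Probability that an individual is heterozygous at two or more loci.

    Assumes HWE at each locus and independence between loci
    (an illustrative assumption, not part of the paper).
    """
    # Heterozygosity 2p(1-p) at each biallelic locus under HWE.
    het = [2.0 * p * (1.0 - p) for p in allele_freqs]

    # Probability of no heterozygous locus.
    p_none = 1.0
    for h in het:
        p_none *= 1.0 - h

    # Probability of exactly one heterozygous locus.
    p_one = 0.0
    for i, h in enumerate(het):
        term = h
        for j, g in enumerate(het):
            if j != i:
                term *= 1.0 - g
        p_one += term

    return 1.0 - p_none - p_one


if __name__ == "__main__":
    for m in (2, 3, 5):
        for p in (0.1, 0.3, 0.5):
            print(m, p, round(prob_phase_ambiguous([p] * m), 4))
```

With equal allele frequencies at every locus, the printed probabilities grow both with the number of loci and as the frequency moves toward 0.5, which matches the qualitative statement above.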
Some software offers an option to print out all possible haplotype configurations with their corresponding posterior probabilities. We wondered whether we could use this extra information to settle some of the current issues in haplotype analysis: how can we determine which individuals are responsible for the information loss, and how much information would we gain if parental genotypes were available?
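To make the idea of "all possible haplotype configurations with posterior probabilities" concrete, the following minimal sketch enumerates the haplotype pairs compatible with a multilocus genotype and weights them under HWE. It does not reproduce any particular software package. The genotype coding ('1', '2' for homozygotes, 'H' for a heterozygote) follows the tables in this paper; the function names and the haplotype frequencies in the example are hypothetical.

```python
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype such as 'H1H'."""
    options = []
    for g in genotype:
        if g == "H":
            # A heterozygous locus can be phased in two ways.
            options.append([("1", "2"), ("2", "1")])
        else:
            options.append([(g, g)])
    pairs = set()
    for combo in product(*options):
        h1 = "".join(a for a, _ in combo)
        h2 = "".join(b for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def posterior(genotype, freqs):
    """Posterior probability of each compatible pair under HWE."""
    weights = {}
    for h1, h2 in compatible_pairs(genotype):
        w = freqs[h1] * freqs[h2]
        if h1 != h2:
            w *= 2.0  # both orderings of an unordered pair
        weights[(h1, h2)] = w
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical haplotype frequencies, for illustration only.
    freqs = {"111": 0.30, "112": 0.10, "121": 0.25, "122": 0.05,
             "211": 0.05, "212": 0.05, "221": 0.10, "222": 0.10}
    for pair, prob in posterior("H1H", freqs).items():
        print(pair, round(prob, 3))
```

A genotype with at most one heterozygous locus yields a single pair with posterior probability 1; a genotype heterozygous at all three loci yields four candidate pairs.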
With these issues in mind, we first defined the information loss as the complete-data information (without uncertainty) minus the observed information [2]. Under the assumption of Hardy-Weinberg equilibrium (HWE), we first considered the information content of each individual according to the diagonal elements of the information matrix. To take the correlation between haplotypes into account, we then employed D-optimality [3], which maximizes the determinant of the observed information matrix. With this measure, forward stepwise selection was applied to select the individuals that potentially yield the largest gain in information.
Methods
Suppose we have a sample of n unrelated individuals from a population. For each individual we observe the multilocus genotype at m SNP loci. Under HWE, the distribution of haplotypes is assumed to be multinomial, and the joint distribution of the paired haplotypes equals the product of the two marginal distributions. A haplotype is described by a k (= 2^{m})-dimensional vector H with elements 0 or 1, and P(H_{i} = 1) = π_{i} denotes the frequency of haplotype i ∈ {1, ..., k}. If there is no uncertainty, the (ordered) haplotype pair (H_{1}, H_{2}) of individual j can be summarized by the k-vector H_{ind, j} = H_{1} + H_{2}, with elements in {0, 1, 2}, the so-called haplotype dosage. Let C denote the covariance matrix of H, which for a multinomial indicator vector is
C = diag(π) − ππ^{T}.
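As a minimal sketch of these definitions, the code below builds the k = 2^{m} haplotype labels, the dosage vector of a resolved haplotype pair, and the covariance matrix of the indicator vector H. The covariance formula used, diag(π) − ππ^{T}, is the standard result for a single multinomial draw. The function names and the uniform frequencies in the example are illustrative assumptions.

```python
from itertools import product


def haplotype_labels(m):
    """All 2**m haplotypes over alleles '1' and '2', e.g. '111', ..., '222'."""
    return ["".join(h) for h in product("12", repeat=m)]


def dosage(h1, h2, labels):
    """Haplotype dosage vector H_ind = H_1 + H_2 with elements in {0, 1, 2}."""
    return [int(h == h1) + int(h == h2) for h in labels]


def covariance(pi):
    """Covariance of a multinomial indicator vector: diag(pi) - pi pi^T."""
    k = len(pi)
    return [[(pi[i] if i == j else 0.0) - pi[i] * pi[j] for j in range(k)]
            for i in range(k)]


if __name__ == "__main__":
    labels = haplotype_labels(3)
    print(labels)
    print(dosage("111", "222", labels))  # one copy each of 111 and 222
    print(dosage("121", "121", labels))  # two copies of 121

    # Hypothetical uniform haplotype frequencies, for illustration only.
    pi = [1.0 / len(labels)] * len(labels)
    C = covariance(pi)
    print(C[0][:3])
```

Because the two haplotypes of an individual are independent under HWE, the covariance matrix of the dosage vector is 2C.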
We first investigated the diagonal elements of L_{i} in cases. Although using the trace of the L-matrix (A-optimality [3]) is an intuitive way to select individuals who need additional information, it does not account for possible correlations between the parameters. Instead, we propose to maximize the determinant of the information matrix, based on D-optimality [3].
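The difference between the two criteria can be sketched with a greedy forward selection. This is a generic illustration and not the authors' implementation. It assumes that each candidate individual comes with a hypothetical "gain" matrix, the information that would be recovered if that individual's phase were resolved; the 2 × 2 matrices below are invented for the example.

```python
def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-15:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def forward_select(observed, gains, n_pick, criterion):
    """Greedy selection: at each step add the candidate whose gain
    matrix maximizes criterion(current information + gain)."""
    current = [row[:] for row in observed]
    remaining = dict(gains)
    chosen = []
    for _ in range(min(n_pick, len(remaining))):
        best = max(remaining,
                   key=lambda name: criterion(add(current, remaining[name])))
        current = add(current, remaining.pop(best))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical observed information with correlated parameters.
    observed = [[2.0, 1.0], [1.0, 2.0]]
    # Hypothetical gains from resolving individuals "a" and "b".
    gains = {
        "a": [[0.6, 0.6], [0.6, 0.6]],
        "b": [[0.5, -0.5], [-0.5, 0.5]],
    }
    print("trace criterion:", forward_select(observed, gains, 2, trace))
    print("determinant criterion:", forward_select(observed, gains, 2, det))
```

In this toy example the trace criterion prefers individual "a", whose gain has the larger diagonal, while the determinant criterion prefers "b", whose gain reduces the correlation between the two parameters.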
Results
After performing a linkage analysis for the microsatellite markers, we analyzed SNP packet 153, which includes the microsatellite marker D03S0127 and 19 SNPs. Our example case-control data consist of 200 unrelated subjects and three loci. The case population consists of 100 affected offspring selected from the families of the Danacaa population, replicate 8. To select a suitable subregion for our purpose we used the sliding scores [4] and decided to study three-locus haplotypes based on B03T3056, B03T3057, and B03T3058. The computations were done in the programming language R [5].
Information loss per haplotype based on the diagonal of the information matrix in 100 cases, and the R^{2} measure.
nr  |  SNPs  |  Loss (cases)  |  max^{a} (cases)  |  %^{b} (cases)  |  R^{2}  |
1  |  111  |  9.51  |  37.18  |  25.57  |  0.7443  |
2  |  112  |  5.74  |  18.64  |  30.79  |  0.6921  |
3  |  121  |  8.11  |  43.28  |  18.74  |  0.8126  |
4  |  122  |  4.34  |  9.05  |  47.93  |  0.5206  |
5  |  211  |  4.49  |  10.11  |  44.39  |  0.5560  |
6  |  212  |  2.69  |  5.02  |  53.58  |  0.4641  |
7  |  221  |  5.35  |  16.07  |  33.32  |  0.6668  |
8  |  222  |  3.54  |  20.77  |  17.06  |  0.8293  |
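Loss of information in 100 cases per individual and per haplotype: '1' and '2' represent the homozygotes 1/1 and 2/2, 'H' the heterozygote 1/2. A '-' marks a haplotype that is incompatible with the genotype.
Group  |  Genotype  |  No.  |  111  |  112  |  121  |  122  |  211  |  212  |  221  |  222  |  Tot. Loss^{a}  |
1  |  HHH  |  11  |  0.241  |  0.152  |  0.139  |  0.049  |  0.049  |  0.139  |  0.152  |  0.241  |  1.163  |
2  |  H1H  |  4  |  0.249  |  0.249  |  -  |  -  |  0.249  |  0.249  |  -  |  -  |  0.996  |
3  |  HH1  |  12  |  0.246  |  -  |  0.246  |  -  |  0.246  |  -  |  0.246  |  -  |  0.984  |
4  |  1HH  |  15  |  0.194  |  0.194  |  0.194  |  0.194  |  -  |  -  |  -  |  -  |  0.774  |
5  |  H2H  |  8  |  -  |  -  |  0.091  |  0.091  |  -  |  -  |  0.091  |  0.091  |  0.363  |
6  |  HH2  |  2  |  -  |  0.083  |  -  |  0.083  |  -  |  0.083  |  -  |  0.083  |  0.331  |
7  |  2HH  |  0  |  -  |  -  |  -  |  -  |  0.195  |  0.195  |  0.195  |  0.195  |  0.780  |
OK  |  -  |  48  |  -  |  -  |  -  |  -  |  -  |  -  |  -  |  -  |  -  |
Total  |  -  |  100  |  9.506  |  5.739  |  8.113  |  4.337  |  4.490  |  2.690  |  5.354  |  3.545  |  43.77  |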
Note that the above results hold under the assumption that the ambiguous haplotypes can be resolved completely. When we actually added the parental information for these data, we could resolve about 71% of the ambiguous individuals (number of cases = 100). Because this proportion depends heavily on the structure of the data, for general use we calculated the expected loss conditional on all possible parental genotypes. Using A-optimality, approximately 65% of the information loss could be recovered on average.
Conclusions and Discussions
The expected loss considering all possible (and compatible) parental genotypes does not differ much between the genotypic groups; it does not matter whether the individual is heterozygous at two or at three loci. For example, an individual who is heterozygous at all three loci might have two heterozygous parents (HHH), or two homozygous parents (father with type 111, mother 222). The expected loss clearly depends on the allele frequencies, and hence on the structure of the data. Our ongoing investigation shows that the selection patterns also depend strongly on the questions asked; that is, whether we are interested in each group, in pooled groups, or in terms of haplotype risks in "minimizing error" or in "maximizing power".
Although selecting the informative individuals based on A-optimality is not as accurate as the method based on D-optimality, it is an intuitive way to understand the structure of the uncertainty in the data. However, in situations where the correlations between the parameters cannot be ignored, the proposed D-optimality approach might give more insight into the data. In future work, we will investigate haplotype effects on disease status and some other extensions: focusing on "interesting" haplotypes, including missing data, or studying the behavior with an increasing number of SNPs.
Abbreviations
 EM: Expectation maximization
 HWE: Hardy-Weinberg equilibrium
 SNP: Single-nucleotide polymorphism
Declarations
Acknowledgements
This paper originates from the GenomEUtwin project, which is supported by the European Union, Contract No. QLG2-CT-2002-01254.
References
 1. Hodge SE, Boehnke M, Spence MA: Loss of information due to ambiguous haplotyping. Nat Genet. 1999, 21: 360-361. 10.1038/7687.
 2. Louis T: Finding the observed information matrix when using the EM algorithm. J Roy Stat Soc B Met. 1982, 44: 226-233.
 3. Fedorov VV: Theory of Optimal Experiments. 1972, New York: Academic Press
 4. Clayton D, Jones H: Transmission/disequilibrium tests for extended marker haplotypes. Am J Hum Genet. 1999, 65: 1161-1169. 10.1086/302566.
 5. R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3
 6. Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC: Modeling and EM estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 2003, 55: 179-190. 10.1159/000073202.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.