Assessment of global phase uncertainty in case-control studies
 Hae-Won Uh^{1},
 Jeanine J Houwing-Duistermaat^{1},
 Hein Putter^{1} and
 Hans C van Houwelingen^{1}
DOI: 10.1186/1471-2156-10-54
© Uh et al; licensee BioMed Central Ltd. 2009
Received: 13 January 2009
Accepted: 14 September 2009
Published: 14 September 2009
Abstract
Background
In haplotype-based candidate gene studies, a problem is that the genotype data are unphased, which results in haplotype ambiguity. The R_{h}^{2} measure [1] quantifies haplotype predictability from genotype data. It is computed for each individual haplotype, and its minimum value over haplotypes is suggested as a measure of global relative efficiency. Alternatively, we developed methods directly based on the information content of haplotype frequency estimates to obtain two global relative efficiency measures, based on A- and D-optimality, respectively. All three methods are designed for single populations; they can be applied to cases only, controls only or the whole data set. Therefore they are not necessarily optimal for haplotype testing in case-control studies.
Results
A new global relative efficiency measure was derived to maximize the power of a simple test statistic that compares haplotype frequencies in cases and controls. Application to real data showed that our proposed method gave a clear summary measure for the case-control study conducted. Additionally, this measure might be used to select the individuals who have the highest potential for improving power by resolving phase ambiguity.
Conclusion
Instead of using a relative efficiency measure for cases only, controls only or their combined data, we link the uncertainty measure to case-control studies directly. Hence, our global efficiency measure might be useful to assess whether data are informative or have enough power for estimation of a specific haplotype risk.
Background
When assessing the relationship between haplotypes and a disease outcome, a problem is that haplotypes are not directly observed. The genotype data are unphased, which results in haplotype ambiguity. This missing phase information reduces the power of haplotype case-control studies, and the results may be misleading. Our interest is in two types of analyses, namely global test statistics to compare haplotype frequency distributions between cases and controls, and testing effects of individual haplotypes [2]. An optimal measure to quantify the amount of available information is needed for a better understanding of the results obtained. Our main aim therefore is to develop a global relative efficiency measure that is directly based on the test statistic of a case-control study.
In the planning stage of case-control association studies, haplotype-tagging SNPs are often selected to have maximal power based on a pilot study of the target population or using information drawn from the International HapMap (http://www.hapmap.org). For this purpose, Stram et al. [1] proposed R_{h}^{2}, which quantifies the predictability of an individual haplotype from genotype data. As a measure of global efficiency, it was suggested to take the minimum value over haplotypes. Alternatively, Uh et al. [3] developed multivariate methods directly based on the information content of haplotype frequency estimates. Their two global relative efficiency measures were defined as the ratio of observed information to complete-data information based on A- and D-optimality [4, 5], respectively. Nicolae [6] also proposed an A-optimality-based measure in a broader framework. The A-optimality-based measure reflects the average information of the parameters, and the R_{h}^{2} value simply relates to one diagonal element of the observed information matrix [3]. In contrast, the D-optimality-based measure takes possible correlations between the parameters into consideration. These three measures (R_{h}^{2} and the A- and D-optimality-based measures) can be used for choosing tagSNPs to maximize information content on haplotypes and to maximize the power of the planned study. However, they are designed for single populations and are not readily applicable to case-control association studies. Therefore we propose a new measure that is optimal for assessing the global relative efficiency of case-control studies using haplotypes.
O'Hely and Slatkin [7] have addressed a similar issue and provided a ratio R based on non-centrality parameters of likelihood ratio statistics. Because their methods are based on non-centrality parameters, they are closely related to the issue of sample size in a case-control study. In general, enlarging sample sizes improves the power of a study. However, we argue that increasing the number of cases and controls with the same underlying LD structure has little influence on relative efficiency with respect to phase uncertainty; i.e., resolution of haplotype phase does not depend on the sample size. Our new relative efficiency measure can therefore help to check whether data are informative enough for haplotype case-control studies and whether the results are correctly interpreted. For low values of a relative efficiency measure, haplotype-based inferences should be interpreted with caution even when sample sizes are large.
When a conducted study appears not to be informative enough for haplotype analysis (low values of the relative efficiency measure), one might want to resolve the haplotype phase. In principle, it is possible to resolve phase uncertainty either by laboratory work, which is still costly, or by additional genotyping of family members. However, is it worthwhile to make these efforts? With cost-effectiveness in mind, we propose a forward selection procedure based on our measure for pinpointing the individuals (cases or controls) who are most responsible for the loss of information due to haplotype uncertainty. These same individuals have the highest potential to increase the power of the case-control study when their haplotype phase is resolved.
We briefly describe our methods for single populations and proceed to derive methods for case-control data sets. We illustrate our methods with the Interleukin-1β Gene Cluster Data. All computational work has been done using the programming language R [8]. An R program is available at http://www.msbi.nl/uh.
Results
Application to the Interleukin-1β Gene Cluster Data
Haplotype frequency estimates of hipROA data
Nr  Haplotype  Total  Cases  Controls  
1  111  0.36  0.25  0.37 
2  112  0.08  0.15  0.07 
3  121  0.16  0.27  0.15 
4  122  0.16  0.18  0.16 
5  211  0.20  0.12  0.21 
6  212  0.02  0.03  0.02 
7  221  0.02  0.01  0.03 
Since our example data set was extremely unbalanced (61 cases versus 653 controls) and the set of cases may be too small to cover the haplotype structure completely, we generated a more balanced data set of 500 cases and 500 controls based on the real data set. To investigate the performance of the global efficiency measures, 1,000 such data sets were generated.
Global relative efficiency of the data
Global relative efficiency.
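Group  Nr of individuals: total  Nr of individuals: ambiguous  Ambiguous (%)  Per group (%): min(R_{h}^{2})  Per group (%): A-optimality  Per group (%): D-optimality  Case-control study (%): global  Case-control study (%): haplotype subset  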
hipROA  
control  653  212  32.5  59.3  86.4  89.8  82.3  92.6 
case  61  22  36.1  77.9  81.4  78.5  
Simulated data ^{1}  
control  500  174  34.8  63.7  85.4  79.8  83.2  93.3 
case  500  181  36.2  53.9  88.6  77.3 
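Since the high values of the A- and D-optimality-based measures in controls might reflect the imbalance of the data (the case-control ratio was about 1/10), we generated 500 cases and 500 controls based on the real data. The 95% confidence intervals based on 1,000 simulations were (58.4, 65.5) for min(R_{h}^{2}), (85.4, 89.5) and (71.4, 76.8) for the A- and D-optimality-based measures, and (83.0, 87.4) and (91.7, 94.8) for the global and haplotype-subset case-control measures.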
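The percentages of ambiguous individuals in the table above follow directly from the reported counts. The short Python check below is only an illustrative sketch (the authors worked in R; the function and variable names are introduced here for the example).

```python
# Counts (total, ambiguous) copied from the global relative efficiency table.
groups = {
    "hipROA controls": (653, 212),
    "hipROA cases": (61, 22),
    "simulated controls": (500, 174),
    "simulated cases": (500, 181),
}


def ambiguous_percentage(total, ambiguous):
    """Percentage of individuals whose genotype leaves the phase ambiguous."""
    return 100.0 * ambiguous / total


# Round to one decimal, as in the table.
percentages = {
    name: round(ambiguous_percentage(total, ambiguous), 1)
    for name, (total, ambiguous) in groups.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

Roughly one third of the individuals are phase-ambiguous in every group, so the differences between the efficiency measures are not explained by the amount of ambiguity alone.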
Selection of informative individuals
Selection strategy for the subset based on information without taking into account correlations between haplotype frequency estimates.
hipROA data  genotype  nr of individuals  111  112  121  122  211  212  221  loss per genotype  total loss 

Cases n = 61  1HH  10  0.25  0.25  0.25  0.25  0  0  0  1.00  
HHH  7  0  0.03  0.18  0.19  0.19  0.18  0.03  0.79  
H1H  2  0.19  0.19  0  0  0.19  0.19  0  0.77  
HH1  3  0.04  0  0.04  0  0.04  0  0.04  0.16  
no ambiguity  39  
loss per haplotype  3.00  3.07  3.85  3.83  1.85  1.62  0.31  17.52  
Controls n = 653  H1H  28  0.21  0.21  0  0  0.21  0.21  0  0.83  
HH1  46  0.18  0  0.18  0  0.18  0  0.18  0.72  
1HH  91  0.12  0.12  0.12  0.12  0  0  0  0.49  
HHH  47  0  0.04  0.06  0.10  0.10  0.06  0.04  0.40  
no ambiguity  441  
loss per haplotype  25.29  19.09  22.23  15.76  18.59  8.52  10.34  119.81  
Simulated data  genotype  nr of individuals  111  112  121  122  211  212  221  loss per genotype  total loss 
Cases n = 500  1HH  83  0.25  0.25  0.25  0.25  0  0  0  1.00  
HHH  40  0  0.03  0.11  0.13  0.13  0.11  0.03  0.55  
H1H  26  0.11  0.11  0  0  0.11  0.11  0  0.15  
HH1  32  0.04  0  0.04  0  0.04  0  0.04  0.15  
no ambiguity  319  
loss per haplotype  24.80  24.94  26.26  26.07  9.36  7.17  2.52  121.12  
Controls n = 500  H1H  25  0.23  0.23  0  0  0.23  0.23  0  0.93  
HH1  36  0.21  0  0.21  0  0.21  0  0.21  0.83  
HHH  43  0  0.05  0.06  0.11  0.11  0.06  0.05  0.44  
1HH  70  0.11  0.11  0.11  0.11  0  0  0  0.42  
no ambiguity  326  
loss per haplotype  20.68  15.23  17.65  11.89  17.86  8.61  9.55  101.47 
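The bookkeeping behind this table can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not the authors' R program: it takes the rounded per-individual losses of the hipROA cases from the table, sums them per haplotype weighted by the number of individuals, and ranks genotype classes by their loss per individual. The names `CASES`, `loss_per_haplotype` and `rank_genotypes` are introduced here for the example, and because the inputs are rounded to two decimals the totals agree with the table only approximately.

```python
HAPLOTYPES = ["111", "112", "121", "122", "211", "212", "221"]

# genotype class: (number of individuals, loss per individual for each haplotype)
# Values copied from the hipROA cases rows of the table (rounded there).
CASES = {
    "1HH": (10, [0.25, 0.25, 0.25, 0.25, 0, 0, 0]),
    "HHH": (7, [0, 0.03, 0.18, 0.19, 0.19, 0.18, 0.03]),
    "H1H": (2, [0.19, 0.19, 0, 0, 0.19, 0.19, 0]),
    "HH1": (3, [0.04, 0, 0.04, 0, 0.04, 0, 0.04]),
}


def loss_per_haplotype(table):
    """Sum the per-individual losses over all ambiguous individuals."""
    totals = [0.0] * len(HAPLOTYPES)
    for n_individuals, losses in table.values():
        for j, loss in enumerate(losses):
            totals[j] += n_individuals * loss
    return dict(zip(HAPLOTYPES, totals))


def rank_genotypes(table):
    """Order genotype classes by loss per individual, largest first.

    Correlations between haplotype frequency estimates are ignored,
    as in the table caption.
    """
    return sorted(table, key=lambda g: sum(table[g][1]), reverse=True)


case_loss = loss_per_haplotype(CASES)
case_ranking = rank_genotypes(CASES)
print(case_ranking)
print({h: round(v, 2) for h, v in case_loss.items()})
```

The resulting ranking reproduces the row order of the table: individuals with genotype 1HH lose the most information per person and would be the first candidates for phase resolution.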
Discussion
For case-control association studies using haplotypes, it is of great importance to evaluate whether the data set is appropriate for haplotype-based analysis. This step enables us to interpret the results correctly. Therefore, we developed a global relative efficiency measure that is directly based on the test statistic of a case-control study. For testing a subset of haplotypes, s, we proposed a corresponding subset-specific measure.
It has been noted that the extent of LD can differ between the case and control groups in a candidate region [13]. Our study also showed that the uncertainty of data clearly depends on the specific structure of the data used. The relative efficiency values were comparable using an unbalanced data set (the hipROA data) as well as using balanced simulated data sets which supposedly have the same structure as the real data. When the data are not informative enough to conduct haplotype-based analyses, say, when the relative efficiency is ≤ 90%, two options can be considered. One is to select individuals who have the highest potential to increase the power by resolving haplotypes, as discussed in the Results section. The second is to make haplotype blocks [14] smaller until a preset value is reached, whose limit would be the block containing a single SNP.
We did not address here which methods could be used to increase the efficiency of the study. It may be argued that phase resolution by laboratory work is too costly. However, simply genotyping more individuals does not help in resolving phase ambiguity, assuming that the additional cases and controls are selected from populations comparable to those in the original data. For late-onset diseases it would not be possible to obtain samples from parents. However, in the planning stage of some studies, the expected (remaining) information loss after genotyping parents could be calculated to make a balanced decision. In the same way, adding familial information from sibling pairs could be an option. Putter et al. [12] showed that adding a sib increases information by 1/2 compared to adding parents, adding the second sib by (1/2)^{2}, the third sib by (1/2)^{3}, etc. That is, we need 4 or 5 sibs to obtain 90% of the information gained by adding parents.
Our methods are based on the assumption of Hardy-Weinberg equilibrium (HWE) in sample haplotype frequencies, in addition to a multiplicative model. Therefore, our relative efficiency measure would be influenced by departure from HWE. Our T-statistic can be considered a multi-allelic test, which is known to have inflated type 1 error rates when HWE is not satisfied [15, 16]. Satten and Epstein [17] showed that both prospective and retrospective approaches with a multiplicative model are robust to the HWE assumption in the target population. In the same paper, they also showed that the retrospective approach, which we used in our statistic, is superior to the prospective one. When the departure from HWE cannot be ignored, for example when it is caused by inbreeding or population stratification, a variant of our measure based on the retrospective likelihood can be developed using a fixation index.
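The sibship arithmetic quoted from Putter et al. [12] earlier in this section can be checked with a geometric sum. The sketch below assumes one particular reading of that statement, namely that the k-th sib contributes (1/2)^k of the information gained by adding parents and that contributions add up; the function names are introduced here for illustration.

```python
def sib_information_fraction(n_sibs):
    """Fraction of the parental information gain recovered by n_sibs sibs,
    assuming the k-th sib contributes (1/2)**k (cumulative reading)."""
    return sum(0.5 ** k for k in range(1, n_sibs + 1))


def sibs_needed(target):
    """Smallest number of sibs whose cumulative fraction reaches `target`."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must lie in [0, 1): the sum never reaches 1")
    n = 0
    while sib_information_fraction(n) < target:
        n += 1
    return n


for n in range(1, 6):
    print(n, sib_information_fraction(n))
print("sibs needed for 90%:", sibs_needed(0.9))
```

Under this reading three sibs recover 87.5%, four recover 93.75% and five recover about 96.9%, which is consistent with the statement that 4 or 5 sibs are needed to reach 90%.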
Conclusion
To assess the relative efficiency for haplotype testing in a case-control study, we developed methods based on the T-statistic as described in the Methods section. This measure indicates how much information is contained in the data compared to fully phased data for haplotype analysis in case-control studies. We also showed how this measure can be used for optimal selection of the individuals who contribute most to information gain when their phase ambiguity is resolved.
Applying the method to the real data, we obtained a global relative efficiency of 82.3% for haplotype analysis. Focusing on only the two haplotypes that were found to be significantly associated with disease, we obtained 92.6%.
Methods
Quantification of global relative efficiency in a sample
Note that the parameter vector α is not completely identifiable. We first derive all the formulas as if there is no constraint on α, and when necessary we transform them to the appropriate parameter space.
where (·)^{-} denotes the Moore-Penrose generalized inverse [18].
The last expression is obtained by Taylor approximation given that the loss of information is small, and it shows that loss of information will cause an increase in the covariance of the estimates. When we have no ambiguities in the data, ℒ_{i} equals zero, and the covariance becomes simply C/(2n) in (1).
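The Taylor step mentioned above can be written out in generic notation. As an illustrative sketch only, let I_c denote the complete-data information and L the total loss due to ambiguity, so that the observed information is I_c − L. Both symbols are assumptions introduced here and need not match the authors' notation; ordinary inverses are used for simplicity where the text uses generalized inverses.

```latex
\[
(I_c - L)^{-1}
  = \bigl(\mathbf{1} - I_c^{-1} L\bigr)^{-1} I_c^{-1}
  \approx I_c^{-1} + I_c^{-1} L \, I_c^{-1},
\qquad L = \sum_{i} \mathcal{L}_i .
\]
```

The approximation keeps only the first-order term in I_c^{-1} L, so it is reasonable when the loss is small relative to the complete-data information. Since L is non-negative definite, the correction term can only inflate the covariance, and with L = 0 the expression reduces to I_c^{-1}, i.e. C/(2n) in the text.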
I can be computed as I = J^{T}I*(α)J. From now on, the Fisher information as well as covariance matrices are assumed to be properly transformed into an appropriate parameter space.
where |I| denotes the determinant of the matrix, calculated as the product of its non-zero eigenvalues. Note that this measure is invariant to transformation of parameters. High values of the A- and D-optimality-based measures indicate that the data are informative for estimating haplotype frequencies.
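The invariance of a determinant-based ratio can be seen from a one-line calculation. The sketch below assumes that the measure depends on the data only through the ratio of the determinants of the observed and complete-data information matrices, and that the reparametrization has a square, non-singular Jacobian J. The subscripts obs and comp are labels introduced here for illustration.

```latex
\[
\frac{\lvert J^{T} I_{\mathrm{obs}} J \rvert}{\lvert J^{T} I_{\mathrm{comp}} J \rvert}
  = \frac{\lvert J \rvert^{2} \, \lvert I_{\mathrm{obs}} \rvert}
         {\lvert J \rvert^{2} \, \lvert I_{\mathrm{comp}} \rvert}
  = \frac{\lvert I_{\mathrm{obs}} \rvert}{\lvert I_{\mathrm{comp}} \rvert}.
\]
```

The Jacobian factor cancels, so any function of this ratio is unchanged by the transformation. A trace-based ratio does not have this property in general, because the trace of J^{T} I J does not factor in the same way.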
where I_{s,s} is the 2 × 2 matrix with respect to π_{s}. The information content with respect to this subset s amounts to .
Quantification of global relative efficiency in casecontrol studies
and relative efficiency is denoted as
In order to select the most informative individuals in a case-control study, a forward stepwise selection procedure could be employed to maximize the power of the global test T; i.e., it is determined which multilocus genotype combination provides the most information gain when its phase ambiguity is resolved.
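A forward stepwise procedure of this kind can be sketched generically in Python. This is not the authors' R implementation: `forward_select` is a hypothetical helper, the `efficiency` argument stands in for a function that recomputes the relative efficiency after the phase of the listed genotype classes has been resolved, and the numbers in `TOY_GAIN` are made-up values used only to exercise the loop.

```python
def forward_select(candidates, efficiency, max_steps):
    """Greedy forward selection.

    At each step, add the genotype class whose phase resolution yields the
    highest value of `efficiency(resolved_classes)`.
    """
    selected = []
    remaining = list(candidates)
    for _ in range(min(max_steps, len(remaining))):
        best = max(remaining, key=lambda g: efficiency(selected + [g]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up gains per genotype class, purely for illustration.
TOY_GAIN = {"1HH": 0.06, "HHH": 0.04, "H1H": 0.01, "HH1": 0.005}


def toy_efficiency(resolved):
    """Hypothetical stand-in for the relative efficiency of the test T."""
    return sum(TOY_GAIN[g] for g in resolved)


order = forward_select(list(TOY_GAIN), toy_efficiency, max_steps=2)
print(order)
```

With an additive toy objective the greedy order is simply the order of the individual gains. With a real efficiency function that accounts for correlations between haplotype frequency estimates, the gain of a class can depend on which classes were resolved before it, which is the reason for re-evaluating at every step.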
Declarations
Acknowledgements
This paper originates from the GenomEUtwin project, which is supported by the European Union Contract No. QLG2-CT-2002-01254. We thank Dr. Ingrid Meulenbelt for providing us with the Interleukin-1β Gene Cluster Data.
References
1. Stram DO, Haiman JN, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Pike ML: Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Hum Hered. 2003, 55: 27-36. 10.1159/000071807.
2. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70: 425-434. 10.1086/338688.
3. Uh HW, Houwing-Duistermaat JJ, Putter H, van Houwelingen JJC: How to quantify information loss due to phase ambiguity in haplotype case-control studies. BMC Genet. 2005, 6 (Suppl 1): S108. 10.1186/1471-2156-6-S1-S108.
4. Atkinson AC, Donev AN: Optimum Experimental Designs. 1992, Oxford: Oxford Statistical Science Series, 8.
5. Fedorov V: Theory of optimal experiments. 1972, New York: Academic Press.
6. Nicolae DL: Quantifying the amount of missing information in genetic association studies. Genet Epi. 2006, 30: 703-717. 10.1002/gepi.20181.
7. O'Hely M, Slatkin M: The loss of statistical power to distinguish population when certain samples are ambiguous. Theor Pop Biol. 2003, 64: 177-192. 10.1016/S0040-5809(03)00084-4.
8. The R project for statistical computing. [http://www.r-project.org/]
9. Hofman A, Grobbee D, de Jong PT, Ouweland van den FA: Determinants of disease and disability in elderly: the Rotterdam Elderly Study. Eur J Epi. 1991, 7: 403-422. 10.1007/BF00145007.
10. Meulenbelt I, Seymour AB, Nieuwland M, Huizinga TWJ, van Duijn CM, Slagboom PE: Association of the Interleukin-1 gene cluster with radiographic signs of osteoarthritis of the hip. Arthritis & Rheumatism. 2004, 50 (4): 1179-1186. 10.1002/art.20121.
11. Tregouet DA, S E, Tiret L, Mallet A, Golmard JL: A new maximum likelihood algorithm for haplotype-based association analysis: the SEM algorithm. Ann Hum Genet. 2003, 68: 165-177. 10.1046/j.1529-8817.2003.00085.x.
12. Putter H, Meulenbelt I, van Houwelingen JJC: Relative efficiency of haplotype frequency estimation in sibships and nuclear families compared to unrelated individuals. Hum Hered. 2007, 64: 52-62. 10.1159/000101423.
13. Zaykin DV, Meng Z, Ehm MG: Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am J Hum Genet. 2006, 78: 737-746. 10.1086/503710.
14. van Minkelen R, de Visser MC, Houwing-Duistermaat JJ, Vos HL, Bertina RM, Rosendaal FR: Haplotypes of IL1B, IL1RN, and IL1R2 and the risk of venous thrombosis. Arterioscler Thromb Vasc Biol. 2007, 27: 1486-1491. 10.1161/ATVBAHA.107.140384.
15. Zheng G: Can the allelic test be retired from analysis of case-control association studies?. Ann Hum Genet. 2008, 72: 848-851. 10.1111/j.1469-1809.2008.00466.x.
16. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.
17. Satten GA, Epstein MP: Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genet Epi. 2004, 27: 192-201. 10.1002/gepi.20020.
18. Rao CR, Mitra SK: Generalized Inverse of Matrices and Its Applications. 1971, New York: John Wiley & Sons.
19. Louis TA: Finding the observed information matrix when using the EM algorithm. J R Stat Soc. 1982, 44 (2): 226-233.
20. Lehmann EL: Theory of point estimation. 1983, New York: John Wiley & Sons.
21. Minkin S: Optimal Designs for Binary Data. J Amer Stat Assoc. 1987, 82: 1098-1103. 10.2307/2289386.
22. Heise MA, Myers RH: Optimal Designs for Bivariate Logistic Regression. Biometrics. 1996, 52: 613-624. 10.2307/2532900.
23. Cox DR, Hinkley DV: Theoretical Statistics. 1974, London: Chapman and Hall.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.