PCA-based bootstrap confidence interval tests for gene-disease association involving multiple SNPs
 Qianqian Peng^{1},
 Jinghua Zhao^{2} and
 Fuzhong Xue^{1}
DOI: 10.1186/1471-2156-11-6
© Peng et al; licensee BioMed Central Ltd. 2010
Received: 6 December 2008
Accepted: 26 January 2010
Published: 26 January 2010
Abstract
Background
Genetic association studies are currently the primary vehicle for identifying and characterizing disease-predisposing variant(s), and they usually involve multiple single-nucleotide polymorphisms (SNPs). However, SNP-wise association tests raise concerns over multiple testing. Haplotype-based methods have the advantage of being able to account for correlations between neighbouring SNPs, yet their assumption of Hardy-Weinberg equilibrium (HWE) and potentially large number of degrees of freedom can harm their statistical power and robustness. Approaches based on principal component analysis (PCA) are preferable in this regard, but their performance varies with the method of extracting principal components (PCs).
Results
A PCA-based bootstrap confidence interval test (PCA-BCIT), which directly uses the PC scores to assess gene-disease association, was developed and evaluated for three ways of extracting PCs, i.e., from cases only (CAES), from controls only (COES) and from cases and controls combined (CES). Extraction of PCs with COES is preferred to that with CAES and CES. Performance of the test was examined via simulations as well as analyses of data on rheumatoid arthritis and heroin addiction; the test maintained the nominal level under the null hypothesis and showed performance comparable to that of the permutation test.
Conclusions
PCA-BCIT is a valid and powerful method for assessing gene-disease association involving multiple SNPs.
Background
Genetic association studies now customarily involve multiple SNPs in candidate genes or genomic regions and have a significant role in identifying and characterizing disease-predisposing variant(s). A critical challenge in their statistical analysis is how to make optimal use of all available information. Population-based case-control studies have been very popular[1] and typically involve contingency table tests of SNP-disease association[2]. Notably, the genotype-wise Armitage trend test does not require HWE and has equivalent power to its allele-wise counterpart under HWE[3, 4]. A thorny issue with individual tests of SNPs for linkage disequilibrium (LD) in such a setting is multiple testing; however, methods for multiple testing adjustment that assume independence, such as Bonferroni's[5, 6], are known to be conservative[7]. It is therefore necessary to seek alternative approaches which can utilize multiple SNPs simultaneously. The genotype-wise Armitage trend test is appealing since it is equivalent to the score test from logistic regression[8] of case-control status on the dosage of disease-predisposing alleles of a SNP. However, testing for the effects of multiple SNPs simultaneously via logistic regression is no cure for the difficulties of multicollinearity and the curse of dimensionality[9]. Haplotype-based methods have many desirable properties[10] and could possibly alleviate the problem[11–14], but the assumption of HWE is usually required and a potentially large number of degrees of freedom is involved[7, 11, 15–18].
It has recently been proposed that PCA can be combined with the logistic regression test (LRT)[7, 16, 17] in a unified framework, so that PCA is conducted first to account for between-SNP correlations in a candidate region, and LRT is then applied as a formal test of the association between PC scores (linear combinations of the original SNPs) and disease. Since PCs are orthogonal, this avoids multicollinearity and at the same time is less computer-intensive than haplotype-based methods. Studies have shown that PCA-LRT is at least as powerful as genotype- and haplotype-based methods[7, 16, 17]. Nevertheless, the power of PCA-based approaches varies with the way in which PCs are extracted, e.g., from genotype correlation, LD, or other kinds of metrics[17], and in principle PCA can be employed in frameworks other than logistic regression[7, 16, 17]. Here we investigate ways of extracting PCs using the genotype correlation matrix from different types of samples in a case-control study, while presenting a new approach that tests for gene-disease association by direct use of PC scores in a PCA-based bootstrap confidence interval test (PCA-BCIT). We evaluated its performance via simulations and compared it with PCA-LRT and the permutation test using real data.
Methods
PCA
where cov(F_{i}, F_{j}) = 0 for i ≠ j, and var(F_{1}) ≥ var(F_{2}) ≥ ⋯ ≥ var(F_{p}).
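As a minimal illustration of the two properties stated above, the following sketch extracts PCs from a genotype correlation matrix by eigendecomposition and checks that the resulting scores are uncorrelated with non-increasing variances. The 0/1/2 genotype coding, the simulated data and the function name `pca_scores` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def pca_scores(genotypes):
    """Return (scores, loadings, variances) from a PCA of the genotype correlation matrix."""
    # Standardise each SNP column to mean 0 and unit (sample) variance.
    z = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0, ddof=1)
    corr = np.corrcoef(genotypes, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # reorder so var(F_1) >= var(F_2) >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs                          # each F_i is a linear combination of SNPs
    return scores, eigvecs, eigvals

rng = np.random.default_rng(0)
# Hypothetical data: 200 subjects, 5 correlated SNPs coded 0/1/2.
latent = rng.normal(size=(200, 1))
genotypes = np.clip(np.rint(1 + 0.7 * latent + 0.7 * rng.normal(size=(200, 5))), 0, 2)

scores, loadings, variances = pca_scores(genotypes)
cov_f = np.cov(scores, rowvar=False)              # should be diagonal
print(np.round(variances, 3))
print(np.round(cov_f, 3))
```

The off-diagonal entries of `cov_f` vanish up to rounding error, and its diagonal equals the ordered eigenvalues, which is the sample analogue of the stated conditions on cov(F_{i}, F_{j}) and var(F_{i}).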
Methods of extracting PCs
Potentially, PCA can be conducted via four distinct extracting strategies (ES) using case-control data, i.e., 0. Calculate PC scores of individuals in cases and controls separately (SES), 1. Use cases only (CAES) to obtain loadings for calculation of PC scores for subjects in both cases and controls, 2. Use controls only (COES) to obtain the loadings for both groups, and 3. Use combined cases and controls (CES) to obtain the loadings for both groups. In a case-control association study, loadings calculated separately from cases and from controls are likely to have different connotations, and hence we only consider strategies 1-3 hereafter. More formally, let (X_{1}, X_{2}, ⋯, X_{p}) and (Y_{1}, Y_{2}, ⋯, Y_{p}) be p-dimensional vectors of SNPs at a given candidate region for cases and controls, respectively; under each strategy, the loadings obtained from the chosen sample are applied to both groups to give their PC scores.
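The three strategies retained above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: genotypes are coded 0/1/2, both groups are standardised with the mean and standard deviation of the reference sample, and the helper names `loadings_from`, `pc_scores` and `simulate` as well as the simulated data are hypothetical.

```python
import numpy as np

def loadings_from(reference):
    """Mean, SD and eigenvectors (largest eigenvalue first) of a reference sample."""
    mean = reference.mean(axis=0)
    sd = reference.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(reference, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return mean, sd, eigvecs[:, order]

def pc_scores(cases, controls, strategy):
    """PC scores of cases and controls under strategy 1 (CAES), 2 (COES) or 3 (CES)."""
    reference = {1: cases, 2: controls, 3: np.vstack([cases, controls])}[strategy]
    mean, sd, vecs = loadings_from(reference)
    # Assumption: both groups are standardised with the reference sample's mean and SD.
    return ((cases - mean) / sd) @ vecs, ((controls - mean) / sd) @ vecs

def simulate(n, shift, rng, p=5):
    """Hypothetical correlated genotypes coded 0/1/2 (not real data)."""
    latent = rng.normal(size=(n, 1))
    raw = 1 + shift + 0.7 * latent + 0.7 * rng.normal(size=(n, p))
    return np.clip(np.rint(raw), 0, 2)

rng = np.random.default_rng(1)
cases, controls = simulate(150, 0.4, rng), simulate(200, 0.0, rng)
for strategy in (1, 2, 3):
    f_cases, f_controls = pc_scores(cases, controls, strategy)
    print(strategy, f"{f_cases[:, 0].mean():.3f}", f"{f_controls[:, 0].mean():.3f}")
```

Under this standardisation the reference group has mean PC scores of zero by construction (cases under CAES, controls under COES), so any difference between the groups shows up in the scores of the other group. The sign of each eigenvector is arbitrary, so only the relative position of the two groups is meaningful.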
PCA-BCIT
Given a sample of N cases and M controls with p-SNP genotypes (X_{1}, X_{2}, ⋯, X_{N})^{T} and (Y_{1}, Y_{2}, ⋯, Y_{M})^{T}, where X_{i} = (x_{1i}, x_{2i}, ⋯, x_{pi}) for the i^{th} case and Y_{i} = (y_{1i}, y_{2i}, ⋯, y_{pi}) for the i^{th} control, a PCA-BCIT is carried out in three steps:
Step 1: Sampling
Replicate samples of cases and controls, (X_{1}^{(b)}, X_{2}^{(b)}, ⋯, X_{N}^{(b)})^{T} and (Y_{1}^{(b)}, Y_{2}^{(b)}, ⋯, Y_{M}^{(b)})^{T}, b = 1, 2, ⋯, B (B = 1000), are obtained by sampling with replacement separately from the observed cases and controls.
Step 2: PCA
For each replicate sample obtained at Step 1, PCA is conducted under each of the three strategies and PCs are retained using a threshold of 80% explained variance[16], giving PC scores for the replicate cases and for the replicate controls.
Step 3: PCA-BCIT
3a) For cases, the confidence interval at a given level is estimated from the bootstrap replicates, with its lower and upper limits given by percentiles of the replicate distribution.
3b) For controls, the confidence interval is estimated in the same way.
3c) The confidence intervals of cases and controls are compared. The null hypothesis is rejected if the two intervals do not overlap, that is, if cases and controls are statistically different[19], indicating that the candidate region is significantly associated with disease at level α. Otherwise, the candidate region is not significantly associated with disease at level α.
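The three steps can be put together in a short sketch. Several details are assumptions made here because they cannot be read off the text: the statistic is taken to be the mean score on the first PC only (the 80% retention rule is not modelled), the interval limits are taken as the α/2 and 1 − α/2 percentiles, and eigenvector signs are fixed by requiring a non-negative loading sum so that replicates are comparable. The function names and the simulated genotypes are hypothetical.

```python
import numpy as np

def first_pc_means(cases, controls, strategy):
    """Mean first-PC score of cases and of controls under strategy 1, 2 or 3."""
    ref = {1: cases, 2: controls, 3: np.vstack([cases, controls])}[strategy]
    mean, sd = ref.mean(axis=0), ref.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(ref, rowvar=False))
    v = eigvecs[:, np.argmax(eigvals)]
    if v.sum() < 0:          # assumed sign convention so replicates are comparable
        v = -v
    return (((cases - mean) / sd) @ v).mean(), (((controls - mean) / sd) @ v).mean()

def pca_bcit(cases, controls, strategy=2, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for both groups and the non-overlap decision."""
    rng = np.random.default_rng(seed)
    stats = np.empty((n_boot, 2))
    for b in range(n_boot):
        # Step 1: resample cases and controls separately, with replacement.
        xb = cases[rng.integers(0, len(cases), size=len(cases))]
        yb = controls[rng.integers(0, len(controls), size=len(controls))]
        # Step 2: PCA on the replicate (assumes no SNP is monomorphic in it).
        stats[b] = first_pc_means(xb, yb, strategy)
    # Step 3: percentile intervals (assumed alpha/2 and 1 - alpha/2) and comparison.
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    ci_cases = np.percentile(stats[:, 0], q)
    ci_controls = np.percentile(stats[:, 1], q)
    reject = bool(ci_cases[0] > ci_controls[1] or ci_cases[1] < ci_controls[0])
    return ci_cases, ci_controls, reject

def simulate(n, shift, rng, p=5):
    """Hypothetical correlated genotypes coded 0/1/2 (not real data)."""
    latent = rng.normal(size=(n, 1))
    return np.clip(np.rint(1 + shift + 0.7 * latent + 0.7 * rng.normal(size=(n, p))), 0, 2)

rng = np.random.default_rng(2)
cases, controls = simulate(300, 0.8, rng), simulate(300, 0.0, rng)
ci_cases, ci_controls, reject = pca_bcit(cases, controls, strategy=2, n_boot=500)
print(ci_cases, ci_controls, reject)
```

With the standardisation assumed here, the control interval under strategy 2 collapses to values that are zero up to floating-point error, because each replicate is centred on its own controls; the decision then depends on whether the case interval excludes zero.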
Simulation studies
Step 1: Sampling
The observed genotype frequencies in the study sample are taken to be their true frequencies in populations of infinite size. Replicate samples of cases and controls of a given size N (N = 100, 200, ⋯, 1000) are generated, whose estimated genotype frequencies are expected to be close to the true population frequencies while both the allele frequencies and the LD structure are maintained. Under the null hypothesis, replicate cases and controls are both sampled with replacement from the controls. Under the alternative hypothesis, replicate cases and controls are sampled with replacement from the cases and controls, respectively.
Step 2: PCA-BCIT
For each replicate sample, PCA-BCIT is conducted under each of the three strategies of extracting PCs outlined above to test the association between PC scores and disease (RA).
Step 3: Evaluating performance of PCA-BCIT
Repeat steps 1 and 2 K times (K = 1000) under both the null and alternative hypotheses, and obtain the frequencies (P_{α}) of rejecting the null hypothesis at level α (α = 0.05).
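The simulation design above can be sketched as a loop that draws replicate samples under each hypothesis and records how often the null is rejected. To keep the example fast it uses K = 20 and B = 100 instead of 1000, only strategy 2 (COES), and the mean first-PC score as the statistic; these choices, the helper names and the simulated genotype pools are assumptions for illustration, not the study data or the authors' code.

```python
import numpy as np

def bcit_reject(cases, controls, rng, n_boot=100, alpha=0.05):
    """Compact PCA-BCIT under COES (strategy 2), using the mean first-PC score."""
    stats = np.empty((n_boot, 2))
    for b in range(n_boot):
        xb = cases[rng.integers(0, len(cases), size=len(cases))]
        yb = controls[rng.integers(0, len(controls), size=len(controls))]
        mean, sd = yb.mean(axis=0), yb.std(axis=0, ddof=1)
        eigvals, eigvecs = np.linalg.eigh(np.corrcoef(yb, rowvar=False))
        v = eigvecs[:, np.argmax(eigvals)]
        v = v if v.sum() >= 0 else -v            # assumed sign convention
        stats[b] = (((xb - mean) / sd) @ v).mean(), (((yb - mean) / sd) @ v).mean()
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    lo_x, hi_x = np.percentile(stats[:, 0], q)
    lo_y, hi_y = np.percentile(stats[:, 1], q)
    return bool(lo_x > hi_y or hi_x < lo_y)

def rejection_frequency(case_pool, control_pool, n, null, k=20, seed=0):
    """Fraction of k replicate samples of size n in which the null is rejected."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(k):
        # Under the null, replicate cases are also drawn from the controls.
        source = control_pool if null else case_pool
        cases = source[rng.integers(0, len(source), size=n)]
        controls = control_pool[rng.integers(0, len(control_pool), size=n)]
        rejections += bcit_reject(cases, controls, rng)
    return rejections / k

def simulate(n, shift, rng, p=5):
    """Hypothetical correlated genotypes coded 0/1/2 (not real data)."""
    latent = rng.normal(size=(n, 1))
    return np.clip(np.rint(1 + shift + 0.7 * latent + 0.7 * rng.normal(size=(n, p))), 0, 2)

pool_rng = np.random.default_rng(3)
case_pool, control_pool = simulate(500, 0.8, pool_rng), simulate(500, 0.0, pool_rng)
rate_null = rejection_frequency(case_pool, control_pool, n=100, null=True)
rate_alt = rejection_frequency(case_pool, control_pool, n=100, null=False)
print("type I error estimate:", rate_null, "power estimate:", rate_alt)
```

With so few replicates the two frequencies are only rough estimates; raising `k` and `n_boot` towards the values used in the study narrows the Monte Carlo error at a proportional cost in run time.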
Applications
Results
Simulation study
Performance of PCA-BCIT at level 0.05 with strategies 1-3†
             Type I error           Power
Sample size  1      2      3        1      2      3
100          0.014  0.036  0.037    0.156  0.163  0.176
200          0.016  0.044  0.036    0.249  0.278  0.292
300          0.017  0.028  0.029    0.383  0.426  0.368
400          0.014  0.04   0.02     0.508  0.485  0.516
500          0.009  0.035  0.042    0.613  0.595  0.597
600          0.006  0.032  0.042    0.677  0.662  0.683
700          0.007  0.061  0.04     0.733  0.758  0.73
800          0.004  0.043  0.045    0.801  0.791  0.819
900          0.005  0.057  0.051    0.826  0.855  0.858
1000         0.01   0.056  0.05     0.871  0.901  0.889
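One way to read the type I error columns is to compare them with the Monte Carlo error expected when the true rejection rate is exactly 0.05 and K = 1000 replicates are used. The sketch below does this with a normal approximation to the binomial; the approximation and the 95% band are assumptions added here for illustration and are not part of the original analysis.

```python
import math

K = 1000            # replicates per setting, as in the simulation design
alpha = 0.05
se = math.sqrt(alpha * (1 - alpha) / K)
lower, upper = alpha - 1.96 * se, alpha + 1.96 * se

# Type I error rates copied from the table, by strategy (N = 100, ..., 1000).
type1 = {
    1: [0.014, 0.016, 0.017, 0.014, 0.009, 0.006, 0.007, 0.004, 0.005, 0.01],
    2: [0.036, 0.044, 0.028, 0.04, 0.035, 0.032, 0.061, 0.043, 0.057, 0.056],
    3: [0.037, 0.036, 0.029, 0.02, 0.042, 0.042, 0.04, 0.045, 0.051, 0.05],
}
within = {s: sum(lower <= v <= upper for v in vals) for s, vals in type1.items()}
print(f"approximate 95% band around 0.05: ({lower:.3f}, {upper:.3f})")
print("settings inside the band, by strategy:", within)
```

All ten rates for strategy 1 lie below this band, which indicates a markedly conservative test, whereas the rates for strategies 2 and 3 lie at or near the band, mostly on its lower side.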
Applications
Armitage trend test on nine PTPN22 SNPs and RA susceptibility
                      Female                    Male
SNP         Genotype  Case  Control  P value    Case  Control  P value
rs971173    CC        334   381      0.025      116   169      0.779
            AC        236   363                 85    134
            AA        71    106                 26    39
rs1217390   AA        268   319      0.333      99    112      0.108
            AG        272   392                 89    175
            GG        98    138                 38    55
rs878129    GG        338   507      0.009      131   187      0.384
            AG        251   291                 83    130
            AA        52    54                  13    25
rs11811771  AA        224   272      0.090      78    111      0.717
            AG        303   411                 104   168
            GG        112   169                 45    62
rs11102703  CC        312   469      0.024      121   174      0.418
            AC        269   314                 90    137
            AA        60    69                  16    31
rs7545038   GG        321   428      0.696      109   186      0.417
            AG        265   342                 98    114
            AA        52    80                  20    40
rs1503832   AA        324   487      0.013      129   185      0.249
            AG        262   306                 86    127
            GG        55    59                  12    30
rs12127377  AA        349   521      0.017      139   197      0.230
            AG        243   282                 78    121
            GG        49    48                  10    24
rs11485101  AA        564   738      0.656      206   305      0.430
            AG        72    112                 21    35
            GG        5     2                   0     2
PCA-BCIT, PCA-LRT and permutation test on real data
Study   Strategy†  99% CI                              95% CI                             P-value‡ (PCA-LRT)  P-value‡ (permutation test)
PTPN22  2          (5.4E01,4.7E03)** (7.5E16,6.9E16)   (4.8E01,8.6E02)* (4.6E16,4.2E16)   0.006**             0.002**
        3          (1.7E02,3.3E01)** (2.5E01,1.3E02)   (4.9E02,3.0E01)* (2.2E01,3.7E02)   0.007**             0.002**
OPRM1   2          (1.2E+00,1.1E02)** (4.7E16,5.0E16)  (1.1E+00,1.8E01)* (3.7E16,3.4E16)  0.107               0.002**
        3          (5.3E02,1.4E+00)** (4.9E01,1.7E02)  (2.4E01,1.2E+00)* (4.2E01,8.0E02)  0.012*              0.004**
Sample characteristics of heroin-induced positive responses on first use
                                Cases (N = 91)  Controls (N = 245)  P-value
Age (yrs)                       30.42 ± 7.65    30.93 ± 8.18        0.6057
Women (%)                       26.4            29.8                0.5384
Age at onset (yrs)              26.29 ± 7.41    26.97 ± 7.89        0.4760
Reason for first use of heroin                                      0.7173
  Curiosity                     79.1            75.1
  Peer pressure                 6.6             4.9
  Physical disease              7.7             10.2
  Trouble                       5.5             6.1
  Other reasons                 1.1             3.8
Armitage trend tests on nine OPRM1 SNPs and heroin-induced positive responses on first use
                      Cases             Controls          Armitage trend test
SNP         Genotype  Count  Frequency  Count  Frequency  Chi-square  P value
rs1799971   AA        55     0.604      150    0.622      0.003       0.9537
            AG        27     0.297      64     0.266
            GG        9      0.099      24     0.112
rs510769    TT        56     0.667      167    0.749      2.744       0.0976
            TC        24     0.286      53     0.237
            CC        4      0.048      4      0.018
rs696522    AA        64     0.762      215    0.907      11.097      0.0009*
            AG        19     0.226      21     0.089
            GG        1      0.012      1      0.004
rs1381376   CC        70     0.769      221    0.913      13.409      0.0003*
            CT        20     0.220      21     0.087
            TT        1      0.011      0      0.000
rs3778151   GG        66     0.733      215    0.896      14.655      0.0001*
            GA        23     0.256      25     0.104
            AA        1      0.011      0      0.000
rs2075572   GG        50     0.556      149    0.642      1.574       0.2096
            GC        33     0.367      82     0.353
            CC        7      0.078      11     0.047
rs533586    TT        68     0.840      203    0.868      0.761       0.3830
            TC        12     0.148      31     0.132
            CC        1      0.012      0      0.000
rs550014    TT        78     0.857      203    0.832      0.093       0.7602
            TC        12     0.132      41     0.168
            CC        1      0.011      0      0.000
rs658156    GG        65     0.714      192    0.787      2.041       0.1531
            GA        24     0.264      52     0.213
            AA        1      0.011      0      0.000
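For readers who wish to check entries of this table, the Armitage trend statistic can be computed directly from the genotype counts. The sketch below uses the standard Cochran-Armitage form with additive weights (0, 1, 2) and a chi-square reference distribution with one degree of freedom; the weights and the function name `armitage_trend` are assumptions for illustration, since the text does not spell out the exact variant used.

```python
import math

def armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) and P value from genotype counts."""
    r_total, s_total = sum(cases), sum(controls)
    n_total = r_total + s_total
    n = [r + s for r, s in zip(cases, controls)]          # genotype column totals
    # Trend statistic: weighted difference between case and control counts.
    t = sum(w * (r * s_total - s * r_total)
            for w, r, s in zip(weights, cases, controls))
    # Variance of the trend statistic under the null hypothesis.
    var = (r_total * s_total / n_total) * (
        sum(w * w * ni * (n_total - ni) for w, ni in zip(weights, n))
        - 2 * sum(weights[i] * weights[j] * n[i] * n[j]
                  for i in range(len(n)) for j in range(i + 1, len(n)))
    )
    chi2 = t * t / var
    p_value = math.erfc(math.sqrt(chi2 / 2))               # upper tail, chi-square 1 df
    return chi2, p_value

# Counts for rs696522 (AA, AG, GG) taken from the table above.
chi2, p_value = armitage_trend(cases=[64, 19, 1], controls=[215, 21, 1])
print(f"chi-square = {chi2:.3f}, P = {p_value:.4f}")
```

Applied to the rs696522 counts this gives a chi-square of about 11.097, in agreement with the tabulated value, and the same function gives about 13.409 for rs1381376.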
Discussion
In this study, a PCA-based bootstrap confidence interval test[19, 26–28] (PCA-BCIT) is developed to study gene-disease association using all SNPs genotyped in a given region. There are several attractive features of PCA-based approaches. First, they are at least as powerful as genotype- and haplotype-based methods[7, 16, 17]. Second, they are able to capture LD information between correlated SNPs and are easy to compute, without the need to consider multicollinearity and multiple testing. Third, BCIT integrates point estimation and hypothesis testing into a single inferential statement of great intuitive appeal[29] and does not rely on distributional assumptions about the statistic used to calculate the confidence interval[19, 26–29].
While there are several different but closely related forms of bootstrap confidence interval calculation[28], we focus on percentiles of the asymptotic distribution of PCs for given confidence levels to estimate the confidence interval. PCA-BCIT is a data-learning method[29], and was shown in our study to be valid and powerful for a sufficiently large number of replicates. Our investigation of three strategies of extracting PCs reveals that strategy 1 is invalid, while strategies 2 and 3 are acceptable. From the analyses of real data we find that PCA-BCIT compares favourably with PCA-LRT and the permutation test. A practical advantage of PCA-BCIT is that it offers an intuitive measure of the difference between cases and controls using the set of SNPs (PC scores) in a candidate region (Figure 3). As extraction of PCs through COES is more in line with the principle of a case-control study, it will be our method of choice given that it performs comparably to CES. Nevertheless, PCA-BCIT has the limitation that it does not directly handle covariates, as is usually done in a regression model.
Conclusions
PCA-BCIT is both a valid and a powerful PCA-based method which captures multi-SNP information in the study of gene-disease association. While extraction of PCs based on CAES, COES and CES all performed well, it appears that COES is more appropriate to use.
Abbreviations
SNP: single nucleotide polymorphism
HWE: Hardy-Weinberg equilibrium
LD: linkage disequilibrium
LRT: logistic regression test
PCA: principal component analysis
PC: principal component
ES: extracting strategy
SES: separate case and control extracting strategy (strategy 0)
CAES: case-based extracting strategy (strategy 1)
COES: control-based extracting strategy (strategy 2)
CES: combined case and control extracting strategy (strategy 3)
BCIT: bootstrap confidence interval test
Declarations
Acknowledgements
This work was supported by a grant from the National Natural Science Foundation of China (30871392). We wish to thank Dr. Dandan Zhang (Fudan University) and NARAC for supplying us with the data, and the Associate Editor and anonymous referees for comments which greatly improved the manuscript. Special thanks go to the referee for the insightful comment that extraction of PCs with controls is in line with the case-control principle.
Authors’ Affiliations
References
1. Morton NE, Collins A: Tests and estimates of allelic association in complex inheritance. Proc Natl Acad Sci USA. 1998, 95: 11389-11393. 10.1073/pnas.95.19.11389.
2. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.
3. Gordon D, Haynes C, Yang Y, Kramer PL, Finch SJ: Linear trend tests for case-control genetic association that incorporate random phenotype and genotype misclassification error. Genet Epidemiol. 2007, 31: 853-870. 10.1002/gepi.20246.
4. Slager SL, Schaid DJ: Case-control studies of genetic markers: Power and sample size approximations for Armitage's test for trend. Human Heredity. 2001, 52: 149-153. 10.1159/000053370.
5. Sidak Z: On Multivariate Normal Probabilities of Rectangles: Their Dependence on Correlations. The Annals of Mathematical Statistics. 1968, 39: 1425-1434.
6. Sidak Z: On Probabilities of Rectangles in Multivariate Student Distributions: Their Dependence on Correlations. The Annals of Mathematical Statistics. 1971, 42: 169-175. 10.1214/aoms/1177693504.
7. Zhang FY, Wagener D: An approach to incorporate linkage disequilibrium structure into genomic association analysis. Journal of Genetics and Genomics. 2008, 35: 381-385. 10.1016/S1673-8527(08)60055-7.
8. Balding DJ: A tutorial on statistical methods for population association studies. Nature Reviews Genetics. 2006, 7: 781-791. 10.1038/nrg1916.
9. Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN: Nonparametric tests of association of multiple genes with human disease. American Journal of Human Genetics. 2005, 76: 780-793. 10.1086/429838.
10. Becker T, Schumacher J, Cichon S, Baur MP, Knapp M: Haplotype interaction analysis of unlinked regions. Genetic Epidemiology. 2005, 29: 313-322. 10.1002/gepi.20096.
11. Chapman JM, Cooper JD, Todd JA, Clayton DG: Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity. 2003, 56: 18-31. 10.1159/000073729.
12. Epstein MP, Satten GA: Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics. 2003, 73: 1316-1329. 10.1086/380204.
13. Fallin D, Cohen A, Essioux L, Chumakov I, Blumenfeld M, Cohen D, Schork NJ: Genetic analysis of case/control data using estimated haplotype frequencies: Application to APOE locus variation and Alzheimer's disease. Genome Research. 2001, 11: 143-151. 10.1101/gr.148401.
14. Stram DO, Pearce CL, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC: Modeling and EM estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Human Heredity. 2003, 55: 179-190. 10.1159/000073202.
15. Clayton D, Chapman J, Cooper J: Use of unphased multilocus genotype data in indirect association studies. Genetic Epidemiology. 2004, 27: 415-428. 10.1002/gepi.20032.
16. Gauderman WJ, Murcray C, Gilliland F, Conti DV: Testing association between disease and multiple SNPs in a candidate gene. Genetic Epidemiology. 2007, 31: 383-395. 10.1002/gepi.20219.
17. Oh S, Park T: Association tests based on the principal-component analysis. BMC Proc. 2007, 1 (Suppl 1): S130. 10.1186/1753-6561-1-s1-s130.
18. Wang T, Elston RC: Improved power by use of a weighted score test for linkage disequilibrium mapping. American Journal of Human Genetics. 2007, 80: 353-360. 10.1086/511312.
19. Heller G, Venkatraman ES: Resampling procedures to compare two survival distributions in the presence of right-censored data. Biometrics. 1996, 52: 1204-1213. 10.2307/2532836.
20. Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LRL, et al: TRAF1-C5 as a risk locus for rheumatoid arthritis - a genomewide study. New England Journal of Medicine. 2007, 357: 1199-1209. 10.1056/NEJMoa073491.
21. Begovich AB, Carlton VE, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC, Ardlie KG, Huang Q, Smith AM, Spoerke JM, et al: A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet. 2004, 75: 330-337. 10.1086/422827.
22. Carlton VEH, Hu XL, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC, Chang M, Catanese JJ, Leong DU, Ardlie KG, et al: PTPN22 genetic variation: Evidence for multiple variants associated with rheumatoid arthritis. American Journal of Human Genetics. 2005, 77: 567-581. 10.1086/468189.
23. Kallberg H, Padyukov L, Plenge RM, Ronnelid J, Gregersen PK, Helm-van Mil van der AHM, Toes REM, Huizinga TW, Klareskog L, Alfredsson L, et al: Gene-gene and gene-environment interactions involving HLA-DRB1, PTPN22, and smoking in two subsets of rheumatoid arthritis. American Journal of Human Genetics. 2007, 80: 867-875. 10.1086/516736.
24. Plenge RM, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW, Wolfe F, Kastner DL, Alfredsson L, Altshuler D, et al: Replication of putative candidate-gene associations with rheumatoid arthritis in > 4,000 samples from North America and Sweden: Association of susceptibility with PTPN22, CTLA4, and PADI4. American Journal of Human Genetics. 2005, 77: 1044-1060. 10.1086/498651.
25. Zhang D, Shao C, Shao M, Yan P, Wang Y, Liu Y, Liu W, Lin T, Xie Y, Zhao Y, et al: Effect of mu-opioid receptor gene polymorphisms on heroin-induced subjective responses in a Chinese population. Biol Psychiatry. 2007, 61: 1244-1251. 10.1016/j.biopsych.2006.07.012.
26. Carpenter J: Test Inversion Bootstrap Confidence Intervals. Journal of the Royal Statistical Society Series B (Statistical Methodology). 1999, 61: 159-172. 10.1111/1467-9868.00169.
27. Davison AC, Hinkley DV, Young GA: Recent developments in bootstrap methodology. Statistical Science. 2003, 18: 141-157. 10.1214/ss/1063994969.
28. DiCiccio TJ, Efron B: Bootstrap confidence intervals. Statistical Science. 1996, 11: 189-212. 10.1214/ss/1032280214.
29. Efron B: Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics. 1979, 7: 1-26. 10.1214/aos/1176344552.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.