Choice of population structure informative principal components for adjustment in a case-control study
© Peloso and Lunetta; licensee BioMed Central Ltd. 2011
Received: 24 March 2011
Accepted: 19 July 2011
Published: 19 July 2011
There are many ways to perform adjustment for population structure. It remains unclear what the optimal approach is and whether the optimal approach varies by the type of samples and substructure present. The simplest and most straightforward approach is to adjust for the continuous principal components (PCs) that capture ancestry. Through simulation, we explored the issue of which ancestry informative PCs should be adjusted for in an association model to control for the confounding nature of population structure while maintaining maximum power. A thorough examination of selecting PCs for adjustment in a case-control study across the possible structure scenarios that could occur in a genome-wide association study has not been previously reported.
We found that when the SNP and phenotype frequencies do not vary over the sub-populations, all methods of selection provided similar power and appropriate Type I error for association. When the SNP is not structured and the phenotype has large structure, then selection methods that do not select PCs for inclusion as covariates generally provide the most power. When there is a structured SNP and a non-structured phenotype, selection methods that include PCs in the model have greater power. When both the SNP and the phenotype are structured, all methods of selection have similar power.
Standard practice is to include a fixed number of PCs in genome-wide association studies. Based on our findings, we conclude that if power is not a concern, then selecting the same set of top PCs for adjustment for all SNPs in logistic regression is a strategy that achieves appropriate Type I error. However, standard practice is not optimal in all scenarios and to optimize power for structured SNPs in the presence of unstructured phenotypes, PCs that are associated with the tested SNP should be included in the logistic model.
The principal components (PCs) of genome-wide genotype data can be used to detect and adjust for population structure in genetic association analyses [1, 2]. The popularity of the PC method is evident from its wide use: it has been cited by over 400 publications. However, which PCs to use, and how best to adjust for them in analyses of dichotomous traits, are not yet clear.
Methods of ancestry informative PC selection.
No PC adjustment (None)
10% Rule (10% Rule)
PCs significantly related to the outcome at significance level α = 0.001, 0.01, or 0.05 (Sig001, Sig01, Sig05, respectively)
PCs significantly related to the SNP at α = 0.01 or 0.05 (SNP01, SNP05, respectively)
PCs significant according to the Tracy-Widom statistic at α = 0.05 (TW)
Top PCs (2 or 10) determined according to eigenvalue (Top2, Top10, respectively)
Simulated true population, i.e., gold standard (Pop)
Scenarios of population structure that could occur across the genome.
Phenotype: structured (K1 ≠ K2) or non-structured (K1 = K2)
SNP: structured (p1 ≠ p2) or non-structured (p1 = p2)
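The outcome-based and SNP-based selection rules in the list of methods can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, each PC is screened with a Pearson correlation rather than a regression model, and the p-value uses a large-sample normal approximation. It also assumes that neither the PC nor the target is constant.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_pcs(pcs, target, alpha):
    """Return 1-based indices of PCs associated with `target` at level alpha.

    `target` may be SNP genotypes (0/1/2) or case status (0/1).
    Assumption: two-sided test of zero correlation, normal approximation.
    """
    n = len(target)
    selected = []
    for k, pc in enumerate(pcs, start=1):
        r = pearson_r(pc, target)
        if abs(r) >= 1.0:
            p_value = 0.0  # perfect correlation
        else:
            z = r * math.sqrt((n - 2) / (1.0 - r * r))
            p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        if p_value < alpha:
            selected.append(k)
    return selected

# Hypothetical data: PC1 separates two groups, PC2 is unrelated noise-like pattern.
pc1 = [i - 99.5 for i in range(200)]
pc2 = [(-1.0) ** i for i in range(200)]
structured_snp = [0 if i < 100 else 2 for i in range(200)]
print(select_pcs([pc1, pc2], structured_snp, alpha=0.01))
```

Passing case status instead of genotypes as `target` gives the analogue of the Sig001, Sig01, and Sig05 rules; passing the test SNP gives the analogue of SNP01 and SNP05.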
We tested for association with the simulated case/control outcome using logistic regression and compared Type I error and power of associations between the outcome and SNP when adjusting for selected principal components of ancestry. Finally, to provide a practical example of the methods of PC selection, we performed all methods of PC selection using dichotomized height data from the Framingham Heart Study.
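To make the association model concrete, the following is a minimal sketch of logistic regression of case status on a SNP plus covariates such as PCs, fitted by Newton-Raphson. It is an illustration only, not the software used in the study: the function names are hypothetical, only coefficient estimates are returned (no standard errors or tests), and there is no safeguard against separation or non-convergence.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_logistic(rows, y, n_iter=25):
    """Fit logit P(y = 1) = b0 + b1*x1 + ...; returns [b0, b1, ...].

    rows: one list of covariates per individual, e.g. [snp, pc1, pc2].
    y:    0/1 case status.
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for a in range(k):
                grad[a] += (yi - mu) * xi[a]
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        step = solve(hess, grad)  # Newton step: (X'WX)^-1 X'(y - mu)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Hypothetical check: one binary covariate, so the slope is the log odds ratio.
rows = [[0]] * 40 + [[1]] * 40
y = [1] * 10 + [0] * 30 + [1] * 20 + [0] * 20
beta = fit_logistic(rows, y)
```

In this check the fitted slope equals the log odds ratio of the 2x2 table, log(3). With rows of the form `[snp, pc1, pc2, ...]`, the SNP effect is `beta[1]`.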
We simulated independent genome-wide SNPs by generating ancestral population allele frequencies for 10,000 SNPs (pj, j = 1,...,10000) from a Uniform(0.05, 0.50) distribution. We then created two sub-populations (i = 1,2) of 500 individuals, each descending from the ancestral population according to Fst. We simulated the allele frequencies pij (i = 1,2; j = 1,...,10000) in the two sub-populations according to a beta distribution.
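The beta distribution is not written out in this text. A common choice, and the one assumed in this sketch, is the Balding-Nichols parameterization (see the Balding and Nichols entry in the reference list), under which each sub-population frequency is centred on the ancestral frequency with variance proportional to Fst:

```latex
p_{ij} \mid p_j \sim \mathrm{Beta}\!\left(\frac{1-F_{ST}}{F_{ST}}\,p_j,\; \frac{1-F_{ST}}{F_{ST}}\,(1-p_j)\right),
\qquad i = 1,2;\quad j = 1,\dots,10000,

\mathrm{E}\!\left[p_{ij} \mid p_j\right] = p_j,
\qquad
\mathrm{Var}\!\left[p_{ij} \mid p_j\right] = F_{ST}\,p_j\,(1-p_j).
```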
Fst is a measure of population differentiation; Fst = 0.01 is representative of human population structure seen within continents, while Fst = 0.1 is representative of structure seen between continents [6, 7]. We simulated our samples using Fst values of 0.01 and 0.1. We generated 1,000 replicates of independent genome-wide SNP data for 2 populations of 500 individuals.
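The simulation of sub-population allele frequencies and genotypes described above might be sketched as follows. Several details are assumptions rather than statements from the text: the beta distribution is taken to be the Balding-Nichols form Beta(p(1 - Fst)/Fst, (1 - p)(1 - Fst)/Fst), genotypes are drawn as two independent alleles (Hardy-Weinberg proportions), and the function names are hypothetical.

```python
import random

def simulate_subpop_freqs(n_snps, fst, n_subpops=2, rng=None):
    """Draw ancestral and sub-population allele frequencies.

    Ancestral frequencies are Uniform(0.05, 0.50), as in the text.
    Assumption: sub-population frequencies follow the Balding-Nichols
    beta distribution with mean p and variance fst * p * (1 - p).
    """
    rng = rng or random.Random()
    ancestral = [rng.uniform(0.05, 0.50) for _ in range(n_snps)]
    scale = (1.0 - fst) / fst
    subpops = [
        [rng.betavariate(p * scale, (1.0 - p) * scale) for p in ancestral]
        for _ in range(n_subpops)
    ]
    return ancestral, subpops

def simulate_genotypes(freqs, n_individuals, rng=None):
    """Draw 0/1/2 genotypes per SNP; assumes two independent alleles."""
    rng = rng or random.Random()
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_individuals)
    ]

# Small hypothetical run: 200 SNPs, Fst = 0.01, 5 individuals per sub-population.
rng = random.Random(2011)
ancestral, subpops = simulate_subpop_freqs(200, 0.01, rng=rng)
genotypes = [simulate_genotypes(freqs, 5, rng=rng) for freqs in subpops]
```

The study's dimensions correspond to `n_snps=10000` and `n_individuals=500` per sub-population, with `fst` set to 0.01 or 0.1.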
Population differentiation (Fst): 0.01, 0.1
Population prevalence of disease (K)
Frequency of disease in sub-population 1 (K1): 0.10 to 0.19
Number of cases in sub-population 1: 250 to 475
Overall risk allele frequency (p)
Risk allele frequency in sub-population 1 (p1): 0.10 to 0.30
Odds ratio (log-additive model): 1.0, 1.2, 1.5
Relationship between number of cases and population prevalence of disease in each sub-population (columns: number of cases in population 1; number of cases in population 2; fold increase in number of cases).
We compared the methods for selecting PCs for adjustment described above. We used logistic regression to test the association between the test SNP and case status adjusting for latent ancestry defined by the PCs. We compared the proportion of replicates significant at α = 0.05, using 1,000 replicates for each set of parameters to investigate Type I error and power. 1,000 replicates for Type I error provides a 95% confidence interval of 0.036 to 0.064 around a nominal significance level of 0.05.
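The quoted interval follows from the binomial standard error of a proportion estimated from 1,000 replicates. A short check, using the usual normal approximation (the function name is hypothetical):

```python
import math

def monte_carlo_interval(alpha=0.05, n_replicates=1000, z=1.96):
    """Normal-approximation 95% interval for an estimated rejection rate."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_replicates)
    return alpha - z * se, alpha + z * se

lower, upper = monte_carlo_interval()
print(round(lower, 3), round(upper, 3))  # 0.036 0.064
```

An empirical Type I error outside this range is therefore unlikely to be due to Monte Carlo error alone.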
To determine the effects of the PC selection methods when a sample is composed of a more complex structure, we simulated two populations that each diverged with an Fst of 0.01 from an ancestral population, as previously. We treated these two subpopulations as ancestral populations and then simulated two subpopulations diverging from each of the ancestral populations, again with an Fst of 0.01. The resulting sample had four sub-populations. Due to computational limitations, a single replicate of independent genome-wide SNP data was generated for this scenario for PCA. As before, 1,000 replicates were used to evaluate Type I error and power, simulating the genotype and phenotype (conditional on genotype for power) for each replicate. We varied the phenotypic and genotypic structure of the sub-populations, from having no structure to more extreme structure.
Results and Discussion
We next expanded this simulation to larger differences in allele frequencies between the two sub-populations. With large Fst (0.10), we expect many SNPs with greater than a 0.2 difference in allele frequency between sub-populations, and thus we believe it may be more common to observe large allele frequency differences between populations than large phenotypic differences between populations. We found that as the difference in the risk allele frequency between the two sub-populations increases, the difference in power between adjusting and not adjusting for PCs becomes greater (see additional file 1) when a non-structured phenotype and structured SNP were tested for association. Consistent with our previous results, we found that when both the phenotype and the SNP are structured, any method of selection was adequate. When the SNP is non-structured (p1 = p2 = 0.5) and the phenotype has large structure (phenotypic ratio = 4), we found slightly higher power when selecting PCs by the 10% rule compared to not selecting any PCs, contrary to the findings presented in Figure 2. As seen in additional file 1, the difference in Type I error between using the 10% rule method versus no PC selection method is similar to the difference in power. The difference between the two analyses is that PC1 is included in 16% of the replicates for the 10% rule.
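The expectation of many large frequency differences at Fst = 0.10 can be checked with a short calculation. It assumes the Balding-Nichols beta model for sub-population frequencies (an assumption, since the distribution is not written out in this text), under which each sub-population frequency has variance Fst p(1 - p) around the ancestral frequency p, and the two sub-populations are drawn independently:

```latex
\mathrm{Var}\!\left[p_{1j} - p_{2j} \mid p_j\right] = 2\,F_{ST}\,p_j\,(1-p_j)

\text{For } p_j = 0.5:\quad
\mathrm{SD} = \sqrt{2 \times 0.10 \times 0.25} = \sqrt{0.05} \approx 0.224 \quad (F_{ST} = 0.10),
\qquad
\mathrm{SD} = \sqrt{2 \times 0.01 \times 0.25} = \sqrt{0.005} \approx 0.071 \quad (F_{ST} = 0.01).
```

Under this model a difference of 0.2 is less than one standard deviation when Fst = 0.10, but nearly three standard deviations when Fst = 0.01.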
Finally, we increased the sample size for the genome-wide SNPs simulation to 5,000 individuals from each sub-population. Increasing the sample size allowed us to determine if our observed results were affected by the simulated sample size of 500 cases and 500 controls. We found the same patterns with the larger sample size as we did when we used the 500 individuals from each sub-population (results not shown).
In general, we observed similar patterns when the data consisted of four sub-populations as with the two sub-populations scenario already presented (see additional file 2). When the SNP and phenotype frequencies do not vary over the sub-populations, all methods of selection provided similar power and appropriate Type I error for association. When the SNP is not structured and the phenotype has large structure, then selection methods that do not select PCs for inclusion as covariates provide the most power. Likewise, when there is a structured SNP and a non-structured phenotype, selection methods that include PCs in the model have greater power. When both the SNP and the phenotype are structured, all methods of selection have similar power, except when the SNP differs between the 4 sub-populations and only the top 2 PCs are selected for adjustment. In this case, we observe elevated Type I error and loss of power because three PCs are required to distinguish 4 subpopulations, and only 2 PCs are included in the model. The loss of power when we do not adjust fully for the population structure is due to an attenuation of the effect due to the population structure (negative confounding). Positive confounding occurs when the structure is in the same direction as the true genetic effect, that is, when the phenotypic means and risk allele frequencies are positively correlated. Negative confounding occurs when the structure is in the opposite direction to the true genetic effect, that is, when the phenotypic means and risk allele frequencies are negatively correlated. While we only report results here based on negative confounding, simulations with positive confounding yielded similar conclusions (see additional file 3).
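The direction of confounding can be illustrated with a population-level calculation in which the SNP has no effect within either sub-population. The sketch below pools two sub-populations and computes the allelic odds ratio that an unadjusted analysis would target. The function name, the equal mixing weights, and the numeric prevalences and frequencies are hypothetical, and the calculation ignores case-control sampling.

```python
def pooled_allele_odds_ratio(k1, k2, p1, p2, w1=0.5):
    """Unadjusted allelic odds ratio after pooling two sub-populations.

    k1, k2: disease prevalence in each sub-population.
    p1, p2: risk allele frequency in each sub-population.
    w1:     share of the pooled population from sub-population 1.
    Assumption: disease and allele are independent within each sub-population,
    so any departure from 1 is due to structure alone.
    """
    w2 = 1.0 - w1
    case_risk = w1 * k1 * p1 + w2 * k2 * p2
    case_other = w1 * k1 * (1 - p1) + w2 * k2 * (1 - p2)
    ctrl_risk = w1 * (1 - k1) * p1 + w2 * (1 - k2) * p2
    ctrl_other = w1 * (1 - k1) * (1 - p1) + w2 * (1 - k2) * (1 - p2)
    return (case_risk * ctrl_other) / (case_other * ctrl_risk)

# Only the phenotype is structured: no bias.
print(pooled_allele_odds_ratio(0.19, 0.01, 0.20, 0.20))
# Both structured, prevalence and allele frequency move together: bias above 1.
print(pooled_allele_odds_ratio(0.19, 0.01, 0.30, 0.10))
# Both structured, moving in opposite directions: bias below 1.
print(pooled_allele_odds_ratio(0.19, 0.01, 0.10, 0.30))
```

When only one of the SNP and the phenotype is structured, the pooled odds ratio stays at 1, so adjustment in those scenarios affects power rather than Type I error.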
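Scenarios that could occur across the genome with the optimal method of selection.
Structured SNP (p1 ≠ p2), structured phenotype (K1 ≠ K2): Any method of selection except no PC adjustment
Structured SNP (p1 ≠ p2), non-structured phenotype (K1 = K2): Selecting a fixed number of PCs or PCs associated with the SNP
Non-structured SNP (p1 = p2), structured phenotype (K1 ≠ K2): Selecting PCs associated with the SNP (α = 0.01), 10% Rule, no PC adjustment
Non-structured SNP (p1 = p2), non-structured phenotype (K1 = K2): Any method of selection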
Example of Principal Component Selection Criteria with Height
Average adult height is taller in northern Europe than in southern Europe. By our definition, height is a structured phenotype, i.e., it varies by ancestry. Lactose intolerance also varies across Europe from North to South. The genetic polymorphism in the LCT (Lactase) gene that causes lactose intolerance, and the SNPs in LD with this polymorphism, appear to be associated with height in non-homogeneous samples of individuals of European descent. Observing an LCT-height association in a sample indicates the sample has population structure. As an example of using the various methods of PC selection in practice, we investigated the association between height and four SNPs in the Framingham Heart Study, adjusting for selected PCs. Because our interest is in dichotomous outcomes, we dichotomized height by the median for this example, and used logistic regression to test for association between four SNPs and dichotomized height:
rs1042725 and rs6060369: Two positive control SNPs not in the LCT gene that are known to be associated with height. Both SNPs are associated with PC1 with p-values of 9.67E-08 and 0.0003, respectively.
rs2322659 : A structured SNP in the lactase gene which is known to vary in frequency among European Americans. This SNP is highly associated with PC1 (p-value = 3.8E-73).
rs2290305: a non-structured SNP, not associated with PC1 (p-value = 0.425).
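Dichotomizing a continuous trait at the median, as done for height in this example, can be sketched as below. The heights are made-up values, the function name is hypothetical, and assigning values equal to the median to the lower group is an assumption; the text does not say how ties were handled or whether the split was made within sex.

```python
from statistics import median

def dichotomize_by_median(values):
    """Code each value as 1 if strictly above the sample median, else 0."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]

heights_cm = [158.0, 162.5, 170.1, 175.3, 181.0, 168.4]  # hypothetical data
tall = dichotomize_by_median(heights_cm)
print(tall)  # [0, 0, 1, 1, 1, 0]
```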
Height association results with methods for selecting PCs.
beta (p-value) of SNP
Positive Control SNPs
Method for selecting PCs
No PC Adjustment
-0.406 (< 0.001)
Top 2 PCs
0.336 (< 0.001)
Top 10 PCs
PC1 - PC10
0.322 (< 0.001)
PC1 - PC81
Associated with the outcome at α = 0.05
PC1, PC2, PC4, PC8, PC21, PC25, PC28, PC39, PC47, PC49, PC56, PC64, PC77
Associated with the outcome at α = 0.01
PC1, PC4, PC21, PC25, PC28, PC49, PC77
0.322 (< 0.001)
Associated with the outcome at α = 0.001
PC1, PC4, PC28
0.322 (< 0.001)
Associated with the SNP at α = 0.05
varied by SNP
0.333 (< 0.001)
Associated with the SNP at α = 0.01
varied by SNP
0.313 (< 0.001)
0.333 (< 0.001)
-0.406 (< 0.001)
We performed a simulation study in which we generated multiple sets of genome-wide SNPs. The goal was to investigate Type I error and power of associations between case-control status and a SNP when adjusting for ancestry informative PCs selected by a variety of rules. A second aim of this study was to examine more critically the effects of the amount of phenotypic structure and genotypic structure on the association analysis, as well as investigate the bias and precision of the associations.
We did not specifically address the issue of which SNPs to include in the PCA. Using all available SNPs in a PCA provides the maximal information about ancestry, but highly correlated SNPs or unusual chromosomal phenomena such as known inversion polymorphisms or genomic regions known to play a role in susceptibility to a disease can affect the results from a PCA. Under some conditions, including the chromosomal regions with high influence in PCA may have a negative impact on power when PCs are used to adjust for ancestry. For example, if the region harbours a true genetic effect, the effect may be adjusted away. Thus, some researchers have recommended not using PCs that are correlated with localized chromosomal regions.
All simulations were performed using distinct sub-populations. Admixed individuals are commonly used in GWAS. While we did not explicitly simulate admixed individuals, we know based on previous work that PCA to detect ancestry and subsequent adjustment works similarly with admixed individuals having global phenotypic structure. On the other hand, PCs based on genome-wide data do not adequately capture local ancestry or local phenotypic structure [16, 17]. If local phenotypic structure exists, other techniques need to be applied to capture and adjust for local ancestry, such as PCA in the region of the test SNP, or methods that estimate local ancestry proportions such as ANCESTRYMAP or LAMP. Further work needs to be done to determine how to adjust for local ancestry without adjusting away true effects.
We focused our exploration on linear PC adjustment models. We did not investigate adjusting for clusters identified in the individual genotype data because previous work has suggested that linear adjustments are adequate for the population structure typical of European populations. We did not investigate the method of testing the reduction in the inflation of the genomic control lambda or PC-Finder due to the computational burden of these algorithms. Prior to excluding these methods, we investigated the run time for the PC-Finder algorithm and found that it is related to the number of PCs selected. When more PCs are needed to adjust for the structure, the algorithm takes longer to run. With simulated genotypes similar to those presented in this study, PC-Finder requires between 5 minutes and one hour to select PCs. While these methods may be feasible in individual data sets where the algorithm needs to be run only once for the outcome of interest, they are not suited for simulations requiring thousands of replicates. Furthermore, we found in our dichotomized height example that PC-Finder did not select any PCs for adjustment and therefore did not remove the false association with the LCT SNP.
Our findings suggest that to optimize power under certain scenarios, the choice of covariate PCs in a genome-wide association study using logistic regression with a dichotomous outcome should be SNP-dependent. Our findings only apply to case-control or dichotomous outcome analyses using logistic regression. These results may appear to conflict with Xing and Xing, who recently clarified that covariate adjustment in logistic regression always leads to a loss of precision, but not always a loss of power. They conclude that when the genotype and covariates are independent, it is still more efficient to adjust for the predictive covariates. In contrast, we found that with large phenotypic structure and non-structured genotype, it is not more efficient to adjust for ancestry informative PCs. This is due to the very large (odds ratio > 10) association between the population and outcome and the exposure with a frequency of 50% (i.e., half the sample is in population 1 and the other half is in population 2). When we extended Xing and Xing's simulations to larger exposure odds ratios with the exposure frequency of 50%, we obtained results consistent with our findings above. Whether to adjust for covariates depends on the complex relationships between the outcome, the covariate, and the genotype. When the phenotype does not have substantial structure, we obtain similar power when adjusting for population structure as when not adjusting for population structure. Xing and Xing limited their investigation to the situation where the covariate and genotype are independent. In our work, the SNP of interest and the ancestry informative PCs may not be independent. The SNP genotype is included in the linear combination of the genotypes that define the PCs; structured SNPs contribute a higher weight to some PCs. We found that with structured SNPs and non-structured phenotypes, it is more efficient to adjust for PCs.
For linear regression using continuous phenotypes, one can check phenotypes for association with the PCs. If a top PC is significantly associated with the phenotype of interest, then the trait-genotype association model should include PCs as covariates to adjust for population structure. Unlike logistic regression, adjusting for covariates associated with the trait in linear regression always improves the precision of the effect estimate by reducing the residual variance. Since the PCs are orthogonal, a single model regressing the outcome on the top PCs can be used to determine if the PCs are associated with the outcome. Associated PCs should be included as covariates in genome-wide association studies (GWAS).
Standard practice is to include a fixed number of PCs in association models for GWAS. Here, we conclude that if power is not a concern, then selecting the same set of PCs for adjustment for all SNPs in logistic regression is a strategy that achieves appropriate Type I error. However, standard practice is not optimal in all scenarios, and to optimize power for structured SNPs in the presence of unstructured phenotypes, PCs that are associated with the tested SNP should be included in the logistic model. The gain in power we observed in our simulations was an increase of approximately 5 percentage points for adjusting only when the SNP is structured over always adjusting for the ancestry informative PCs. We note that some of the differences in power may disappear if we correct for Type I error, but this is not done in practice. It may be easier and more intuitive to adjust for the same set of PCs across all SNP associations.
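The reason a single model suffices follows from the least-squares solution. As a brief sketch, assume the PCs x_1, ..., x_m are computed on the analysis sample, are mean-centred, and are mutually orthogonal (the notation here is introduced for illustration only). Then the cross-product matrix of the design is diagonal and each coefficient reduces to its simple-regression value:

```latex
X = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_m \end{bmatrix},
\qquad
X^{\top}X = \mathrm{diag}\!\left(n,\; x_1^{\top}x_1,\; \dots,\; x_m^{\top}x_m\right),

\hat{\beta} = \left(X^{\top}X\right)^{-1}X^{\top}y
\;\;\Longrightarrow\;\;
\hat{\beta}_k = \frac{x_k^{\top}y}{x_k^{\top}x_k}, \qquad k = 1,\dots,m.
```

Adding or removing other PCs therefore does not change the estimate for PC k; only the shared residual variance estimate, and hence the standard errors, depends on which PCs are in the model.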
A portion of this research was conducted using the Linux Clusters for Genetic Analysis (LinGA) computing resource funded by the Robert Dawson Evans Endowment of the Department of Medicine at Boston University School of Medicine and Boston Medical Center and contributions from individual investigators.
- Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2 (12): e190. 10.1371/journal.pgen.0020190.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38 (8): 904-909. 10.1038/ng1847.
- Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM: A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet. 2007, 81 (5): 895-905. 10.1086/521372.
- Yu K, Wang Z, Li Q, Wacholder S, Hunter DJ, Hoover RN, Chanock S, Thomas G: Population substructure and control selection in genome-wide association studies. PLoS ONE. 2008, 3 (7): e2551. 10.1371/journal.pone.0002551.
- Balding DJ, Nichols RA: DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands. Forensic Sci Int. 1994, 64 (2-3): 125-140. 10.1016/0379-0738(94)90222-4.
- Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies. Nat Genet. 2004, 36 (5): 512-517. 10.1038/ng1337.
- Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A, Groop L, Saetta AA, Korkolopoulou P: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008, 4 (1): e236. 10.1371/journal.pgen.0030236.
- Ding X, Weiss S, Raby B, Lange C, Laird NM: Impact of population stratification on family-based association tests with longitudinal measurements. Stat Appl Genet Mol Biol. 2009, 8 (1): Article 17.
- Robinson L, Jewell NP: Some Surprising Results about Covariate Adjustment in Logistic Regression Models. International Statistical Review. 1991, 59 (2): 227-240. 10.2307/1403444.
- Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG, Hirschhorn JN: Demonstrating stratification in a European American population. Nat Genet. 2005, 37 (8): 868-872. 10.1038/ng1607.
- Lettre G, Jackson AU, Gieger C, Schumacher FR, Berndt SI, Sanna S, Eyheramendy S, Voight BF, Butler JL, Guiducci C: Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet. 2008, 40 (5): 584-591. 10.1038/ng.125.
- Zhang L, Li J, Pei YF, Liu Y, Deng HW: Tests of association for quantitative traits in nuclear families using principal components to correct for population stratification. Ann Hum Genet. 2009, 73 (Pt 6): 601-613.
- Zou F, Lee S, Knowles MR, Wright FA: Quantification of population structure using correlated SNPs by shrinkage principal components. Hum Hered. 2010, 70 (1): 9-22. 10.1159/000288706.
- Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, Boehm F, Caporaso NE, Cornelis MC, Edenberg HJ: Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol. 2010, 34 (6): 591-602. 10.1002/gepi.20516.
- Zhu X, Li S, Cooper RS, Elston RC: A unified association analysis approach for family and unrelated samples correcting for stratification. Am J Hum Genet. 2008, 82 (2): 352-365. 10.1016/j.ajhg.2007.10.009.
- Qin H, Morris N, Kang SJ, Li M, Tayo B, Lyon H, Hirschhorn J, Cooper RS, Zhu X: Interrogating local population structure for fine mapping in genome-wide association studies. Bioinformatics. 2010, 26 (23): 2961-2968. 10.1093/bioinformatics/btq560.
- Wang X, Zhu X, Qin H, Cooper RS, Ewens WJ, Li C, Li M: Adjustment for local ancestry in genetic association analysis of admixed populations. Bioinformatics. 2011, 27 (5): 670-677. 10.1093/bioinformatics/btq709.
- Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O'Brien SJ, Altshuler D: Methods for high-density admixture mapping of disease genes. Am J Hum Genet. 2004, 74 (5): 979-1000. 10.1086/420871.
- Sankararaman S, Sridhar S, Kimmel G, Halperin E: Estimating local ancestry in admixed populations. Am J Hum Genet. 2008, 82 (2): 290-303. 10.1016/j.ajhg.2007.09.022.
- Peloso GM, Timofeev N, Lunetta KL: Principal-component-based population structure adjustment in the North American Rheumatoid Arthritis Consortium data: impact of single-nucleotide polymorphism set and analysis method. BMC Proc. 2009, 3 (Suppl 7): S108. 10.1186/1753-6561-3-s7-s108.
- Li Q, Wacholder S, Hunter DJ, Hoover RN, Chanock S, Thomas G, Yu K: Genetic background comparison using distance-based regression, with applications in population stratification evaluation and adjustment. Genet Epidemiol. 2009, 33 (5): 432-441. 10.1002/gepi.20396.
- Xing G, Xing C: Adjusting for covariates in logistic regression models. Genet Epidemiol. 2010, 34 (7): 769-771. 10.1002/gepi.20526. Author reply 772.
- Novembre J, Stephens M: Interpreting principal component analyses of spatial population genetic variation. Nat Genet. 2008, 40 (5): 646-649. 10.1038/ng.139.
- Jewell NP: Statistics for Epidemiology. 2004, Chapman & Hall/CRC.
- Li Q, Yu K: Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008, 32 (3): 215-226. 10.1002/gepi.20296.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.