Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Effects of population structure on genetic association studies
 Hongyan Xu^{1} and
 Sanjay Shete^{1}
DOI: 10.1186/1471-2156-6-S1-S109
© Xu and Shete; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
Population-based case-control association is a promising approach for unravelling the genetic basis of complex diseases. One potential problem of this approach is the presence of population structure in the samples. Using the Collaborative Study on the Genetics of Alcoholism (COGA) single-nucleotide polymorphism (SNP) datasets, we addressed three questions: How can the degree of population structure be quantified, and how does the population structure affect association studies? How accurate and efficient is the genomic control method in correcting for population structure? The amount of population structure in the COGA SNP data was found to inflate the p-value in association tests. Genomic control was found to be effective only when the appropriate number of markers was used in the control group in order to correctly calibrate the test. The approach presented in this paper could be used to select the appropriate number of markers for use in the genomic control method of correcting population structure.
Background
Unraveling the genetic basis of psychiatric diseases such as alcoholism is becoming a major challenge and focus of genetic studies, and large-scale case-control association studies at the genomic level are a promising approach. One potential problem for association studies is the presence of population structure in the samples, which raises the potential for confounding and spurious results. For example, if the samples come from several subpopulations with different allele frequencies, and if the proportions of cases and controls sampled from each subpopulation are not matched, differences in allele frequencies between cases and controls will appear, mimicking a statistical signal of association and leading to false-positive results. However, there has been much debate over how much population structure exists and how serious a problem it poses to association studies. With the advances in genotyping techniques, association studies can now be carried out at the genomic level using thousands of genetic markers. There have been few studies of the effects of population structure on association studies using such data. Recently, Marchini et al. performed such a study [1]; however, their results were based on simulated samples using a Bayesian model extrapolated from a very limited dataset. Even though the Bayesian model fit their data quite well, it would be of interest to compare their results with those from a study that uses a large set of real data. Therefore, we used the Collaborative Study on the Genetics of Alcoholism (COGA) data from Genetic Analysis Workshop 14 (GAW14) to assess the effects of population structure on large-scale association studies. Three questions were addressed by our study. How can the degree of population structure be quantified, and how does the population structure affect association studies? How accurate and efficient is the genomic control method for correcting for population structure?
Methods
Data
The COGA single-nucleotide polymorphism (SNP) data from GAW14 were used in this study. The data consisted of two sets of SNP genotype data, one from Affymetrix, the other from Illumina. The datasets contained individuals from 143 extended families. A total of 304 unrelated individuals were selected from these families, including the founders and marry-ins from each family. When the genotypes of both founders were not available, one of the children of the founders was randomly selected. Among these 304 unrelated individuals, 265 were White, 30 were Black, and 9 were others. The Affymetrix dataset contained genotypes at 10,810 SNP markers, while the Illumina dataset contained genotypes at 4,596 markers.
Quantifying population structure
To quantify the population structure, we used the statistic F_{ST}, which measures variation in allele frequency between populations. We used the unbiased estimator of F_{ST} at a biallelic SNP described by Weir and Cockerham [2] (see also Weir [3]). Specifically, suppose samples are drawn from S populations and there are two alleles, A and a, at any given SNP. Let the frequency of allele A in the i^{th} population be p_{i}, the average allele frequency across populations be \bar{p}, and the sample size from the i^{th} population be n_{i}. Then the observed mean square error of allele frequency within populations, denoted as MSI, was computed as
MSI = \frac{1}{\sum_{i=1}^{S} (n_{i} - 1)} \sum_{i=1}^{S} n_{i} p_{i} (1 - p_{i}),
and the observed mean square error of allele frequency between populations, denoted as MSP, was computed as
MSP = \frac{1}{S - 1} \sum_{i=1}^{S} n_{i} (p_{i} - \bar{p})^{2}.
F_{ST} can then be estimated as
\hat{F}_{ST} = \frac{MSP - MSI}{MSP + (n_{c} - 1) MSI},
where n_{c} is the average sample size across populations, corrected for the different sample sizes from each population, and is given by
n_{c} = \frac{1}{S - 1} \left( \sum_{i=1}^{S} n_{i} - \frac{\sum_{i=1}^{S} n_{i}^{2}}{\sum_{i=1}^{S} n_{i}} \right).
It is possible for this unbiased estimator to result in values below zero; therefore, because F_{ST} must be a value between 0 and 1, F_{ST} is set to 0 in such situations. We estimated values of F_{ST} for each SNP in our sample of 304 individuals.
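As an illustration of the estimator described above, the following is a minimal Python sketch, not the authors' code. It assumes the standard Weir and Cockerham form: MSP is the sample-size-weighted squared deviation of each p_i from the weighted mean frequency divided by S - 1, MSI is the sum of n_i p_i (1 - p_i) divided by the sum of (n_i - 1), and F_ST = (MSP - MSI) / (MSP + (n_c - 1) MSI). The function name `fst_weir_cockerham` and the choice to return 0 for a monomorphic SNP are assumptions made for the example.

```python
def fst_weir_cockerham(p, n):
    """Estimate F_ST at one biallelic SNP.

    p: frequency of allele A in each population.
    n: sample size of each population (same order as p).
    Negative estimates are truncated to 0, as described in the text.
    """
    S = len(p)
    if S < 2 or len(n) != S:
        raise ValueError("need matching p and n for at least two populations")

    n_total = sum(n)
    # Sample-size-weighted average allele frequency (assumed definition of p-bar).
    p_bar = sum(ni * pi for ni, pi in zip(n, p)) / n_total

    # Mean square between populations (MSP) and within populations (MSI).
    msp = sum(ni * (pi - p_bar) ** 2 for ni, pi in zip(n, p)) / (S - 1)
    msi = sum(ni * pi * (1 - pi) for ni, pi in zip(n, p)) / sum(ni - 1 for ni in n)

    # Average sample size, corrected for unequal sample sizes.
    n_c = (n_total - sum(ni ** 2 for ni in n) / n_total) / (S - 1)

    denom = msp + (n_c - 1) * msi
    if denom == 0:
        # Monomorphic SNP: no variation to partition (assumption: report 0).
        return 0.0
    return max(0.0, (msp - msi) / denom)
```

For example, `fst_weir_cockerham([0.6, 0.4], [100, 100])` gives a small positive value, while two populations with identical frequencies give a slightly negative raw estimate that is truncated to 0.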
Measuring association at an SNP locus
In order to quantify the effects of population structure on tests for association in our sample, we randomly assigned case/control status while keeping the population structure observed in the total sample. We randomly chose 152 Whites to be cases, and the remaining individuals were assigned to be controls. For each SNP locus, 1,000 such random assignments were performed. The association for each assignment was measured using Armitage's trend test under an additive genetic model [4], which has been shown to be robust against deviations of genotype frequencies from Hardy-Weinberg equilibrium [5]. Suppose the two alleles at an SNP locus are denoted as A and a; then this test statistic is given by
Y^{2} = \frac{N [N (r_{1} + 2 r_{2}) - R (n_{1} + 2 n_{2})]^{2}}{R (N - R) [N (n_{1} + 4 n_{2}) - (n_{1} + 2 n_{2})^{2}]},
where N is the total sample size, R is the number of cases, n_{1} and n_{2} are the number of individuals with genotypes Aa and AA in the sample, respectively, and r_{1} and r_{2} are the number of cases with genotypes Aa and AA, respectively. Under the null hypothesis of no association, and assuming no population structure, Y^{2} should asymptotically follow a χ^{2} distribution with 1 degree of freedom. Because 1,000 random assignments were performed at each locus and Y^{2} was computed for each assignment, the Affymetrix dataset yielded 10,810,000 samples from the distribution of Y^{2} with the specific level of population structure observed in the COGA data. For the Illumina dataset, we generated 4,596,000 such samples. The empirical distribution of the test statistic Y^{2} was compared with the χ^{2} distribution with 1 degree of freedom to study the effects of population structure under the null hypothesis.
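The random-assignment procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the statistic is written in the usual closed form of Armitage's trend test for an additive model, Y^2 = N [N(r_1 + 2 r_2) - R(n_1 + 2 n_2)]^2 / {R (N - R) [N(n_1 + 4 n_2) - (n_1 + 2 n_2)^2]}, genotypes are assumed to be coded as the number of copies of allele A with no missing values, and the names `trend_test` and `null_statistics` are hypothetical.

```python
import random


def trend_test(genotypes, is_case):
    """Armitage trend statistic Y^2 for one SNP under an additive model.

    genotypes: copies of allele A per individual (0 = aa, 1 = Aa, 2 = AA).
    is_case:   True for cases, False for controls (same order).
    """
    N = len(genotypes)
    R = sum(1 for c in is_case if c)
    n1 = sum(1 for g in genotypes if g == 1)
    n2 = sum(1 for g in genotypes if g == 2)
    r1 = sum(1 for g, c in zip(genotypes, is_case) if c and g == 1)
    r2 = sum(1 for g, c in zip(genotypes, is_case) if c and g == 2)

    num = N * (N * (r1 + 2 * r2) - R * (n1 + 2 * n2)) ** 2
    den = R * (N - R) * (N * (n1 + 4 * n2) - (n1 + 2 * n2) ** 2)
    if den == 0:
        # No genotype variation, or no cases/controls (assumption: report 0).
        return 0.0
    return num / den


def null_statistics(genotypes, is_white, n_cases=152, n_reps=1000, seed=0):
    """Draw Y^2 under random case labels restricted to one subpopulation.

    Cases are sampled only from individuals flagged in is_white; everyone
    else is a control, so cases and controls are unmatched on ancestry.
    """
    rng = random.Random(seed)
    white_idx = [i for i, w in enumerate(is_white) if w]
    stats = []
    for _ in range(n_reps):
        cases = set(rng.sample(white_idx, n_cases))
        is_case = [i in cases for i in range(len(genotypes))]
        stats.append(trend_test(genotypes, is_case))
    return stats
```

Running `null_statistics` once per SNP and pooling the results gives an empirical null distribution that can be compared with the χ^{2} distribution with 1 degree of freedom.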
Genomic control
Recently, several statistical methods have been proposed to perform association studies in the presence of population structure. One popular method is genomic control, in which a set of unlinked markers is selected to correct for population structure [6]. The idea is that population structure inflates the test statistic Y^{2} by a constant factor, λ, which can be estimated from the L unlinked markers in the genomic control group. Here we examined the performance of genomic control with a large sample size and small nominal p-values from the real COGA dataset. The L markers were chosen randomly from all the markers in each dataset by assuming a uniform distribution over the markers, with the constraint that the genetic distance between neighboring markers was greater than 1 cM. This selection strategy leads to loci in genomic control that are unlikely to be correlated [7]. As in the study by Bacanu et al. [8], a robust estimator of λ was used, which is given by \hat{λ} = median(Y_{1}^{2}, Y_{2}^{2}, ..., Y_{L}^{2}) / 0.456. As in another recent study [1], any estimate of λ below 1 was set to 1.
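The correction step can be illustrated with a short Python sketch. It assumes the median-based estimator of λ (the median of the control-marker statistics divided by 0.456, the approximate median of a χ^{2} distribution with 1 degree of freedom) and the usual genomic control adjustment in which a candidate statistic is divided by the estimated λ before being referred to that distribution. The function names are hypothetical and chosen for the example.

```python
from math import erfc, sqrt
from statistics import median

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1_MEDIAN = 0.456


def estimate_lambda(control_stats):
    """Robust inflation estimate from the Y^2 values of the L control markers.

    Estimates below 1 are set to 1, as described in the text.
    """
    return max(1.0, median(control_stats) / CHI2_1_MEDIAN)


def corrected_p_value(y2, lam):
    """P-value for a candidate marker after dividing Y^2 by lambda.

    For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt((y2 / lam) / 2.0))
```

With no inflation (λ = 1), `corrected_p_value(3.841, 1.0)` is close to 0.05; with λ = 2, a statistic twice as large is needed to reach the same p-value.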
Results
The mean of F_{ST} in the Affymetrix dataset was 0.085 with a variance of 0.013. The mean of F_{ST} in the Illumina dataset was 0.070 with a variance of 0.006. These results indicate that there is a substantial amount of population structure in our samples. The results are similar to values reported for major human races [9].
Discussion
Figure 1 indicates that the amount of population structure in our sample of unrelated individuals drawn from the COGA families could well inflate the p-values of genetic association studies. This result is the opposite of that obtained by Marchini et al. [1], as illustrated in their Figure 3, where population structure was shown to decrease p-values. The discrepancy could be due to the small sample sizes used in their study.
The results in Figure 1 also indicate that as the nominal p-value decreases, the problem posed by a given amount of population structure becomes more and more serious. This could have important implications for genetic studies in which very large numbers of markers are used. Owing to the thousands of markers tested in such studies, correcting for multiple testing means that a result must reach a much lower p-value to be considered "significant." Usually, the genome-wide significance level is set in the very low range of 10^{-4} to 10^{-8} [10]. Our results indicate that in this range, the problem posed by population structure becomes very serious indeed. As a consequence, the effects of population structure cannot be safely ignored for genome-wide association studies, and steps must be taken to correct for the effects of population structure.
One popular method for correcting the effects of population structure is the genomic control approach, wherein several unlinked markers are genotyped to correct for the observed level of population structure in the sample at hand. The performance of genomic control was assessed in our samples for a variable number of independent markers. The performance of genomic control does vary, depending on the number of markers L examined in the genomic control group. When L is small (e.g., L = 50), the correction is incomplete, resulting in a lax test and false-positive results. When L is large, there is overcorrection, resulting in a conservative test, which would lead to missing real signals. The conclusion by Marchini et al. [1] that "If enough loci are used, then the test will typically be approximately calibrated" does not appear to hold in our analysis. Therefore, choosing the appropriate L becomes critical for correctly calibrating tests for association. The exact reason for this result is unclear, and further validation is required. Simulation studies have suggested that linkage disequilibrium is not likely to extend beyond 5 kb, even in relatively isolated populations [7]. Because the genetic distance between neighboring markers in our study was at least 1 cM, and using the approximation 1 cM ≈ 1 Mb, it is unlikely that correlation between markers could be the reason. Therefore, one possible explanation is that in our dataset, there could be much variation in λ across the genome, which is not accounted for in the estimation of λ. Following the procedure proposed in this report, a grid search could be performed for the appropriate L at a specific level of significance.
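One way such a grid search over L might be organized is sketched below in Python. This is a hypothetical outline, not the procedure used in this study: it assumes a pool of null Y^{2} statistics is already available (for example, from random case/control assignments), draws both the control markers and the tested markers from that same pool for simplicity, estimates λ as the median divided by 0.456 with a floor of 1, and picks the candidate L whose empirical type I error rate is closest to the nominal level. All function names are invented for the example.

```python
import random
from math import erfc, sqrt
from statistics import median


def chi2_1_sf(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))


def error_rate_for_L(null_stats, L, alpha, n_reps=50, seed=0):
    """Average fraction of null statistics rejected at level alpha
    after genomic control with L randomly chosen control markers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        control = rng.sample(null_stats, L)
        lam = max(1.0, median(control) / 0.456)
        rejected = sum(1 for y2 in null_stats if chi2_1_sf(y2 / lam) < alpha)
        total += rejected / len(null_stats)
    return total / n_reps


def grid_search_L(null_stats, candidates, alpha, n_reps=50, seed=0):
    """Return the candidate L whose empirical error rate is closest to alpha,
    together with the error rate observed for every candidate."""
    rates = {L: error_rate_for_L(null_stats, L, alpha, n_reps, seed)
             for L in candidates}
    best = min(candidates, key=lambda L: abs(rates[L] - alpha))
    return best, rates
```

A call such as `grid_search_L(stats, [50, 100, 500], alpha=1e-4)` would return the selected L and the per-candidate error rates; very small nominal levels need a correspondingly large pool of null statistics for the rates to be estimated reliably.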
Conclusion
Through our analysis based on real datasets, we have shown that population structure inflates the p-values in genetic association studies, especially in cases of very small p-values. Therefore, the effects of population structure cannot be safely ignored in large-scale association studies at the genomic level, where the p-value is usually required to be very small in order to achieve statistical significance. Genomic control is an effective way to correct for the effects of population structure, but only when the appropriate number of markers is used. The approach proposed in this paper could be used to select the appropriate number of markers. However, caution must be taken because the reason that performance varies with the number of loci in the genomic control group is unclear and may depend on several other factors that were not considered here.
Abbreviations
COGA: Collaborative Study on the Genetics of Alcoholism
GAW14: Genetic Analysis Workshop 14
MSI: Mean square error of allele frequency within populations
MSP: Mean square error of allele frequency between populations
SNP: Single-nucleotide polymorphism
Declarations
Acknowledgements
The study was supported in part by the "Chief" Dauphin Memorial Postdoctoral Fellowship Fund from The University of Texas M. D. Anderson Cancer Center. We thank two reviewers for several helpful suggestions.
Authors’ Affiliations
References
1. Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies. Nat Genet. 2004, 36: 512-517. 10.1038/ng1337.
2. Weir BS, Cockerham CC: Estimating F-statistics for the analysis of population structure. Evolution. 1984, 38: 1358-1370. 10.2307/2408641.
3. Weir BS: Genetic Data Analysis II. 1996, Sunderland, MA: Sinauer Associates.
4. Armitage P: Tests for linear trends in proportions and frequencies. Biometrics. 1955, 11: 375-386. 10.2307/3001775.
5. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.
6. Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.
7. Kruglyak L: Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet. 1999, 22: 139-144. 10.1038/9642.
8. Bacanu SA, Devlin B, Roeder K: The power of genomic control. Am J Hum Genet. 2000, 66: 1933-1944. 10.1086/302929.
9. Nei M: Molecular Population Genetics and Evolution. 1975, New York: American Elsevier.
10. Risch NJ: Searching for genetic determinants in the new millennium. Nature. 2000, 405: 847-856. 10.1038/35015718.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.