Volume 6 Supplement 1
The efficacy of short tandem repeat polymorphisms versus single-nucleotide polymorphisms for resolving population structure
© Kauwe et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Accurately resolving population structure in a sample is important for both linkage and association studies. In this study we investigated the power of single-nucleotide polymorphisms (SNPs) in detecting population structure in a sample of 286 unrelated individuals. We varied the number of SNPs to determine how many are required to approach the degree of resolution obtained with the Collaborative Study on the Genetics of Alcoholism (COGA) short tandem repeat polymorphisms (STRPs). In addition, we selected SNPs with varying minor allele frequencies (MAFs) to determine whether low or high frequency SNPs are more efficient in resolving population structure. We conclude that a set of at least 100 evenly spaced SNPs with MAFs of 40–50% is required to resolve population structure in this dataset. If SNPs with lower MAFs are used, then more than 250 SNPs may be required to obtain reliable results.
Accurately resolving population structure in a sample is important for both linkage and association studies. Understanding population structure can allow us to use homogeneous study groups, thus improving our ability to detect population-specific linkage and ensuring that false linkage is not detected because of erroneously assigned allele frequencies. In association studies, differences in population structure between cases and controls can result in high rates of both type I and type II errors [e.g., 1–3]. When population structure can be resolved, cases and controls can be matched and one possible confounding factor can be removed.
Short tandem repeat polymorphisms (STRPs) are highly variable markers that have proven to be very useful in resolving population structure. However, single-nucleotide polymorphism (SNP) assays are efficient and inexpensive, and the use of SNPs has become widespread. The resolving power of a set of SNPs will depend upon both the density of the markers and their frequencies. SNPs with minor allele frequencies (MAFs) near 0.50 are assumed to be more ancient, while SNPs with low MAFs are assumed to be much more recent. One hypothesis is that SNPs with high MAFs predate the origins of modern human races and carry little useful information about population structure. It follows that SNPs with low MAFs, being much more recent polymorphisms, may be more informative in resolving population structure. Alternatively, the low heterozygosity of these SNPs may limit their usefulness (since the allele frequency differences between two populations would perforce be low); in this case, SNPs with high MAFs will be far more informative.
In this study we investigate the power of SNPs in detecting population substructure using results from STRPs as the gold standard. We first investigate the number of SNPs that are required to obtain results comparable to those of STRPs. Second, we determine whether lower or higher frequency SNPs provide more information regarding population structure.
This study includes 286 unrelated individuals from the Collaborative Study on the Genetics of Alcoholism (COGA) pedigrees. We preferentially selected founders; however, in cases where 2 founders had large amounts of missing data, we selected the offspring for whom the most data were available. The self-reported race of these 286 individuals was as follows: 245 European Americans, 26 African Americans, 12 European American/Hispanics, and 3 African American/Hispanics. Each of these individuals was genotyped for the 328 STRPs from COGA, 4,720 SNPs from the Illumina linkage panel, and 11,120 SNPs from the Affymetrix mapping array, which were prepared for Genetic Analysis Workshop 14.
Table 1. A description of MAFs and rates of missing data in the 1,000 SNP sets: Illumina Low Frequency, Illumina High Frequency, Affymetrix Low Frequency, and Affymetrix High Frequency.
Each of these groups was then analyzed with the computer program STRUCTURE to identify possible sub-structure in the sample of 286 unrelated individuals. This method assumes that the sample contains a mixture of subpopulations and that within each subpopulation there is Hardy-Weinberg and linkage equilibrium between markers. This program identifies subpopulations of individuals who are genetically similar through a Markov chain Monte Carlo (MCMC) sampling procedure. The STRP and SNP sets were run for 10,000 iterations after 10,000 burn-in replicates assuming a two-cluster solution. Although we tested various numbers of clusters, the solutions involving more than two clusters merely divided the European Americans into subgroups; for maximum parsimony, we used only the two-cluster solution. The STRP results were used as a gold standard for comparison.
To compare the assignment probabilities of each individual between the SNPs and STRPs, we took the absolute value of the difference between the probability generated by each set of SNPs and the one generated by the STRPs for each individual, and averaged it across the individuals genotyped in both sets. We subtracted that average from one to obtain a measure of percent similarity between the SNP and STRP results. For example, assume that the probability of membership in groups 1 and 2 for a single individual in a SNP run is completely ambiguous (0.5 and 0.5, respectively). When this is compared with STRP probabilities of approximately 1 and 0, the percent similarity for that individual would be calculated as 1 - |1 - 0.5| = 0.5, or 50% similarity. This score implies a completely ambiguous assignment versus an absolute assignment and is the lowest expected score.
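The similarity measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the dictionary-based input, and the individual IDs are assumptions. It also assumes that cluster labels have already been aligned between the two runs, since cluster numbering in a clustering run is arbitrary.

```python
from typing import Mapping


def percent_similarity(
    snp_probs: Mapping[str, float],
    strp_probs: Mapping[str, float],
) -> float:
    """Return 1 minus the mean absolute difference in cluster-1 membership
    probability, averaged over individuals present in both runs.

    Inputs map a (hypothetical) individual ID to that individual's
    probability of membership in cluster 1. With two clusters, the
    cluster-2 probability is 1 - p, so its absolute difference is identical.
    """
    shared = sorted(set(snp_probs) & set(strp_probs))
    if not shared:
        raise ValueError("no individuals were genotyped in both sets")
    mean_abs_diff = sum(
        abs(snp_probs[i] - strp_probs[i]) for i in shared
    ) / len(shared)
    return 1.0 - mean_abs_diff


# The worked example from the text: an ambiguous SNP assignment (0.5)
# compared with a near-absolute STRP assignment (1.0).
example = percent_similarity({"ind1": 0.5}, {"ind1": 1.0})
print(example)
```

Individuals missing from either run are simply dropped before averaging, which matches the statement that the average is taken over individuals genotyped in both sets.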
We also compared the allele frequencies of the SNPs in the self-reported European American and African American groups. First, the SNPs were sorted by the percentage of heterozygotes present in the European American individuals. Then we took the squared difference between the allele frequencies of the two groups for each SNP in both the Illumina and Affymetrix datasets; this value is considered a measure of information content for each SNP. By this measure, a SNP that is fixed for one allele in European Americans and the other allele in African Americans would have an information content of 1.
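As a rough sketch of this calculation, the Python below computes allele frequencies, the heterozygote fraction, and the squared-difference information content. The 0/1/2 genotype coding, the function names, and the toy SNP data are all assumptions made for illustration; none of them come from the COGA dataset.

```python
from typing import Optional, Sequence

# Assumed coding: number of copies of one chosen allele (0, 1, or 2);
# None marks a missing genotype.
Genotype = Optional[int]


def allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """Frequency of the counted allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(called) / (2 * len(called))


def heterozygote_fraction(genotypes: Sequence[Genotype]) -> float:
    """Fraction of called genotypes that are heterozygous."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(1 for g in called if g == 1) / len(called)


def information_content(
    group_a: Sequence[Genotype], group_b: Sequence[Genotype]
) -> float:
    """Squared difference in allele frequency between two groups."""
    return (allele_frequency(group_a) - allele_frequency(group_b)) ** 2


# Hypothetical toy data: (group A genotypes, group B genotypes) per SNP.
snps = {
    "snp_fixed": ([2, 2, 2, 2], [0, 0, 0, 0]),   # fixed for opposite alleles
    "snp_shared": ([1, 1, 0, 2], [1, 0, 2, 1]),  # same frequency in both
    "snp_rare": ([0, 0, 0, 1], [0, 1, 1, 0]),    # low frequency in both
}

# Rank SNPs from most to least informative.
ranked = sorted(
    snps, key=lambda name: information_content(*snps[name]), reverse=True
)
for name in ranked:
    a, b = snps[name]
    print(name, information_content(a, b), heterozygote_fraction(a))
```

A SNP fixed for opposite alleles in the two groups scores 1, and a SNP with equal frequencies in both groups scores 0, which are the two ends of the scale described above.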
Probability of cluster membership based on the 328 COGA STRPs (average probability of membership in each cluster, by self-reported race)
In the case of the Affymetrix low-frequency SNPs, the results do converge on that of the STRPs, but the solution appears to be multimodal. This artifact may be due to the extremely low MAFs in that set (Table 1).
The results in Figure 1 support the findings of a previous study that suggests that approximately 100 SNPs are required to resolve population structure in a sample. However, these data also suggest that when SNPs with low MAFs are used, more than 250 are required to obtain reliable results. We hypothesized that the SNPs with MAFs near 0.50 may be much older than the origins of modern human races and therefore less informative than lower-frequency SNPs that may have appeared closer to the time that modern Homo sapiens left Africa. However, these data suggest that, on average, high-frequency SNPs provide better information. In fact, fewer high-frequency SNPs than low-frequency SNPs are required to closely approach the STRP results. It appears that the potentially greater sensitivity of more recent SNPs does not translate into an increase in useful information. It is also interesting to note that the similarity between the STRP and SNP runs never reaches 100%. The 96% similarity that is reached by these SNP sets is equivalent to an average difference in probability of just 4% per individual. This is likely due to the MCMC nature of STRUCTURE; multiple runs of the STRP dataset show a similar level of variation (even with 100,000 burn-in and 100,000 iterations).
By our measure of information content, low-frequency SNPs have the greatest potential to be informative. In fact, the 10 most informative SNPs have fewer than 20% heterozygotes in the European American sample. However, these low-frequency SNPs also have great potential to provide information that is useless or misleading with regard to the ancestry of an individual. The vast majority of SNPs with a low percentage of heterozygotes provide little or no useful information. As a result, high-frequency SNPs, on average, seem to provide better information. This is a likely explanation for why the results in Figure 1 show that far fewer SNPs of high MAF are required to resolve population structure effectively in this sample.
The 20 most informative SNPs, their information content (IC), and MAFs in the COGA GAW14 data
These data suggest that SNPs are a cost-effective and informative replacement for STRPs when used to detect population structure. Based on these results, a set of approximately 100 SNPs with MAFs of 40–50% can resolve population structure. If SNPs with lower MAFs are randomly chosen, then more than 250 SNPs may be required to obtain reliable results. However, these results identify a subset of 20 SNPs that also reliably resolve population structure in this sample. These results suggest that a small "genomic control subset" selected based on allele frequency differences in the two populations could be quite useful. Although the regression indicates that any SNP, on average, shows a 5% allele frequency difference between the two populations, our results show that the SNPs with higher MAF are more useful for the STRUCTURE analyses. Thus, it is better, in general, to choose a SNP with MAFs of 0.45 and 0.5 in Caucasians and African Americans, respectively, than to choose a SNP with MAFs of 0 and 0.05.
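The closing comparison can be checked with a little arithmetic. The sketch below applies the squared-difference measure from the Methods to both example SNPs and also reports expected heterozygosity under Hardy-Weinberg proportions, 2p(1 - p). Expected heterozygosity is a standard population-genetic quantity, but using it here is an assumption made for illustration; the study does not report this calculation, and the function names are hypothetical.

```python
def information_content(p_a: float, p_b: float) -> float:
    """Squared difference in allele frequency between two populations."""
    return (p_a - p_b) ** 2


def expected_heterozygosity(p: float) -> float:
    """Expected heterozygote fraction for a biallelic SNP under
    Hardy-Weinberg proportions: 2p(1 - p)."""
    return 2.0 * p * (1.0 - p)


# The two example SNPs from the text: (MAF in population 1, MAF in population 2).
examples = {
    "high MAF": (0.45, 0.50),
    "low MAF": (0.00, 0.05),
}

for label, (p1, p2) in examples.items():
    print(
        label,
        "information content:", round(information_content(p1, p2), 6),
        "expected heterozygosity:",
        round(expected_heterozygosity(p1), 4),
        round(expected_heterozygosity(p2), 4),
    )
```

Both SNPs have the same 0.05 frequency difference and therefore the same squared-difference score, but the low-MAF SNP is monomorphic in one population and rarely heterozygous in the other, so few individuals carry the distinguishing allele. This is offered only as one way to read the comparison, not as the mechanism behind the STRUCTURE results.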
COGA: Collaborative Study on the Genetics of Alcoholism
MAF: Minor allele frequency
MCMC: Markov chain Monte Carlo
STRP: Short tandem repeat polymorphism
- Knowler WC, Williams RC, Pettitt DJ, Steinberg AG: Gm3-5,13,14 and type-2 diabetes-mellitus: an association in American-Indians with genetic admixture. Am J Hum Genet. 1988, 43: 520-526.
- Pritchard JK, Rosenberg NA: Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999, 65: 220-228. 10.1086/302449.
- Risch N: Searching for genetic determinants in the new millennium. Nature. 2000, 405: 847-856. 10.1038/35015718.
- Templeton AR, Weiss KM, Nickerson DA, Boerwinkle E, Sing CF: Cladistic structure within the human lipoprotein lipase gene and its implications for phenotypic association studies. Genetics. 2000, 156: 1259-1275.
- Edenberg HJ, Bierut LJ, Boyce P, Cao M, Cawley S, Chiles R, Doheny KF, Hansen M, Hinrichs T, Jones K, Kelleher M, Kennedy GC, Liu G, Marcus G, McBride C, Murray SS, Oliphant A, Pettengill J, Porjesz B, Pugh EW, Rice JP, Rubano T, Shannon S, Steeke R, Tischfield JA, Tsai YY, Zhang C, Begleiter H: Description of the data from the Collaborative Study on the Genetics of Alcoholism (COGA) and single-nucleotide polymorphism genotyping for Genetic Analysis Workshop 14. BMC Genet. 6 (Suppl 1): S2. 10.1186/1471-2156-6-S1-S2.
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
- Pritchard JK, Donnelly P: Case-control studies of association in structured or admixed populations. Theor Popul Biol. 2001, 60: 227-237. 10.1006/tpbi.2001.1543.
- Turakulov R, Easteal S: Number of SNPS loci needed to detect population structure. Hum Hered. 2003, 55: 37-45. 10.1159/000071808.
- Parra EJ, Marcini A, Akey J, Martinson J, Batzer MA, Cooper R, Forrester T, Allison DB, Deka R, Ferrell RE, Shriver MD: Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet. 1998, 63: 1839-1851. 10.1086/302148.
- Oksenberg JR, Barcellos LF, Cree BA, Baranzini SE, Bugawan TL, Khan O, Lincoln RR, Swerdlin A, Mignot E, Lin L, Goodin D, Erlich HA, Schmidt S, Thomson G, Reich DE, Pericak-Vance MA, Hains JL, Hauser SL: Mapping multiple sclerosis susceptibility to the HLA-DR locus in African Americans. Am J Hum Genet. 2004, 74: 160-167. 10.1086/380997.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.