# Clustering by genetic ancestry using genome-wide SNP data

- Nadia Solovieff
^{1}Email author, - Stephen W Hartley
^{1}, - Clinton T Baldwin
^{2}, - Thomas T Perls
^{3}, - Martin H Steinberg
^{4}and - Paola Sebastiani
^{1}

**11**:108

https://doi.org/10.1186/1471-2156-11-108

© Solovieff et al; licensee BioMed Central Ltd. 2010

**Received: **11 June 2010

**Accepted: **9 December 2010

**Published: **9 December 2010

## Abstract

### Background

Population stratification can cause spurious associations in a genome-wide association study (GWAS), and occurs when differences in allele frequencies of single nucleotide polymorphisms (SNPs) are due to ancestral differences between cases and controls rather than the trait of interest. Principal components analysis (PCA) is the established approach to detect population substructure using genome-wide data and to adjust the genetic association for stratification by including the top principal components in the analysis. An alternative solution is genetic matching of cases and controls that requires, however, well defined population strata for appropriate selection of cases and controls.

### Results

We developed a novel algorithm to cluster individuals into groups with similar ancestral backgrounds based on the principal components computed by PCA. We demonstrate the effectiveness of our algorithm in real and simulated data, and show that matching cases and controls using the clusters assigned by the algorithm substantially reduces population stratification bias. Through simulation we show that the power of our method is higher than adjustment for PCs in certain situations.

### Conclusions

In addition to reducing population stratification bias and improving power, matching creates a clean dataset free of population stratification which can then be used to build prediction models without including variables to adjust for ancestry. The cluster assignments also allow for the estimation of genetic heterogeneity by examining cluster specific effects.

## Background

In GWAS, varying ancestral backgrounds lead to population stratification inflating the type I error rate [1, 2]. Even European Americans are affected by population stratification bias [3].

PCA [4], spectral graph theory [5], structured association analysis [6, 7] and genomic control [8] can detect underlying population substructure with SNP data. Investigators typically use PCA to detect population structure and then adjust the estimates of genetic effects for the top number of principal components (PCs) to control for population stratification bias [4]. Genomic control divides the test statistic for each SNP by the genomic control inflation factor (λ), defined as the observed median test statistic across all tests genome-wide divided by the expected median test statistic. The test statistic for each SNP is divided by the same value, irrespective of whether the particular SNP is structured, and thus results in a loss of power [4]. A wide range of alternative methods have been proposed to adjust for population stratification by applying an adjustment to the test statistic, using permutation tests or by performing stratified analyses [9–11].

Alternatively, population stratification can be obviated by matching cases and control with respect to genetic ancestry [12]. Matching does not require the investigator to account for population stratification bias by adjusting for PCs or dividing test statistics by the genomic control inflation factor. Guan et al [13] match cases and controls using a genetic similarity score computed directly from genotype data as a weighted identity by state estimate.

We propose a different algorithm in which we identify clusters of varying genetic ancestry using the results of PCA from genome-wide data. The algorithm includes a novel approach to choosing the appropriate number of informative PCs, a clustering step to group subjects into clusters of genetic diversity and a novel "scoring index" (SI) to choose the appropriate number of clusters. Luca et al [14] implemented a similar algorithm for matching cases and controls with the main difference being that parametric tests that may not be robust to departures from model assumptions are used while our algorithm is non-parametric and does not require strict assumptions.

We test the sensitivity and specificity of our algorithm on simulated data and provide applications on African populations from the Human Genome Diversity Project [15] and a cohort of centenarians with European ancestry [16]. We demonstrate that by matching cases and controls within each cluster the inflation in test statistics caused by population stratification bias substantially decreases as compared to unmatched cases and controls and in certain situations matching is more powerful than adjusting for the top number of PCs. More importantly, our method allows for the creation of a dataset free of population stratification which can then be used for prediction modeling [17]. Prediction models containing principal components as covariates are study specific and cannot easily be generalized to a different study in which the values of the PCs are unknown. The clusters produced by the algorithm also allow for the exploration of locus heterogeneity and we provide an example of a SNP in *APOE* in which the odds ratio varies widely across ethnic groups in Europe.

## Results

### Algorithm

### Simulations

Power of Scoring Index

Scenario 1: Equal N | Scenario 2: Unequal N | |||
---|---|---|---|---|

K | Max SI | Optimal SI | Max SI | Optimal SI |

2 | 0.78 | 1 | 0.56 | 0.99 |

3 | 0.97 | 1 | 0.64 | 0.97 |

4 | 0.99 | 1 | 0.86 | 0.99 |

5 | 1 | 1 | 0.99 | 1 |

6 | 0.99 | 1 | 0.99 | 0.99 |

7 | 1 | 1 | 0.99 | 0.99 |

8 | 1 | 1 | 0.99 | 0.99 |

9 | 1 | 1 | 0.99 | 0.99 |

10 | 1 | 1 | 1 | 1 |

### Real Data with African Ancestry

^{th}cluster and show a stronger similarity with each other than with the other Africans in this analysis. The plots of the PCs support the clustering of these subjects (Figure 3) and are consistent with the origin of the Bantu near the southern boundary of modern Nigeria and Cameroon and the known genetic similarity between Bantu and Yorubans [18, 19]. In the plot of PC6 and PC7 (Figure 3) we see a distinct separation of the Mandenka from the Yorubans and Bantu which indicates that these PCs are important to increase the sensitivity of clustering. The remaining Mozabite subjects were split among 3 clusters. This genetic diversity is consistent with the history of the Mozabites and recent studies that reveal genetic ancestry from sub-Saharan Africa, the Middle East and Europe [18]. Previously reported analyses of these data that were based on inspection of the first two PCs failed to detect this level of diversity based on purely genetic data. Patterson et al [19] could only detect differences between San and Bantu/Yorubans combined. Our analysis shows that a larger number of PCs is necessary to detect finer population structure that could introduce confounding in GWAS and confirms the utility of the heatmap plot of PCs.

### Real Data with European Ancestry

### Effect of Matching on the Power of a GWAS

### Effect of Matching on Power

False Positive Rate of Matched and PC adjusted Unmatched GWAS

Scenario | Cases/Controls | False Positive Rate* |
---|---|---|

Matched | 2000/2000 | 5.00E-07 |

Unmatched + PCs | 2000/2000 | 8.00E-07 |

Unmatched + PCs | 2000/2250 | 9.00E-07 |

Unmatched + PCs | 2000/2500 | 1.30E-06 |

Unmatched + PCs | 2000/2750 | 1.20E-06 |

Unmatched + PCs | 2000/3000 | 9.00E-07 |

Unmatched + PCs | 2000/3250 | 1.00E-06 |

Unmatched + PCs | 2000/3500 | 8.00E-07 |

Unmatched + PCs | 2000/3750 | 9.00E-07 |

Unmatched + PCs | 2000/4000 | 8.00E-07 |

### Locus Heterogeneity

*APOE*known to be associated with exceptional longevity and other age-related diseases [21, 23], varies greatly among European ethnic groups (Figure 9). Interestingly the association between EL and this SNP also varies between clusters with odds ratios ranging from 0.38 in the Italians to 1.09 in the Irish Celtics. (Figure 9) Small differences in allele frequencies can affect the power to detect a true main effect when the SNP is part of a SNP- SNP interaction [24] and thus close examination of cluster specific allele frequencies between discovery and replication sets can be useful for replication.

## Discussion

Our analyses suggest that the algorithm works well in simulated datasets and in real data of Africans and Caucasians of European ancestry and can reduce or eliminate population stratification. Choosing the appropriate number of PCs for clustering is a critical choice. We chose the number of PCs by evaluating patterns observed in a heatmap and in a scree plot; although subjective, in many situations a clear choice is evident. We emphasize that investigators should carefully examine the results of the PCA prior to clustering to ensure that population substructure in fact exists in the data. If population substructure is absent, clustering is unnecessary and will lead to overfitting. Although we use k-means clustering in our algorithm [25], alternative techniques including k-medoids, hierarchical clustering, and model based clustering could be used. Model based clustering requires parametric assumptions about the distribution of the data and it can outperform k-means clustering when the assumptions are correct but perform poorly when they are not.

We have shown that matching is an effective method to reduce the inflation in test statistics due to population stratification both in simulated cases and real data. Matching in some situations is more powerful than using an unmatched dataset and adjusting for PCs. Our method also allows one to create a dataset free of population stratification which can be used to build prediction models. We created a clean dataset for the NECS which was used to build a model of exceptional longevity and we replicated the model in an independent study with high accuracy [17]. The advantage of this approach is that we did not need to account for population substructure in the model and were able to generalize it to an independent study. Including PCs in a prediction model causes the model to be study specific and may be difficult to reproduce.

Furthermore as the availability of external controls increases through organizations like dbSNP, investigators can use this method to match the controls to cases with respect to ancestry and reduce the issue of population stratification. Zhuang et al [27] show that expanding the control group can improve power. Of course, investigators must take precautions to avoid introducing other forms of bias due to the use of external controls.

As next generation sequencing and the discovery and analysis of rare variants continue to emerge, population stratification will remain an obstacle. Our algorithm can easily be implemented to match cases and controls when selecting subjects for sequencing and insure that the selection adequately represents all ethnicities in the sample thus increasing the chance of finding true novel polymorphisms.

## Conclusions

We developed an algorithm to cluster individuals into ethnic groups based on PCs computed from SNP data. We showed that the algorithm works well in real data of Africans and Caucasians with European ancestry and also in simulated data. The cluster assignments can be used to effectively match cases and controls in GWAS to reduce and even eliminate population stratification bias. Matching can also aid in genetic risk prediction models by creating a dataset free of population stratification. Furthermore, the cluster assignments can be used to study locus heterogeneity and identify SNPs and genes that have a different effect on the phenotype in different ethnic groups.

## Methods

### Simulated Datasets

#### Dataset with No Population Substructure

To test the algorithm on a dataset containing no population substructure, we simulated 100,000 SNPs in linkage equilibrium for 2000 individuals with no underlying pattern. The allele frequencies for each SNP were randomly simulated with values ranging between 0.05 and 0.95. Genotype frequencies were computed from the allele frequencies assuming Hardy Weinberg equilibrium and individuals were randomly assigned genotypes according to the genotype frequencies.

#### Dataset with Known Structure

We simulated genotype data in which the number of clusters ranged from k = 2 to 10. Each dataset contained 1000 individuals and 10,000 SNP in linkage equilibrium. Each cluster was simulated with Fst = 0.01 corresponding to differences observed among divergent European populations [28]. For each SNP the founder allele frequency, p, was selected from a uniform(0.05,0.50) distribution, the allele frequency for each cluster, p_{k}, was selected from a beta distribution with shape parameter α = p(1-Fst)/Fst and β = (1-p)*(1-Fst)/Fst and the genotype frequencies were computed as (Fst(1-p_{k}) + (1-Fst)(1- p_{k})^{2}, 2(1-Fst) p_{k}(1- p_{k}), Fst p_{k} + (1-Fst) p_{k}^{2}) [29, 30]. In scenario 1 we split the subjects equally among the clusters, rounding up to the nearest integer. In scenario 2, the sample size within each cluster was randomly chosen with the requirement that each cluster must contain at least 5% of the total sample size. In each scenario, 900 dataset were simulated (100 for each cluster size k). For each dataset, a PCA was performed and the clustering algorithm was used to cluster individuals into groups. The sensitivity of the algorithm is measured by the proportion of datasets in which the cluster size is correctly identified by the scoring index and the algorithm correctly allocates all the subjects to their respective cluster.

#### Power Analysis

To simulate a scenario in which subjects are matched with respect to their ancestry, we simulated 2000 cases and 2000 controls from 4 underlying clusters with the probability (0.4, 0.3, 0.2, 0.1) of being in clusters 1, 2, 3 or 4, respectively. In the unmatched scenario with the same sample size, we generated 2000 cases from 4 clusters with probability (0.4, 0.3, 0.2, 0.1) and 2000 controls with probability (0.1, 0.2, 0.3, 0.4). To examine the effect of adding controls to the unmatched scenario, we simulated a pool of 2000 extra controls from 4 clusters with probability (0.1, 0.2, 0.3, 0.4). We then randomly added a varying number of controls (250, 500, 750, 1000, 1250, 1500, 1750, 2000) to the unmatched scenario. 10,000 SNPs in linkage equilibrium unrelated to disease status were simulated according to the model described in the previous section and were used to estimate the false positive rate. Twenty-four causal SNPs were generated with relative risks (R) of 1.2, 1.3, 1.4, 1.5 and minor allele frequencies (p) of 0.1, 0.2, 0.3, 0.4 and 0.5. The genotype frequencies for the controls were simulated as previously described. For the cases the genotype frequencies were computed as (Fst(1-p_{k}) + (1-Fst)(1- p_{k})^{2}, 2R(1-Fst) p_{k} (1- p_{k}), R^{2}(Fst p_{k} + (1-Fst) p_{k})) where R corresponds to the relative risk. These frequencies were scaled by the sum of the 3 genotype frequencies so that the probabilities added up to 1.

### Real Datasets

#### Human Genome Diversity Project (HGDP) Africans

Populations and Studies

Population/Dataset | N |
---|---|

Human Genome Diversity Project: African | |

Bantu | 20 |

Biaka | 31 |

Mandenka | 24 |

Mbuti pygmy | 15 |

Mozabite | 30 |

San | 6 |

Yoruba | 24 |

New England Centenarian Study (NECS) | 1341 |

Illumina Database | 3613 |

#### New England Centenarian Study (NECS) set

A subset of subjects from the New England Centenarian Study containing 1051 cases and 290 controls was used to test the algorithm. This study was approved by the Boston University Institutional Review Board. The initial PCA, containing 298,734 SNPs, identified many chromosomal regions with elevated SNP weights due to strong LD in those regions. We therefore removed SNPs in strong LD using the program PLINK [31] with a SNP window of 50, sliding window of 5 SNPs and removed 1 SNP from each pair of SNPs with r^{2} > 0.30. The final dataset contained 96,457 SNPs with a call rate greater than 95% and minor allele frequency greater than 5%. PCs from final dataset did not have elevated SNP weights for any chromosomal regions for the top 20 PCs. We found that setting the r^{2} threshold higher than 0.30 resulted in elevated SNP weights in many chromosomal regions. All subjects had a call rate greater than 93%.

#### NECS and Illumina Control set

3,613 controls labeled as Caucasian were selected from the Illumina control database (iControlDB) and combined with the NECS controls. There were 298,734 SNPs common to the NECS and Illumina datasets that had a SNP call rate > 0.95 and MAF > 0.05. SNPs in strong LD were removed using the program PLINK with a SNP window of 50 and sliding window of 5 SNPs and we removed 1 SNP from each pair of SNP with r^{2} > 0.30 leaving 97,508 SNPs for the analysis. A PCA was performed on the combined data. All subjects had a call rate greater than 93%. To check for differences between the two datasets, we compared the MAFs between the NECS dataset and Illumina controls. All SNPs had less than a 10% difference in MAFs between the two datasets and 99.9% of the SNPs had less than a 5% difference in MAF. We also compared the PCs using only the NECS dataset with the PCs using the NECS and Illumina datasets. We found that the top 4 PCs which were used for clustering were strongly correlated among the NECS between the two PCAs showing that the addition of the Illumina controls did not bias the results of the PCA. The correlation coefficients (Spearman) were 0.98, -0.89, 0.87 and 0.76 for the 1st-4th PCs, respectively.

#### Principal component analysis

In all applications we used the principal components analysis implemented in the software smartpca [4] to detect population substructure among individuals with genome-wide data.

### Algorithm

The algorithm consists of the following components: 1) selection of informative PCs for the cluster analysis; 2) clustering to discover population substructure; 3) a novel score to select the best number of clusters that satisfy a variety of criteria.

### How Many Informative PCs?

The Tracy-Widom statistic can be used to identify the ancestrally informative principal components. However this statistic is very sensitive to the inclusion of SNPs in linkage disequilibrium (LD) and tends to identify a larger number of PCs [19, 32]. Typically, investigators use the first 10 PCs although this choice is somewhat arbitrary. We identify informative PCs for the algorithm by displaying the results of the PCA in a heatmap and by the use of a scree plot [33] which plots the natural logarithm of the eigenvalues.

We display the top 20 principal components in a heatmap in which the color of each cell (i, j) represents the value of the principal component in column j for the subject in row i, standardized by row. We order the rows using hierarchical clustering so that individuals with similar values for PCs are arranged next to each other. The heatmap highlights the most evident groups and the number of PCs that determine these groups. Because the interpretation of the visual display is subjective, we also use a scree plot of the natural logarithm of the eigenvalues to identify the important PCs. A scree plot graphs the log of the eigenvalue for each PC versus the PC numbers, and the appropriate number of PCs is identified by a "kink" in the plot after which we observe a relatively straight line (Figure 1).

### K-Means Clustering of the Most Informative PCs

and x_{i} denotes the vector of values of the first p principal components for subject i.

### Scoring Index (SI)

To identify the optimal number of clusters, we propose an algorithm and SI which evaluates the clustering accuracy, stability, and between cluster distance. The algorithm performs k-means clustering for each cluster size, k = 2, 3,...K, for M executions and computes the accuracy, stability and between cluster distance for each cluster size and execution. The rationale for incorporating these 3 measures into a scoring index is that the optimal cluster assignment should accurately allocate subjects to their respective cluster, should be stable from execution to execution of k-means and should maximize the distance between subjects allocated to different clusters. These three measures are computed in the observed data and are compared to those expected under random cluster allocation using permutation analysis. We summarize the gain in accuracy, stability and distance into a scoring index which is used to identify the optimal number of clusters. We discuss these steps in detail:

#### Cluster Accuracy

To measure the accuracy of each set of K clusters, we build a linear discriminant model using the cluster assignments from k-means, and then perform leave-one-out cross validation to estimate the accuracy to predict the cluster membership based on the linear discriminant model. If the clustering is accurate, we expect the linear discriminant model to accurately predict an individual's cluster membership. For each m, we compute the accuracy as the proportion of individuals assigned to the same cluster in cross validation as in k-means.

#### Cluster Stability

Since each execution of k-means clustering can produce different cluster assignments, we perform multiple executions (m = 1,..., M) of the clustering algorithm for each k, and measure the stability of the results. The stability will be worse for incorrect group sizes since k-means will be maximizing to a different local maximum each time, and thus low stability suggests that the number of clusters is not optimal. To measure the stability of the cluster assignments, we compute the Rand statistic [34] between the k-means cluster assignments for each number of clusters. The Rand statistic estimates the agreement between two sets of clusters by dividing the number of pairs of individuals in either the same cluster or in different clusters for both sets by the total number of pairs of individuals. Specifically, let *C*_{1}, ..., *C*_{
k
}and *X*_{1}, ...., *X*_{
k
}denote two sets of clusters generated in two executions of k-means for a fixed k. Let S be the number of pairs of subjects in the same cluster in both sets (for example subjects s and s' allocated both to *C*_{
i
}and *X*_{
j
}). Let D denote the number of pairs of subjects in different clusters for both sets (for example s in cluster *C*_{
i
}and *X*_{
i
}and s' in *C*_{
j
}and *X*_{
j
}), and T be the total number of pairs of subjects. Then the Rand statistic is defined as R = (S+D)/T. For each execution of k-means, we randomly choose 1 other execution of the same cluster size and compute the Rand statistic.

#### Distance: Between Cluster Scatter

At each cluster size k and execution m, the algorithm computes the normalized between-cluster distance, to monitor how distinct the clusters are from one another. This measure will always increase monotonically as the number of clusters increases. The between cluster distance provides information about the optimal number of clusters, however it does not necessarily provide a distinct number of clusters and does not measure the accuracy nor the stability of the cluster assignments.

#### Permutation

Accuracy, stability and between cluster distance are all dependent on the number of clusters and on the sample size in each cluster making it inappropriate to directly compare these measures across cluster sizes. For example, as the number of clusters increases, the accuracy decreases simply because it is more difficult to correctly predict an individual's cluster assignment with more available clusters (Figures 2, 4). Distance between clusters, on the other hand, will always increase as the number of clusters increases (Figures 2, 4). Therefore, we generate referent values for each k by randomly permuting the cluster labels generated in each execution of k-means and then compute the accuracy, stability and between cluster distance of the permuted cluster assignment. We then compute the relative gain in the observed accuracy, stability and between cluster distance to the permuted accuracy, stability and between cluster distance and combine the measures into a SI. Note that, since the permuted cluster assignments have the same number of clusters and subjects per cluster as the observed k-means cluster assignments, we are simulating these measures from the appropriate underlying distribution.

#### Scoring Index

*A*

_{ O, m, k }and

*A*

_{ P, m, k }are the observed and permuted accuracy,

*B*

_{ O, m, k }and

*B*

_{ P, m, k }are the observed and permuted between cluster distance and

*S*

_{ O, m, k }and

*S*

_{ P, m, k }are the observed and permuted stability. The SI is computed as:

- 1.
$\frac{\left({A}_{O,m,k}-{A}_{P,m,k}\right)}{\left(1-{A}_{P,m,k}\right)}~beta\left(N{A}_{O,m,k}-N{A}_{P,m,k},N-N{A}_{O,m,k}\right)$, where N is the total sample size and

*NA*_{ O, m, k }and*NA*_{ P, m, k }represent the number of subjects correctly assigned to their cluster for the observed and permuted assignments, respectively, for execution m and cluster size k. - 2.
$\frac{{B}_{O,m,k}-{B}_{P,m,k}}{1-{B}_{P,m,k}}~beta\left(SS{B}_{O,m,k}-SS{B}_{P,m,k},SST-SS{B}_{O,m,k}\right)$, where SST is the total distance between subjects measured by the informative PCs used for clustering and

*SSB*_{ O, m, k }and*SSB*_{ P, m, k }are the sums of squares between clusters for the observed and permuted assignment, respectively, for execution m and cluster size k. - 3.
$\frac{\left({S}_{O,m,k}-{S}_{P,m,k}\right)}{\left(1-{S}_{P,m,k}\right)}~beta\left(N{S}_{O,m,k}-N{S}_{P,m,k},NS-N{S}_{O,m,k}\right)$, where NS represents the total number of pairs of subjects and

*NS*_{ O, m, k }and*NS*_{ P, m, k }are the number of concordant and discordant pairs for the observed and permuted cluster assignments, respectively, for execution m and cluster size k.

*SI*

_{ k }is computed as

and the 95% confidence interval is computed as the 2.5^{th} and 97.5^{th} quantile of a beta(α,β) where $\alpha =\frac{(1-S{I}_{k})S{I}_{k}^{2}}{Var(S{I}_{k})}-S{I}_{k}$, $\beta =\frac{\alpha (1-S{I}_{k})}{S{I}_{k}}$

### Gap Statistic

The gap statistic [26] was computed for the HGDP Africans and the NECS using the gap() function of the SAGx package [35] in R [36]. The function is extremely computer intensive and thus we computed the gap statistic on 5 random cluster assignments, as computed by k-means, out of the total 100 executions for each cluster size. The statistic could not be computed on the HGDP African dataset for cluster sizes larger than 22 because at least 1 cluster contained only 1 subject.

### GWAS Analysis

All GWAS used a logistic regression model with an additive model for the SNP genotype and all SNPs had a call rate > 0.95 and MAF > 0.01. Analyses were performed using the software PLINK.

## Declarations

### Acknowledgements

This work was supported by National Institutes of Health grants: R01 HL087681 and RC2HL101212 (to M.H.S), and K24 AG025727 (to T.T.P).

## Authors’ Affiliations

## References

- Altshuler D, Daly MJ, Lander ES: Genetic mapping in human disease. Science. 2008, 322 (5903): 881-888. 10.1126/science.1156409.PubMed CentralView ArticlePubMedGoogle Scholar
- Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies. Nat Genet. 2004, 36 (5): 512-517. 10.1038/ng1337.View ArticlePubMedGoogle Scholar
- Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A, Groop L, Saetta AA, Korkolopoulou P, Seligsohn U, Waliszewska A, Schirmer C, Ardlie K, Ramos A, Nemesh J, Arbeitman L, Goldstein DB, Reich D, Hirschhorn JN: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008, 4 (1): e236-10.1371/journal.pgen.0030236.PubMed CentralView ArticlePubMedGoogle Scholar
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38 (8): 904-909. 10.1038/ng1847.View ArticlePubMedGoogle Scholar
- Lee AB, Luca D, Klei L, Devlin B, Roeder K: Discovering genetic ancestry using spectral graph theory. Genet Epidemiol. 2010, 34 (1): 51-59.PubMed CentralPubMedGoogle Scholar
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155 (2): 945-959.PubMed CentralPubMedGoogle Scholar
- Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003, 164 (4): 1567-1587.PubMed CentralPubMedGoogle Scholar
- Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55 (4): 997-1004. 10.1111/j.0006-341X.1999.00997.x.View ArticlePubMedGoogle Scholar
- Epstein MP, Allen AS, Satten GA: A simple and improved correction for population stratification in case-control studies. Am J Hum Genet. 2007, 80 (5): 921-930. 10.1086/516842.PubMed CentralView ArticlePubMedGoogle Scholar
- Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM: A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet. 2007, 81 (5): 895-905. 10.1086/521372.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang K: Testing for genetic association in the presence of population stratification in genome-wide association studies. Genet Epidemiol. 2009, 33 (7): 637-645. 10.1002/gepi.20415.View ArticlePubMedGoogle Scholar
- Hinds DA, Stokowski RP, Patil N, Konvicka K, Kershenobich D, Cox DR, Ballinger DG: Matching strategies for genetic association studies in structured populations. Am J Hum Genet. 2004, 74 (2): 317-325. 10.1086/381716.PubMed CentralView ArticlePubMedGoogle Scholar
- Guan W, Liang L, Boehnke M, Abecasis GR: Genotype-based matching to correct for population stratification in large-scale case-control genetic association studies. Genet Epidemiol. 2009, 33 (6): 508-517. 10.1002/gepi.20403.PubMed CentralView ArticlePubMedGoogle Scholar
- Luca D, Ringquist S, Klei L, Lee AB, Gieger C, Wichmann HE, Schreiber S, Krawczak M, Lu Y, Styche A, Devlin B, Roeder K, Trucco M: On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. Am J Hum Genet. 2008, 82 (2): 453-463. 10.1016/j.ajhg.2007.11.003.PubMed CentralView ArticlePubMedGoogle Scholar
- Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM: Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008, 319 (5866): 1100-1104. 10.1126/science.1153717.View ArticlePubMedGoogle Scholar
- Terry DF, Sebastiani P, Andersen SL, Perls TT: Disentangling the roles of disability and morbidity in survival to exceptional old age. Arch Intern Med. 2008, 168 (3): 277-283. 10.1001/archinternmed.2007.75.PubMed CentralView ArticlePubMedGoogle Scholar
- Sebastiani P, Solovieff N, Puca A, Hartley SW, Melista E, Andersen S, Dworkis DA, Wilk JB, Myers RH, Steinberg MH, Montano M, Baldwin CT, Perls TT: Genetic Signatures of Exceptional Longevity in Humans. Science. 2010Google Scholar
- Campbell MC, Tishkoff SA: African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu Rev Genomics Hum Genet. 2008, 9: 403-433. 10.1146/annurev.genom.9.081307.164258.PubMed CentralView ArticlePubMedGoogle Scholar
- Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2 (12): e190-10.1371/journal.pgen.0020190.PubMed CentralView ArticlePubMedGoogle Scholar
- Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver AE, Qi L, Gregersen PK, Seldin MF: Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 2008, 4 (1): e4-10.1371/journal.pgen.0040004.PubMed CentralView ArticlePubMedGoogle Scholar
- Christensen K, Johnson TE, Vaupel JW: The quest for genetic determinants of human longevity: challenges and insights. Nat Rev Genet. 2006, 7 (6): 436-448. 10.1038/nrg1871.PubMed CentralView ArticlePubMedGoogle Scholar
- Aulchenko YS, Ripatti S, Lindqvist I, Boomsma D, Heid IM, Pramstaller PP, Penninx BW, Janssens AC, Wilson JF, Spector T, Martin NG, Pedersen NL, Kyvik KO, Kaprio J, Hofman A, Freimer NB, Jarvelin MR, Gyllensten U, Campbell H, Rudan I, Johansson A, Marroni F, Hayward C, Vitart V, Jonasson I, Pattaro C, Wright A, Hastie N, Pichler I, Hicks AA, Falchi M, Willemsen G, Hottenga JJ, de Geus EJ, Montgomery GW, Whitfield J, Magnusson P, Saharinen J, Perola M, Silander K, Isaacs A, Sijbrands EJ, Uitterlinden AG, Witteman JC, Oostra BA, Elliott P, Ruokonen A, Sabatti C, Gieger C, Meitinger T, Kronenberg F, Doring A, Wichmann HE, Smit JH, McCarthy MI, van Duijn CM, Peltonen L, ENGAGE Consortium: Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nat Genet. 2009, 41 (1): 47-55. 10.1038/ng.269.PubMed CentralView ArticlePubMedGoogle Scholar
- Sabatti C, Service SK, Hartikainen AL, Pouta A, Ripatti S, Brodsky J, Jones CG, Zaitlen NA, Varilo T, Kaakinen M, Sovio U, Ruokonen A, Laitinen J, Jakkula E, Coin L, Hoggart C, Collins A, Turunen H, Gabriel S, Elliot P, McCarthy MI, Daly MJ, Jarvelin MR, Freimer NB, Peltonen L: Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat Genet. 2009, 41 (1): 35-46. 10.1038/ng.271.PubMed CentralView ArticlePubMedGoogle Scholar
- Greene CS, Penrod NM, Williams SM, Moore JH: Failure to replicate a genetic association may provide important clues about genetic architecture. PLoS One. 2009, 4 (6): e5639-10.1371/journal.pone.0005639.PubMed CentralView ArticlePubMedGoogle Scholar
- Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. 2009, New York: Springer, 2View ArticleGoogle Scholar
- Tibshirani R, Walther G, Hastie T: Estimating the Number of Clusters in a Data Set via the Gap Statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology). 2001, 63 (2): 411-423. 10.1111/1467-9868.00293.View ArticleGoogle Scholar
- Zhuang JJ, Zondervan K, Nyberg F, Harbron C, Jawaid A, Cardon LR, Barratt BJ, Morris AP: Optimizing the power of genome-wide association studies by using publicly available reference samples to expand the control group. Genet Epidemiol. 2010, 34 (4): 319-326. 10.1002/gepi.20482.PubMed CentralView ArticlePubMedGoogle Scholar
- Cavalli-Sforza LL, Menozzi P, Piazza A: The history and geography of human genes. 1994, Princeton, N.J.: Princeton University PressGoogle Scholar
- Balding DJ, Nichols RA: A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995, 96 (1-2): 3-12. 10.1007/BF01441146.View ArticlePubMedGoogle Scholar
- Holsinger KE, Weir BS: Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet. 2009, 10 (9): 639-650. 10.1038/nrg2611.View ArticlePubMedGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81 (3): 559-575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
- Peloso GM, Timofeev N, Lunetta KL: Principal-component-based population structure adjustment in the North American Rheumatoid Arthritis Consortium data: impact of single-nucleotide polymorphism set and analysis method. BMC Proc. 2009, 3 (Suppl 7): S108-10.1186/1753-6561-3-s7-s108.PubMed CentralView ArticlePubMedGoogle Scholar
- Jolliffe IT: Principal component analysis. 2002, New York: Springer-Verlag, 2Google Scholar
- Rand WM: Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association. 1971, 66 (336): 846-850. 10.2307/2284239.View ArticleGoogle Scholar
- Broberg P: SAGx: Statistical Analysis of the GeneChip. 2008, R package version 1.14.0Google Scholar
- R: A language and environment for statistical computing. 2008, R Development Core Team, 2.7.0Google Scholar
- Predominant Ethnic Group by Region. [http://www.eupedia.com/europe/maps_of_europe.shtml#ethnicities]

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.