# Discriminant analysis of principal components: a new method for the analysis of genetically structured populations

- Thibaut Jombart
^{1}Email author, - Sébastien Devillard
^{2}and - François Balloux
^{1}Email author

**11**:94

**DOI: **10.1186/1471-2156-11-94

© Jombart et al; licensee BioMed Central Ltd. 2010

**Received: **22 June 2010

**Accepted: **15 October 2010

**Published: **15 October 2010

## Abstract

### Background

The dramatic progress in sequencing technologies offers unprecedented prospects for deciphering the organization of natural populations in space and time. However, the size of the datasets generated also poses some daunting challenges. In particular, Bayesian clustering algorithms based on pre-defined population genetics models such as the STRUCTURE or BAPS software may not be able to cope with this unprecedented amount of data. Thus, there is a need for less computer-intensive approaches. Multivariate analyses seem particularly appealing as they are specifically devoted to extracting information from large datasets. Unfortunately, currently available multivariate methods still lack some essential features needed to study the genetic structure of natural populations.

### Results

We introduce the *Discriminant Analysis of Principal Components* (DAPC), a multivariate method designed to identify and describe clusters of genetically related individuals. When group priors are lacking, DAPC uses sequential K-means and model selection to infer genetic clusters. Our approach allows extracting rich information from genetic data, providing assignment of individuals to groups, a visual assessment of between-population differentiation, and contribution of individual alleles to population structuring. We evaluate the performance of our method using simulated data, which were also analyzed using STRUCTURE as a benchmark. Additionally, we illustrate the method by analyzing microsatellite polymorphism in worldwide human populations and hemagglutinin gene sequence variation in seasonal influenza.

### Conclusions

Analysis of simulated data revealed that our approach performs generally better than STRUCTURE at characterizing population subdivision. The tools implemented in DAPC for the identification of clusters and graphical representation of between-group structures allow to unravel complex population structures. Our approach is also faster than Bayesian clustering algorithms by several orders of magnitude, and may be applicable to a wider range of datasets.

## Background

The study of the genetic structure of biological populations has attracted a growing interest from a wide array of fields, such as population biology, molecular ecology, and medical genetics. One of the most widely applied approaches is the inference of population structuring with Bayesian clustering methods such as STRUCTURE [1, 2] and BAPS [3, 4]. These methods are particularly appealing as they allow for identifying genetic clusters under an explicit population genetics model. The popularity of these approaches leaves no doubt about their usefulness for extracting meaningful information from genetic data.

Unfortunately, the reliance of Bayesian clustering methods on explicit models also comes at a cost. Model-based approaches rely on assumptions such as the type of population subdivision, which are often difficult to verify and can restrict their applicability. Furthermore, estimation of a large number of parameters [5] can require considerable computational time when analyzing large datasets. To take full advantage of the increase in size and complexity of genetic datasets, fast and flexible exploratory tools are equally needed.

Multivariate analyses have been used for decades to extract various types of information from genetic data and have attracted renewed interest in the field [6–12]. In particular, principal component analysis (PCA) [13–15] has recently been suggested as an alternative to Bayesian clustering algorithms [5, 11, 12, 16]. The main asset of PCA is its ability to identify genetic structures in very large datasets within negligible computational time, and the absence of any assumption about the underlying population genetic model.

*a priori*definition of clusters to study population structures. But even then, PCA would not be appropriate to obtain a clear picture of between-population variation (Figure 1). PCA aims to summarize the overall variability among individuals, which includes both the divergence between groups (

*i.e.*, structured genetic variability), and the variation occurring within groups ('random' genetic variability). To assess the relationships between different clusters, an adequate method should focus on between-group variability, while neglecting within-group variation.

This is precisely the rationale of Discriminant Analysis (DA) [17, 18]. This multivariate method defines a model in which genetic variation is partitioned into a between-group and a within-group component, and yields synthetic variables which maximize the first while minimizing the second (Figure 1). In other words, DA attempts to summarize the genetic differentiation between groups, while overlooking within-group variation. The method therefore achieves the best *discrimination* of individuals into pre-defined groups (Figure 1c). Interestingly, this method also allows for a probabilistic assignment of individuals to each group, as in Bayesian clustering methods.

Unfortunately, DA suffers from considerable restrictions which often preclude its application to genetic data. First, the method requires the number of variables (alleles) to be less than the number of observations (individuals). This condition is generally not fulfilled in Single Nucleotide Polymorphism (SNP) or re-sequencing datasets. Second, it is hampered by correlations between variables, which necessarily occur in allele frequencies due to the constant-row sum constraint [i.e., compositional data, [19, 20]]. Moreover, the violation of the assumption of uncorrelated variables will be even more blatant in the presence of linkage disequilibrium. Therefore, the application of DA to genetic data has remained very limited so far [8, 21].

In this paper, we introduce the Discriminant Analysis of Principal Components (DAPC), a new methodological approach which retains all assets of DA without being burdened by its limitations. DAPC relies on data transformation using PCA as a prior step to DA, which ensures that variables submitted to DA are perfectly uncorrelated, and that their number is less than that of analysed individuals. Without implying a necessary loss of genetic information, this transformation allows DA to be applied to any genetic data. Like PCA, our approach can be applied to very large datasets, such as hundreds of thousands of SNPs typed for thousands of individuals. Moreover, the contributions of alleles to the structures identified by DAPC can allow for identifying regions of the genome driving genetic divergence among groups. Along with the assignment of individuals to clusters, our method provides a visual assessment of between-population genetic structures, permitting to infer complex patterns such as hierarchical clustering or clines.

Whenever group priors are unknown, we use K-means clustering of principal components to identify groups of individuals [5, 16]. K-means relies on the same model as DA to partition genetic variation into a between-group and a within-group component, and attempts to find groups that minimize the latter. Like in STRUCTURE, we run K-means clustering with different numbers of clusters, each of which gives rise to a statistical model and an associated likelihood. As advocated in previous studies [5, 22], we use Bayesian Information Criterion (BIC) to assess the best supported model, and therefore the number and nature of clusters.

We apply DAPC to both simulated and empirical datasets. We use simulations to assess the ability of our approach to infer the right genetic clusters, and compare our results to those obtained with STRUCTURE [1, 2]. Then, we illustrate the type of information that can be gathered by DAPC by applying the method to two empirical datasets. First, we analyse worldwide structuring of native human populations using the HGDP-CEPH cell line panel typed for microsatellite markers [23–25], enriched with additional populations of Native Americans [26]. Second, we use DAPC to study the temporal variation in seasonal influenza (H3N2) hemagglutinin (HA) segments from viruses collected in the northern hemisphere from 2001 to 2007. Both datasets, as well as the implementation of our methodology are available in the *adegenet* package [6] for the free software R [27].

## Results

### Analysis of simulated datasets

Parameters of simulations.

Island model | Hierarchical island model | Hierarchical stepping stone | Stepping stone | |
---|---|---|---|---|

Number of populations | 6 | 6 (3, 2, 1) | 12 (6, 6) | 24 |

Population size | 200 | 200 | 100 | 50 |

Sample size | 100 | 100 | 50 | 25 |

Migration rate | 0.005 | 0.05/0.005 | 0.01/0.001 | 0.02 |

Mutation rate | 10 | 10 | 10 | 10 |

Number of loci | 30 | 30 | 30 | 30 |

Possible allelic states | 50 | 50 | 50 | 50 |

Summary statistics of the simulations.

Median | Quantile 5% | Quantile 95% | |
---|---|---|---|

| |||

| 0.1 | 0.07 | 0.13 |

| 0.42 | 0.36 | 0.46 |

number of alleles/locus | 5 | 3 | 8 |

| |||

| 0.05 | 0.03 | 0.08 |

| 0.41 | 0.33 | 0.49 |

number of alleles/locus | 5 | 2 | 8 |

| |||

| 0.37 | 0.09 | 0.56 |

| 0.3 | 0.2 | 0.38 |

number of alleles/locus | 6 | 3 | 9 |

| |||

| 0.42 | 0.12 | 0.64 |

| 0.27 | 0.13 | 0.36 |

number of alleles/locus | 6 | 4 | 9 |

Results of the analyses of simulated data.

Island Model | Hierarchical island model | Hierarchical stepping stone | Stepping stone | |
---|---|---|---|---|

Number of populations (true | 6 | 6 | 12 | 24 |

| 6 ([6,6]) | 6 ([6,8]) | 11 (8,12) | 17.5 ([13,21]) |

| 6 ([2,7]) | 3 ([2,6]) | 2 ([2,2]) | 2 ([2,5]) |

% of correct assignment by DAPC | 98.2% ([96.3%,99%]) | 87.5% ([73.9%,91.2%]) | 89.7% ([87.9%,97.2%]) | 83.9% ([80%,88.7%]) |

% of correct assignment by STRUCTURE | 98.6% ([98%,99.2%]) | 93.1% ([89.2%,95.5%]) | NA | NA |

*adegenet*package [6] for the R software [27]. The number of clusters was assessed using the function

*find.clusters*, which runs successive K-means clustering with increasing number of clusters (

*k*). We covered a wide range of possible clusters from one to 2

*K*, where

*K*was the actual number of demes in the simulations. Figure 3 illustrates the procedure for selecting the 'optimal' number of clusters. This choice was made on the basis of the lowest associated BIC (Figure 3a-b). In cases where the optimal number of clusters was ambiguous,

*k*was increased as long as it resulted in a noticeable improvement in BIC (Figure 3c-d). Overall, this procedure recovered well the actual number of populations (Table 3). The number of clusters was always better inferred in island-based models (Figure 3a-b) than in more continuous population genetics models (Figure 3c-d), where clusters tend to dissolve into more clinal patterns of genetic differentiation. But even when the actual

*K*was not identified, the inferred number of clusters generally remained relatively close to the true value (Table 3). Interestingly, the estimation of

*K*by our method was markedly better than that achieved by STRUCTURE for all the studied models, including the classical island model for which our approach always inferred the exact number of clusters (Table 3). This result is consistent with previous studies which used K-means on principal components [5, 16].

Then, DAPC was performed (function *dapc*) using clusters defined by K-means where we specified the actual number of clusters (*i.e.*, *k* = *K*). In all analyses, 50 principal components of PCA were retained in the data transformation step. The comparison of the final assignments of individuals to groups to the actual group memberships revealed that DAPC performed remarkably well. Assignment success varied depending on the population genetics model assumed in the simulations but remained high for all simulated datasets considered (Table 3). The frequency of correct assignments was highest in the island models, where DAPC performed essentially as well as STRUCTURE (Table 3). However, even in the stepping stone models (Figure 2c-d), successful assignment rates remained very satisfying, with correct assignment rates ranging from 80% to 97% depending on the replicate.

### Analysis of empirical data

#### Human microsatellite data

DAPC was applied to the microsatellite genotypes from the Human Genome Diversity Project-Centre d'Etude du Polymorphisme Humain (HGDP-CEPH) [23–25], an extensive dataset of native human populations distributed worldwide. This dataset was extended by adding genotypes from 24 Native American and Siberian populations [26]. The resulting dataset comprises 1350 individuals from 79 populations, genotyped for 678 microsatellite markers (8170 alleles).

#### Seasonal influenza (H3N2) hemagglutinin data

To illustrate the versatility of our approach, we selected a radically different dataset for the second example. We analysed the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter [30–32]. These genetic changes also force influenza vaccines to be updated on a yearly basis. Influenza A virus genome is organized in eight segments analogous to chromosomes in eukaryotes. While exchanges of segments (genomic reassortment) occasionally happen during the replication of the virus in multiply infected hosts [30, 33, 34], we are unaware of evidences for within-segment recombination.

Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next. For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank [35]. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). DNA sequences and meta-information were retrieved from Genbank using *ad-hoc* R scripts. Alignments were obtained for a stretch of 990 bases using ClustalW [36] and further refined manually using Jalview [37]. Aligned sequences were then imported in R using the *ape* package [38], and SNPs were extracted from the sequences using *adegenet*[6]. The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1^{st} of July 2005 and the 30^{th} of June 2006.

## Discussion and Conclusions

In this paper, we introduced a new multivariate method, the Discriminant Analysis of Principal Components (DAPC), for the analysis of the genetic structure of populations. This approach can be used to define clusters of individuals and to unravel possibly complex structures existing among clusters, such as hierarchical clustering and clinal differentiation, while being orders of magnitude faster than existing Bayesian clustering methods. For simulated data, DAPC proved as accurate as STRUCTURE in detecting hidden population clusters within simple island population models. Moreover, DAPC was more suited to unravel the underlying structuring in more complex population genetics models. Another major advantage of DAPC over Bayesian clustering approaches is the possibility to generate a graphical representation of the relatedness between the inferred clusters. Applied to two highly contrasted empirical datasets, our method was able to identify non-trivial and meaningful biological patterns.

One of the main assets of DAPC is its great versatility. Indeed, DAPC does not rely on a particular population genetics model, and is thus free of assumptions about Hardy-Weinberg equilibrium or linkage disequilibrium. As such it should be useful for a variety of organisms, irrespective of their ploidy and rate of genetic recombination. Also, contrary to Bayesian clustering methods, DAPC can be applied to very large datasets within negligible computational time (all analyses presented in this paper took less than minute to run on a standard computer). Moreover, the method is not restrained to genetic data, and can be applied to any quantitative data such as morphometric data. This feature is particularly interesting as it allows for partialling out the effects of undesirable covariates, such as different sequencing protocols, or trivial genetic structures that could obscure lesser, more interesting patterns. This can be achieved by analyzing the residuals of a preliminary model including the covariates as predictors instead of the raw data.

A major concern pertaining to all clustering approaches is the risk of inferring artefactual discrete groups in populations where genetic diversity is distributed continuously. Such spurious clusters are particularly likely to arise under spatially heterogeneous sampling of populations [39, 40]. DAPC is not immune to this bias, and may indeed erroneously identify clusters within a cline. However, scatterplots provided by the method allow for a graphical assessment of the genetic structures between clusters (Figures 5 and 8), and provide remarkable insights as to how the genetic variability is organized. For instance, in our simulations based on stepping stone models (Figure 2c-d), DAPC clearly revealed the existence of clines (Figure 4c-d). Therefore, our approach is by no means restricted to the study of populations organised in discrete groups, and should be able to reveal more complex genetic patterns.

We chose to analyse two contrasted datasets to illustrate the versatility our approach. The HGDP-CEPH dataset has been repeatedly analysed using a variety of methods [29, 39, 41–47]. The DAPC results support previous evidence for discontinuities above and beyond the global clinal pattern in the apportionment of human genetic variation [29, 43, 48]. The subdivision inferred by DAPC is strikingly similar to the four clusters identified by the STRUCTURE software [25, 26, 29]. Note however, that the existence of large-scale clusters is not incompatible with a clinal distribution of genetic diversity and/or smaller-scale subdivisions [41, 43]. These results illustrate that DAPC can be used as an efficient genetic clustering tool.

In contrast, the seasonal influenza analysis highlights features that go beyond simple genetic clustering. The DAPC scatterplot reveals that the virus is genetically structured into clusters which are arranged along a temporal cline, and shows a marked discontinuity between two successive years. Examination of allele loadings further reveals that this abrupt change is due to the apparition of new alleles in the global population, one of which induced a change in the amino-acid sequence, and may have therefore been subject to natural selection.

Although DAPC is a promising tool for the analysis of genetic data, further methodological developments should be considered to improve our approach. K-means has proved very efficient here as in previous studies for identifying genetic clusters [5], and is moreover consistent with the variance partition model used in Discriminant Analysis. However, this algorithm uses a very simple measure of group differentiation, and might struggle to identify the correct clusters in the most complex situations [16]. Would that be the case, useful alternatives to K-means could be found in more elaborated clustering algorithms [49]. Another point of interest relates to the selection of the number of principal components used in the prior dimension-reduction step. So far, this procedure is largely *ad hoc*, and relies on retaining most (more than 80%) of the genetic variance. Objective criteria would be useful to achieve this task. Unfortunately, there is no consensus on the best strategy for selecting interpretable principal components in PCA [50]. In the context of DAPC, we will have to evaluate a trade-off between the power of discrimination and the stability of assignments. Retaining more principal components provides more power for unravelling genetics structures, but increases the risks of obtaining *ad hoc* combinations of alleles which would discriminate perfectly the sampled individuals, whilst performing poorly on newly sampled individuals [51]. This issue could be addressed using repeated cross-validation, so that each individual would be assigned to a cluster based on a model calibrated using other individuals.

Irrespective of these methodological adjustments, we can see applications of DAPC beyond the mere study of the genetic structure of populations. One field where the method may be particularly relevant is association studies. In this context, population structuring ('*population stratification*') creates spurious correlations between genotypes and phenotypes. To circumvent this issue, Price et al. [12] proposed to partial out population structures by regressing data onto the first principal components of a PCA. But as explained in the introduction, PCA focuses on the overall variability, which includes variation between and within populations. In this case it would be preferable to remove only between-population structures from the data. Indeed, regression onto the first principal components of a PCA is likely to remove relevant within-population variation, thereby resulting in a lack of power for detecting significant associations. In contrast, DAPC yields principal components which are meant to reflect between-population variability only. Regressing data onto these synthetic variables would therefore remove the effects of population stratification, while preserving relevant variability. Note that one could achieve the same result by regressing data onto the groups identified by our approach.

Association studies aim at identifying genetic features that differ between two or more groups of individuals. In other words, the aim is to identify the alleles that best discriminate a set of pre-defined clusters. DAPC seems perfectly adapted to this task, as it finds linear combinations of alleles (the discriminant functions) which best separate the clusters. Alleles with the largest contributions to this discrimination are therefore those which are the most markedly different across groups, which could represent cases and controls. A simple plot of allele contributions (Figure 9) could therefore be used for a graphical assessment of alleles of major interest. An additional reason why DAPC may be well suited for this purpose is the ease with which one can control for covariates, such as age or sex.

To conclude, DAPC appears as a fast, powerful and flexible tool to unravel the makeup of genetically structured populations. However, we have no doubt that the application of this method goes way beyond the illustrations provided in this paper. We hope that its implementation in the free software R [27], which hosts an ever increasing number of tools for population genetics and phylogenetics [38, 52–54] will open new and exciting perspectives for the statistical analysis of genetic data.

## Methods

### Measuring between-group differentiation

Discriminant Analysis (DA), DAPC, and K-means clustering all rely on the same statistical model to quantify between-group differentiation, which is in fact a classical ANOVA model. Below, we introduce this general model using concepts and notations further used in the specific presentation of DAPC and K-means clustering.

**y**∈ ℝ

^{ n }be the vector of a centred variable with

*n*observations (

*y*

_{1},...,

*y*

_{ n }) distributed into

*g*groups, and

**D**be the diagonal matrix containing uniform weights for the observations (

*i.e.*, all diagonal entries are 1/

*n*, while off-diagonal entries are 0). We denote

**H = [**

*h*

_{ ij }

**]**the

*n × g*matrix containing dummy vectors coding group membership, so that

*h*

_{ ij }= 1 if observation

*i*belongs to group

*j*, and

*h*

_{ ij }= 0 otherwise. We define

**P = H(H**

^{ T }

**DH)**

^{ -1 }

**H**

^{ T }

**D**as the projector onto the dummy vectors of

**H**, which can be used to replace each observation in

*y*

_{ i }by the mean value of the group to which

*i*belongs, ${\stackrel{\wedge}{y}}_{i}$. The ANOVA model relies on the decomposition of

**y**:

**I**is the identity matrix of dimension

*n*, $\stackrel{\wedge}{\mathbf{y}}$ is the vector of predictions, and $(\mathbf{y}-\stackrel{\wedge}{\mathbf{y}})$ is the vector of residuals. Since

**y**is centred, the vectors $\stackrel{\wedge}{\mathbf{y}}$ and $(\mathbf{y}-\stackrel{\wedge}{\mathbf{y}})$ are also centred, and their squared norms (${\Vert \mathbf{y}\Vert}_{\mathbf{D}}^{2},{\Vert \stackrel{\wedge}{\mathbf{y}}\Vert}_{\mathbf{D}}^{2}$, and ${\Vert \mathbf{y}-\stackrel{\wedge}{\mathbf{y}}\Vert}_{\mathbf{D}}^{2}$) equate their variances. Moreover, the Pythagorean theorem ensures that the total variance ($\mathrm{var}(\mathbf{y})={\Vert \mathbf{y}\Vert}_{\mathbf{D}}^{2}$) can be decomposed as:

**y**, we use the ratio of between-group and within-group variances, also known as the F statistic:

*correlation ratio*of

**y**, defined as:

In fact, both quantities can be used as a measure of group separation in DA and DAPC, and would yield identical results (discriminant functions) up to a constant. In the remaining, we shall refer to the F statistic only.

### Discriminant Analysis of Principal Components

**X**be a

*n × p*genetic data matrix with

*n*individuals in rows and

*p*relative frequencies of alleles in columns. For example, in the case of a locus with three alleles (A

_{1}, A

_{2}, A

_{3}), a homozygote genotype A

_{1}/A

_{1}is coded as [1, 0, 0], while a heterozygote A

_{2}/A

_{3}is coded as [0, 0.5, 0.5]. We denote

**X**

^{ j }the

*j*

^{th}allele-column of

**X**. Missing data are replaced with the mean frequency of the corresponding allele, which avoids adding artefactual between-group differentiation. Without loss of generality, we assume that each column of

**X**is centred to mean zero. Classical (linear) discriminant analysis seeks linear combinations of alleles with the form:

(**v** = [*v*_{1}...*v*_{
p
}]^{
T
}being a vector of *p* alleles loadings, known as '*discriminant coefficients*'), showing as well as possible the separation between groups as measured by the F statistic (Equation 3). That is, the aim of DA is to choose **v** so that **F(Xv)** is maximum.

*principal components*, which in the case of the discriminant analysis are also called

*discriminant functions*. Discriminant functions are found by the eigenanalysis of the

**D**-symmetric matrix [51]:

**P**is the previously defined projector onto the dummy vectors of

**H**, and

**W**is the matrix of covariances within groups, computed as:

This solution requires **W** to be invertible, which is not the case when the number of alleles *p* is greater than the number of individuals *n*. Moreover, this inverse is numerically unstable ('ill-conditioned') whenever variables are correlated, which is always the case in allele frequencies and can be worsened by the presence of linkage disequilibrium.

**X**, we first compute the principal components of PCA,

**XU**, verifying:

where **U** is a *p* × *r* matrix of eigenvectors (in columns) of **X**^{
T
}**DX**, and **Λ** the diagonal matrix of corresponding non-null eigenvalues. Note that when the number of alleles (*p*) is larger than the number of individuals (*n*), we can alternatively proceed to the eigenanalysis of **XX**^{
T
}**D** to obtain **U** and **Λ**[55], which can save considerable computational time. By definition, the number of principal components (*r*) cannot exceed the number of individuals or alleles (*r* ≤ min(*n*, *p*)), which solves the issue relating to the number of variables used in DA. Moreover, principal components are, by construction, uncorrelated, which solves the other issue pertaining to the presence of collinearity among allele frequencies.

**X**with

**XU**into Equation 6, the solution of DAPC is given by the eigenanalysis of the

**D**-symmetric matrix:

The first obtained eigenvector **v** maximizes *b*(**XUv**) under the constraint that *w*(**XUv)** = 1, which amounts to maximizing the F-statistic of **XUv**. This maximum is attained for the eigenvalue *γ* associated to **v** (*i.e.*, **F(XUv)** = *γ*). In other words, the loadings stored in the vector **v** can be used to compute the linear combinations of principal components of PCA (**XU**) which best discriminate the populations in the sense of the F-statistic.

**XU**)

**v**) can also be interpreted as linear combinations of alleles (

**X**(

**Uv**)), in which the allele loadings are the entries of the vector

**Uv**. This has the advantage of allowing one to quantify the contribution of a given allele to a particular structure. Denoting

*z*

_{ j }the loading of the

*j*

^{th}allele (

*j*= 1,...,

*p*) for the discriminant function

**XUv**, the contribution of this allele can be computed as:

### Prior clustering using K-means

**XU**) as in [5, 16], this model can be written:

**X**) = tr(Λ),

*B*(

**XU**) = tr(

**U**

^{ T }

**X**

^{ T }

**P**

^{ T }

**DPXU**), and

*W*

**(XU)**= tr(

**U**

^{ T }

**WU**). The Bayesian Information Criterion (BIC) used to choose the best clustering model is then defined as:

where *W* **(X)** is the residual variance (*i.e.*, variance within groups, Equation 2) and *g* is the number of groups. This criterion quantifies the lack of fit of the model, while penalising the number of clusters used. Note that here, *g* is used as an *ad hoc* way of avoiding overfitting, and does not estimate the parametric dimensionality of the model as in the original formulation of BIC [56]. Several K-means can be run separately with different numbers of groups, and the best runs can be inferred from the decrease of BIC. In simulated data, BIC proved more efficient for identifying the correct number of clusters than other criteria such as Akaike Information Criterion (AIC) or the adjusted R^{2} (results not shown). This result is consistent with previous findings which advocated the use of BIC for selecting the best number of groups in K-means clustering of genetic data [5].

### Clustering analyses using STRUCTURE

We used STRUCTURE [1, 2] as a benchmark for the performance of DAPC. We analysed all simulated datasets with STRUCTURE v2.1, using the admixture model with correlated allele frequencies to determine the optimal number of genetic clusters and to assign individuals to groups. Computations were performed on the computer resources of the Computational Biology Service Unit at Cornell University (http://cbsuapps.tc.cornell.edu/). For each run, results were based on a Markov Chain Monte Carlo (MCMC) of 100,000 steps, of which the first 20,000 were discarded as burn-in. Analyses were ran with numbers of clusters (*k*) ranging from 1 to 8 for the island and hierarchical island models (Figure 2a-b), from 1 to 15 for the hierarchical stepping stone (Figure 2c), and from 1 to 30 for the stepping stone (Figure 2d). Ten runs were performed for each *k* value. We employed the approach of Evanno *et al.*[57] to assess the optimal number of clusters. In order to assess assignment success, STRUCTURE was run by enforcing *k* to its true value. Individuals were assigned to clusters using CLUMPP 1.1.2 [58], which allows to account for the variability in individual membership probabilities across the different runs. To obtain results comparable to DAPC, individuals were assigned to the cluster to which they had the highest probability to belong.

### Implementation and examples

The methodological approach presented in the paper is implemented in the *adegenet* package [6] for the R software [27]. The function *find.clusters* runs successive K-means for a range of *k* values, and computes the BIC of the corresponding models. The basic K-means procedure is implemented by the function *kmeans* in the *stats* package [27]. DAPC is implemented as the function *dapc*, and relies on procedures from *ade4*[55, 59, 60] and *MASS*[61] to perform PCA (*dudi.pca*) and DA (*lda*). Both *find.clusters* and *dapc* can be used with any quantitative data, and have specific implementations for genetic data. The analysis of the four simulated datasets presented in Figures 4 and 5 can be reproduced by executing the example of the dataset *dapcIllus*. Similarly, analyses of the extended HGDP-CEPH and of the seasonal influenza (H3N2) data can be reproduced by executing the example of the datasets *eHGDP* and *H3N2,* respectively. Documentation and support can be found at the *adegenet* website (http://adegenet.r-forge.r-project.org/).

## Authors' informations

TJ is a post-doctoral research associate in biometry at the Imperial College London, UK. His main focus is on developing statistical tools for analysing genetic data, with an emphasis on multivariate methods. FB is an associate professor in population genetics at the Imperial College London, UK. His work ranges from theoretical to applied population genetics, with an emphasis on Human populations and their pathogens. SD is an assistant professor in evolutionary biology and biostatistics at the Université Claude Bernard - Lyon 1, France. His interests range from empirical studies to theoretical works in population biology, ecology, and evolution.

## Declarations

### Acknowledgements

We are grateful to our colleagues who generated the HGDP-CEPH dataset and those who made H3N2 hemagglutinin sequences publicly available on Genbank. We thank Dave Hunt, Daniel Falush, Jukka Corander, and two anonymous reviewers for providing useful comments on a previous version of the manuscript. We thank R-Forge (https://r-forge.r-project.org/) for providing a great support for the development of R packages, and the Computational Biology Service Unit at Cornell University for providing computing resources (http://cbsuapps.tc.cornell.edu/) to run the STRUCTURE analyses. We finally acknowledge financial support from the BBSRC and the MRC.

## Authors’ Affiliations

## References

- Falush D, Stephens M, Pritchard J: Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics. 2003, 164: 1567-1587.PubMed CentralPubMedGoogle Scholar
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.PubMed CentralPubMedGoogle Scholar
- Corander J, Waldmann P, Sillanpaa MJ: Bayesian analysis of genetic differentiation between populations. Genetics. 2003, 163 (1): 367-374.PubMed CentralPubMedGoogle Scholar
- Tang J, Hanage WP, Fraser C, Corander J: Identifying Currents in the Gene Pool for Bacterial Populations Using an Integrative Approach. PLoS Comput Biol. 2009, 5 (8): e1000455-10.1371/journal.pcbi.1000455.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee C, Abdool A, Huang CH: PCA-based population structure inference with generic clustering algorithms. BMC Bioinformatics. 2009, 10 (S1): S73-10.1186/1471-2105-10-S1-S73.PubMed CentralView ArticlePubMedGoogle Scholar
- Jombart T: adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics. 2008, 24: 1403-1405. 10.1093/bioinformatics/btn129.View ArticlePubMedGoogle Scholar
- Jombart T, Devillard S, Dufour AB, Pontier D: Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity. 2008, 101 (1): 92-103. 10.1038/hdy.2008.34.View ArticlePubMedGoogle Scholar
- Jombart T, Pontier D, Dufour AB: Genetic markers in the playground of multivariate analysis. Heredity. 2009, 102 (4): 330-341. 10.1038/hdy.2008.130.View ArticlePubMedGoogle Scholar
- McVean G: A Genealogical Interpretation of Principal Components Analysis. PLoS Genet. 2009, 5 (10): e1000686-10.1371/journal.pgen.1000686.PubMed CentralView ArticlePubMedGoogle Scholar
- Novembre J, Stephens M: Interpreting principal component analysis of spatial population genetic variation. Nature Genetics. 2008, 40: 646-649. 10.1038/ng.139.PubMed CentralView ArticlePubMedGoogle Scholar
- Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS genetics. 2006, 2: 2074-2093. 10.1371/journal.pgen.0020190.View ArticleGoogle Scholar
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006, 38: 904-909. 10.1038/ng1847.View ArticlePubMedGoogle Scholar
- Hotelling H: Analysis of a complex of statistical variables into principal components (continued from September issue). The Journal of Educational Psychology. 1933, 24: 498-520. 10.1037/h0070888.View ArticleGoogle Scholar
- Hotelling H: Analysis of a complex of statistical variables into principal components. The Journal of Educational Psychology. 1933, 24: 417-441. 10.1037/h0071325.View ArticleGoogle Scholar
- Pearson K: On lines and planes of closest fit to systems of points in space. Philosophical Magazine. 1901, 2: 559-572.View ArticleGoogle Scholar
- Liu N, Zhao H: A non-parametric approach to population structure inference using multilocus genotypes. Hum Genomics. 2006, 2 (6): 353-364.PubMed CentralView ArticlePubMedGoogle Scholar
- Fisher RA: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics. 1936, 7: 179-188.View ArticleGoogle Scholar
- Lachenbruch PA, Goldstein M: Discriminant analysis. Biometrics. 1979, 35: 69-85. 10.2307/2529937.View ArticleGoogle Scholar
- Aitchison J: The statistical analysis of compositional data. 2003, The Blackburn Press, CaldwellGoogle Scholar
- Reyment RA: The statistical analysis of multivariate serological frequency data. Bulletin of Mathematical Biology. 2005, 67: 1303-1313. 10.1016/j.bulm.2005.02.002.View ArticlePubMedGoogle Scholar
- Beharav A, Nevo E: Predictive validity of discriminant analysis for genetic data. Genetica. 2003, 119: 259-267. 10.1023/B:GENE.0000003666.33328.22.View ArticlePubMedGoogle Scholar
- Fraley C, Raftery AE: How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal. 1998, 41: 578-588. 10.1093/comjnl/41.8.578.View ArticleGoogle Scholar
- Cann HM, de Toma C, Cazes L, Legrand MF, Morel V, Piouffre L, Bodmer J, Bodmer WF, Bonne-Tamir B, Cambon-Thomsen A, et al: A human genome diversity cell line panel. Science. 2002, 296 (5566): 261-262. 10.1126/science.296.5566.261b.View ArticlePubMedGoogle Scholar
- Ramachandran S, Deshpande O, Roseman CC, Rosenberg NA, Feldman MW, Cavalli-Sforza LL: Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci USA. 2005, 102 (44): 21594-15947. 10.1073/pnas.0507611102.View ArticleGoogle Scholar
- Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW: Genetic structure of human populations. Science. 2002, 298: 2381-2385. 10.1126/science.1078311.View ArticlePubMedGoogle Scholar
- Wang S, Lewis CM, Jakobsson M, Ramachandran S, Ray N, Bedoya G, Rojas W, Parra MV, Molina JA, Gallo C, et al: Genetic Variation and Population Structure in Native Americans. PLoS Genetics. 2007, 3 (11): e185-10.1371/journal.pgen.0030185.PubMed CentralView ArticlePubMedGoogle Scholar
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2009, Vienna, Austria: R Foundation for Statistical ComputingGoogle Scholar
- Balloux F: Easypop (version 1.7): a computer program for population genetics simulations. Journal of Heredity. 2001, 92: 301-302. 10.1093/jhered/92.3.301.View ArticlePubMedGoogle Scholar
- Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, Feldman MW: Clines, clusters, and the effect of study design on the inference of human population structure. PLoS genetics. 2005, 1: 0660-0671. 10.1371/journal.pgen.0010070.View ArticleGoogle Scholar
- Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, Holmes EC: The genomic and epidemiological dynamics of human influenza A virus. Nature. 2008, 453 (7195): 615-U612. 10.1038/nature06945.PubMed CentralView ArticlePubMedGoogle Scholar
- Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, Gregory V, Gust ID, Hampson AW, Hay AJ, Hurt AC, et al: The Global Circulation of Seasonal Influenza A (H3N2) Viruses. Science. 2008, 320 (5874): 340-346. 10.1126/science.1154137.View ArticlePubMedGoogle Scholar
- Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, Osterhaus A, Fouchier RAM: Mapping the antigenic and genetic evolution of influenza virus. Science. 2004, 305 (5682): 371-376. 10.1126/science.1097211.View ArticlePubMedGoogle Scholar
- Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, St George K, Grenfell BT, Salzberg SL, Fraser CM, Lipman DJ, et al: Whole-Genome Analysis of Human Influenza A Virus Reveals Multiple Persistent Lineages and Reassortment among Recent H3N2 Viruses. PLoS Biol. 2005, 3 (9): e300-10.1371/journal.pbio.0030300.PubMed CentralView ArticlePubMedGoogle Scholar
- Young JF, Palese P: Evolution of human influenza A viruses in nature: recombination contributes to genetic variation of H1N1 strains. Proceedings of the National Academy of Sciences of the United States of America. 1979, 76 (12): 6547-6551. 10.1073/pnas.76.12.6547.PubMed CentralView ArticlePubMedGoogle Scholar
- Benson D, Karsch-Mizrachi AI, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Research. 2008, 36: D25-D30. 10.1093/nar/gkm929.PubMed CentralView ArticlePubMedGoogle Scholar
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, et al: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.View ArticlePubMedGoogle Scholar
- Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ: Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009, 25 (9): 1189-1191. 10.1093/bioinformatics/btp033.PubMed CentralView ArticlePubMedGoogle Scholar
- Paradis E, Claude J, Strimmer K: APE: analyses of phylogenetics and evolution in R language. Bioinformatics. 2004, 20: 289-290. 10.1093/bioinformatics/btg412.View ArticlePubMedGoogle Scholar
- Handley LJ, Manica A, Goudet J, Balloux F: Going the distance: human population genetics in a clinal world. Trends Genet. 2007, 23 (9): 432-439. 10.1016/j.tig.2007.07.002.View ArticlePubMedGoogle Scholar
- Serre D, Paabo S: Evidence for gradients of human genetic diversity within and among continents. Genome Res. 2004, 14 (9): 1679-1685. 10.1101/gr.2529604.PubMed CentralView ArticlePubMedGoogle Scholar
- Corander J, Marttinen P, Siren J, Tang J: Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics. 2008, 9: 539-10.1186/1471-2105-9-539.PubMed CentralView ArticlePubMedGoogle Scholar
- Francois O, Ancelet S, Guillot G: Bayesian clustering using hidden Markov random fields in spatial population genetics. Genetics. 2006, 174 (2): 805-816. 10.1534/genetics.106.059923.PubMed CentralView ArticlePubMedGoogle Scholar
- Hunley KL, Healy ME, Long JC: The Global Pattern of Gene Identity Variation Reveals a History of Long-Range Migrations, Bottlenecks, and Local Mate Exchange: Implications for Biological Race. Am J Phys Anthropol. 2009, 139 (1): 35-46. 10.1002/ajpa.20932.View ArticlePubMedGoogle Scholar
- Kittles RA, Weiss KM: Race, ancestry, and genes: Implications for defining disease risk. Annu Rev Genomics Hum Genet. 2003, 4: 33-67. 10.1146/annurev.genom.4.070802.110356.View ArticlePubMedGoogle Scholar
- Manica A, Prugnolle F, Balloux F: Geography is a better determinant of human genetic differentiation than ethnicity. Hum Genet. 2005, 118 (3-4): 366-371. 10.1007/s00439-005-0039-3.PubMed CentralView ArticlePubMedGoogle Scholar
- Prugnolle F, Manica A, Balloux F: Geography predicts neutral genetic diversity of human populations. Curr Biol. 2005, 15 (5): R159-160. 10.1016/j.cub.2005.02.038.PubMed CentralView ArticlePubMedGoogle Scholar
- Romero IG, Manica A, Handley LL, Balloux F: How accurate is the current picture of human genetic variation?. Heredity. 2009, 102: 120-126. 10.1038/hdy.2008.89.View ArticlePubMedGoogle Scholar
- Amos W, Hoffman JI: Evidence that two main bottleneck events shaped modern human genetic diversity. Proc R Soc B-Biol Sci. 2010, 277 (1678): 131-137. 10.1098/rspb.2009.1473.View ArticleGoogle Scholar
- Fraley C, Raftery AE: Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association. 2002, 97: 611-631. 10.1198/016214502760047131.View ArticleGoogle Scholar
- Peres-Neto PR, Jackson DA, Somers KM: How many principal components? stopping rules for dertermining the number of non-trivial axes revisited. Computational Statistics & Data Analysis. 2005, 49: 974-997.View ArticleGoogle Scholar
- Saporta G: Probabilites, analyse des donnees et statistique. 1990, Paris, TechnipGoogle Scholar
- Jombart T, Balloux F, Dray S: adephylo: new tools for investigating the phylogenetic signal in biological traits. Bioinformatics. 26 (15): 1907-1909. 10.1093/bioinformatics/btq292.
- Kembel SW, Cowan PD, Helmus MR, Cornwell WK, Morlon H, Ackerly DD, Blomberg SP, Webb CO: Picante: R tools for integrating phylogenies and ecology. Bioinformatics. 26 (11): 1463-1464. 10.1093/bioinformatics/btq166.
- Paradis E: PEGAS: an R package for population genetics with an integrated-modular approach. Bioinformatics. 2010, btp696-Google Scholar
- Dray S, Dufour AB: The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software. 2007, 22 (4): 1-20.View ArticleGoogle Scholar
- Schwarz G: Estimating the dimension of a model. The Annals of Statistics. 1978, 6: 461-464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
- Evanno G, Regnaut S, Goudet J: Detecting the number of clusters of individuals using the software structure: a simulation study. Molecular Ecology. 2005, 14: 2611-2620. 10.1111/j.1365-294X.2005.02553.x.View ArticlePubMedGoogle Scholar
- Jakobsson M, Rosenberg NA: CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics. 2007, 23: 1801-1806. 10.1093/bioinformatics/btm233.View ArticlePubMedGoogle Scholar
- Chessel D, Dufour AB, Thioulouse J: The ade4 package-I - One-table methods. R News. 2004, 4: 5-10.Google Scholar
- Dray S, Dufour AB, Chessel D: The ade4 package - II: Two-table and K-table methods. R News. 2007, 7: 47-54.Google Scholar
- Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, New York: Springer, FourthView ArticleGoogle Scholar
- Nei M: Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci USA. 1973, 70 (12): 3321-3323. 10.1073/pnas.70.12.3321.PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.