Skip to main content

On rare variants in principal component analysis of population stratification

Abstract

Background

Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. Principal component analysis (PCA) method is widely applied in the analysis of population structure with common variants. However, it is still unclear about the analysis performance when rare variants are used.

Results

We derive a mathematical expectation of the genetic relationship matrix. Variance and covariance elements of the expected matrix depend explicitly on allele frequencies of the genetic markers used in the PCA analysis. We show that inter-population variance is solely contained in K principal components (PCs) and mostly in the largest K-1 PCs, where K is the number of populations in the samples. We propose FPC, ratio of the inter-population variance to the intra-population variance in the K population informative PCs, and d2, sum of squared distances among populations, as measures of population divergence. We show analytically that when allele frequencies become small, the ratio FPC abates, the population distance d2 decreases, and portion of variance explained by the K PCs diminishes. The results are validated in the analysis of the 1000 Genomes Project data. The ratio FPC is 93.85, population distance d2 is 444.38, and variance explained by the largest five PCs is 17.09% when using with common variants with allele frequencies between 0.4 and 0.5. However, the ratio, distance and percentage decrease to 1.83, 17.83 and 0.74%, respectively, with rare variants of frequencies between 0.0001 and 0.01.

Conclusions

The PCA of population stratification performs worse with rare variants than with common ones. It is necessary to restrict the selection to only the common variants when analyzing population stratification with sequencing data.

Background

Genome-wide association studies (GWAS) [1] have identified a considerable number of sequence variants, such as single nucleotide polymorphisms (SNPs), associated with human diseases or traits. Population stratification—allele frequencies of genetic markers of the studied samples having significant differences owing to systematic ancestry differences—can cause false positive results in case-control as well as cohort studies [2, 3]. Association mapping based on rare variants are much more susceptible to subtle effects of population stratification and therefore, more likely to yield false positive results [4]. From a population genetics point of view, exploring population structure is important for understanding the evolutionary history of populations. Many methods and software are available to study the population stratification, such as the principal component analysis (PCA) implemented in EIGENSOFT [5, 6], the multidimensional scaling analysis in PLINK [7], the clustering analysis in STRUCTURE [8, 9], and fastSTRUCTURE [10]. Recently, controlling population stratification in the association analysis using linear mixed models [11,12,13,14] was also suggested. Based on a large number of common variants whose minor allele frequencies (MAFs) are larger than 5%, the PCA of population structure is widely applied in GWAS.

With the advance of high-throughput sequencing technology, as well as the enormous reduction of the cost, it is capable and affordable in genetic studies to detect additional low-frequency and rare variants (MAF < 1%) [15]. Furthermore, existing sequencing data suggest that the vast majority of rare variants are population-specific. In the 1000 Genomes Project data [16, 17], there are a total of 77 million biallelic SNPs, among which 65 million are rare and 52 million are polymorphic in one of the five continental ancestry populations: East Asian (EAS), South Asian (SAS), African (AFR), European (EUR), American (AMR). It seems that rare variants are more informative in distinguishing population structure than common ones. However, the efficacy of using rare variants in population stratification analysis remains controversial [18,19,20,21].

A number of efforts have been made concerning the use of low-frequency and rare variants in population stratification analysis. Baye et al. illustrated that more fine substructure can be detected using rare variants and suggested that more SNPs are required to account for a similar level of population structure using rare variants compared to common ones [18]. Siu et al. showed that rare variants have a much higher power to identify population substructure than common variants [19]. In contrast, Zhang et al. demonstrated that PCAs based on common and less-frequency SNPs perform better than those based on rare ones in separating European and African samples [20]. The authors further concluded that there is little added value for PCA of population stratification with rare variants only [21]. All existing work was based on analysis of genotype data from the 1000 Genomes Project with known population structure.

In this work, we investigate how rare variants affect PCA of population stratification from a theoretical perspective. We derive mathematical expectation of the genetic relationship matrix (GRM) [22]. The GRM is commonly computed from the observed genotypes and eigen-decomposed in the analysis of population stratification. Elements of the expected genetic relationship matrix (EGRM), however, depend explicitly on the allele frequencies of the markers used. We show that inter-population variance is solely contained in K principal components (PCs) and mostly in the largest K-1 PCs, where K is the number of populations in the sample. We propose FPC, ratio of the inter-population variance to the intra-population variance in the K population informative PCs, and d2, sum of squared distances among populations, as measures of population divergence. We show analytically that when allele frequencies become small, the ratio FPC abates, the population distance d2 decreases, and portion of variance explained by the K PCs diminishes. Therefore, the PCA of population stratification performs worse with rare variants than with common ones. The results are further validated in the analysis of the 1000 Genomes Project data with 2504 individuals from five continental populations.

Methods

Genetic relationship matrix

In the scenario where genotype data of individuals is sampled from K populations, there are Nk individuals in population k and the number of individuals in the total population is N = N1 + N2 +  + NK. We have M SNPs, whose frequencies of their coded alleles in population k are [ fk1, fk2, , fkM]. Let X be the genotype matrix of dimension N × M. The genotypic value X(n,m) is the number of the coded allele of SNP m for individual n, where n = 1, 2, , N and m = 1, 2, , M. Typically, the number of individuals is less than the number of markers, i.e. N < M. We assume that all SNPs are under the Hardy-Weinberg equilibrium in each population. The GRM can be calculated as

$$ \boldsymbol{Z}=\frac{1}{M}\boldsymbol{Y}{\boldsymbol{Y}}^{\mathrm{T}}, $$
(1)

where each entry of Y is a normalized version of the coded genotype in X, i.e.

$$ \boldsymbol{Y}\left(n,m\right)=\frac{\boldsymbol{X}\left(n,m\right)-{\mu}_m}{\sigma_m} $$
(2)

for n = 1, 2, , N and m = 1, 2, , M. Here, μm and σm denote the genotypic mean and standard deviation of SNP m in the total population, respectively. It can be shown that μm and σm relate to the population structure and allele frequencies as follows (Supplemental Text S1)

$$ {\mu}_m=2\frac{\sum_{k=1}^K{N}_k{f}_{km}}{N}, $$
(3)
$$ {\sigma}_m^2=2\frac{\sum_{k=1}^K{N}_k{f}_{km}\left(1-{f}_{km}\right)}{N}+4\frac{\sum_{k=1}^K{\sum}_{{}_{k\ne l}{}^{l=1}}^K{N}_k{N}_l}{N^2}{\left({f}_{km}-{f}_{lm}\right)}^2. $$
(4)

The coded-allele frequency of SNP m in the total population can be found as fm = μm/2, where m = 1, 2, , M. The GRM is of dimension N × N, whose diagonal elements are genotypic variance of individuals and off-diagonal elements are genotypic covariance between two individuals. It should be noted that genotypes follow mixed binomial distributions, and elements of Z are sample variances and covariance computed from the genotype data. The PCA analysis of population stratification is based on the eigen-analysis of the observed GRM Z.

In practice, μm and σm are unknown, and therefore sample mean \( {\hat{\mu}}_m \) and sample standard deviation \( {\hat{\sigma}}_m \) or some other quantities similar are used instead. Usually, \( {\hat{\mu}}_m=2{\hat{f}}_m \) is used for the centralization in (2). In EIGENSOFT, \( \sqrt{{\hat{f}}_m\left(1-{\hat{f}}_m\right)} \) is adopted for the normalization in (2), while \( \sqrt{2{\hat{f}}_m\left(1-{\hat{f}}_m\right)} \) is employed in GCTA [22]. Different estimates of the allele frequency fm were suggested as well [5, 6]. In the following theoretical derivations, we assume that μm and σm are known for the sake of simplicity. This will bring about clear connections between variance and covariance elements of the EGRM and allele frequencies of the SNPs used in the analysis. The connections further provide clues and insights for understanding the effect of rare variants on the PCA of population stratification.

Expected genetic relationship matrix

We assume that all individuals are unrelated. When the number of markers M goes large, the sample variance and covariance elements in the GRM will converge to their mathematical expectations in probability due to the law of large numbers. We denote the EGRM as Z, which is the expectation of the GRM Z. Without loss of generality, we assume that individuals are ordered according to their population memberships. As such, the first N1 rows and columns correspond to individuals from population 1, the next N2 rows and columns are from population 2, and so on. Thus, the EGRM can be partitioned into a block matrix consisting of K × K submatrices

$$ Z=\left(\begin{array}{cccc}{Z}_{11}& {Z}_{12}& \dots & {Z}_{1K}\\ {}{Z}_{12}^{\mathrm{T}}& {Z}_{22}& \dots & {Z}_{2K}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{Z}_{1K}^{\mathrm{T}}& {Z}_{2K}^{\mathrm{T}}& \dots & {Z}_{KK}\end{array}\right). $$
(5)

Diagonal sub-matrices of the EGRM Z have the following structure

$$ {Z}_{kk}=\left(\begin{array}{cccc}{z}^k& {z}^{kk}& \dots & {z}^{kk}\\ {}{z}^{kk}& {z}^k& \dots & {z}^{kk}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{z}^{kk}& {z}^{kk}& \dots & {z}^k\end{array}\right),k=1,2,...,K. $$
(6)

Here, diagonal elements of the submatrix Zkk are of the mathematical form

$$ {z}^k=\frac{1}{M}\sum \limits_{m=1}^M\frac{2{f}_{km}\left(1-{f}_{km}\right)+{\left(2{f}_{km}-{\mu}_m\right)}^2}{\sigma_m^2} $$
(7)

which is the genotypic variance of individuals from population k. All off-diagonal elements share the form

$$ {z}^{kk}=\frac{1}{M}\sum \limits_{m=1}^M\frac{{\left(2{f}_{km}-{\mu}_m\right)}^2}{\sigma_m^2} $$
(8)

which is the genotypic covariance between two individuals from population k.

The off-diagonal sub-matrices of the EGRM Z are structured as follows

$$ {Z}_{kl}=\left(\begin{array}{cccc}{z}^{kl}& {z}^{kl}& \dots & {z}^{kl}\\ {}{z}^{kl}& {z}^{kl}& \dots & {z}^{kl}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{z}^{kl}& {z}^{kl}& \dots & {z}^{kl}\end{array}\right),k\ne l. $$
(9)

Elements of Zkl share the value

$$ {z}^{kl}=\frac{1}{M}\sum \limits_{m=1}^M\frac{\left(2{f}_{km}-{\mu}_m\right)\left(2{f}_{lm}-{\mu}_m\right)}{\sigma_m^2} $$
(10)

which is the genotypic covariance between one individual from population k and one from population l. Details of the derivations are presented in Supplemental Text S2.

The EGRM Z, the mathematical expectation of GRM Z, depends only on the population sizes N1, N2, , NK and allele frequencies of the M SNPs in K populations [ fk1, fk2, , fkM], k = 1, 2, , K. Here, we treat the SNP number M and the allele frequencies as fixed numbers. A theoretical formulation of the PCA that considers genotypic values as a random vector and allele frequencies in different populations being random was proposed in Ma and Amos, 2010 [23]. Based on different assumptions, a genotypic variance-covariance matrix of the same structure was attained; nevertheless, elements of the EGRM are different from those of the variance-covariance matrix in [23].

Rare variants on the eigenvalues

Carrying out eigen-decomposition on the EGRM, it can be shown that there are Nk − 1 eigenvalues of value zk − zkk, for k = 1, 2, , K. Here,

$$ {z}^k-{z}^{kk}=\frac{1}{M}{\sum}_{m=1}^M\frac{2{f}_{km}\left(1-{f}_{km}\right)}{\sigma_m^2}. $$

The sum of the N − K eigenvalues is

$$ {\sum}_{k=1}^K\left({N}_k-1\right)\left({z}^k-{z}^{kk}\right)=\frac{1}{M}{\sum}_{m=1}^M{\sum}_{k=1}^K\frac{2\left({N}_k-1\right){f}_{km}\left(1-{f}_{km}\right)}{\sigma_m^2} $$
(11)

Clearly, variations in the N − K PCs are entirely intra-population variance of the K populations. The sum of the other K eigenvalues is

$$ \sum \limits_{k=1}^K{\lambda}_k={\sum}_{k=1}^K{N}_k{z}^{kk}+{\sum}_{k=1}^K\left({z}^k-{z}^{kk}\right)={\sigma}_{\mathrm{B}}^2+{\sigma}_{\mathrm{W}}^2, $$
(12)

where

$$ {\sigma}_{\mathrm{B}}^2={\sum}_{k=1}^K{N}_k{z}^{kk}=\frac{1}{M}{\sum}_{m=1}^M{\sum}_{k=1}^K\frac{N_k{\left(2{f}_{km}-{\mu}_m\right)}^2}{\sigma_m^2} $$

represents the inter-population variance component and

$$ {\sigma}_{\mathrm{W}}^2={\sum}_{k=1}^K\left({z}^k-{z}^{kk}\right)=\frac{1}{M}{\sum}_{m=1}^M{\sum}_{k=1}^K\frac{2{f}_{km}\left(1-{f}_{km}\right)}{\sigma_m^2} $$

stands for the intra-population variance component. Here, the intra-population covariance zkk of the EGRM factor in the K PCs as the inter-population variance after the eigen-decomposition. Note that all inter-population variance is solely contained in the K PCs. Hence, it is sufficient to conduct the population stratification analysis based on the K PCs alone.

Given a set of SNPs, the divergence among the K populations can be measured by the ratio of the two variance components, i.e.

$$ {\mathrm{F}}_{\mathrm{PC}}=\frac{\sigma_{\mathrm{B}}^2}{\sigma_{\mathrm{W}}^2}. $$
(13)

Its normalized version can be defined as

$$ {\mathrm{F}}_{\mathrm{PC}}^{\ast }=\frac{\sigma_{\mathrm{B}}^2}{\sigma_{\mathrm{B}}^2+{\sigma}_{\mathrm{W}}^2}, $$
(14)

which measures the portion of inter-population variance in the K population informative PCs and takes a value between 0 and 1. The larger the FPC and \( {\mathrm{F}}_{\mathrm{PC}}^{\ast } \) are, the more divergence among the populations.

Note that \( {\mu}_m=2{f}_m=\frac{2}{N}{\sum}_{k=1}^K{N}_k{f}_{km} \), terms in \( {\sigma}_{\mathrm{B}}^2 \) are quadratic functions of fkm, k = 1, 2, , K, m = 1, 2, , M. However, terms in \( {\sigma}_{\mathrm{W}}^2 \) are linear and quadratic functions of the frequencies. Therefore, FPC will decrease when frequencies of the coded alleles become smaller, see Supplemental Text S3 for more details. As a result, instead of improving the population stratification analysis, using rare variants will deteriorate the analysis performance. Meanwhile, since \( {\sigma}_{\mathrm{B}}^2 \) decreases faster than \( {\sigma}_{\mathrm{W}}^2 \), the K eigenvalues will be closer to the other N − K eigenvalues when frequencies of the coded alleles become smaller. Thus, the percent of variance explained by the K PCs will be smaller.

It can be shown that among the K eigenvalues, K − 1 are of large values and one small. When intra-population variance zk − zkk of the K populations are equal, all inter-population variance is contained in the largest K − 1 eigenvalues. In addition, when the sample size is large and the portions of populations remain, inter-population variance contained in the small eigenvalue is negligible, almost all information on the population structure is contained in the largest K − 1 PCs.

For cases with two populations, it can be shown that the two eigenvalues are

$$ {\lambda}_1=\frac{N_1{z}^{11}}{2}+\frac{N_2{z}^{22}}{2}+\frac{z^1-{z}^{11}}{2}+\frac{z^2-{z}^{22}}{2}+\frac{\sqrt{a}}{2}, $$
$$ {\lambda}_2=\frac{N_1{z}^{11}}{2}+\frac{N_2{z}^{22}}{2}+\frac{z^1-{z}^{11}}{2}+\frac{z^2-{z}^{22}}{2}-\frac{\sqrt{a}}{2}, $$

where

$$ a={\left[\left({z}^1-{z}^{11}\right)-\left({z}^2-{z}^{22}\right)+{N}_1{z}^{11}-{N}_2{z}^{22}\right]}^2+4{N}_1{N}_2{\left({z}^{12}\right)}^2. $$

When inter-population variance of the two populations are equal, i.e. \( {z}^1-{z}^{11}={z}^2-{z}^{22}={\sigma}_{\mathrm{W}}^2/2 \), we have

$$ {\lambda}_1={N}_1{z}^{11}+{N}_2{z}^{22}+\frac{\sigma_{\mathrm{W}}^2}{2}, $$
$$ {\lambda}_2=\frac{\sigma_{\mathrm{W}}^2}{2}. $$

That is, all information on the population structure is contained in the largest PC. All proofs are presented in Supplemental Text S4.

Rare variants on the population distance

Suppose that xk, k = 1, 2, , K are the eigenvectors associated with the K eigenvalues containing inter-population variance. We can represent each individual as a point in the K-dimension space. Vector \( \sqrt{\lambda_k}{\boldsymbol{x}}_k \) consists of coordinates of N individuals in the k-th dimension. Average value \( \sqrt{\lambda_k}{\boldsymbol{x}}_k^{\mathrm{T}}{\mathbf{1}}_N/N \) represents center of the total population in the k-th dimension, where 1N is a column vector of dimension N and with each element as 1. Due to the structure of Z, individuals from the same population share the same coordinates in the K-dimension space, and the common points denote the representative points of the populations, or centers of the populations [23]. We define

$$ {d}^2={\sum}_{k=1}^K{\left[\sqrt{\lambda_k}{\boldsymbol{x}}_k-\left(\sqrt{\lambda_k}{\boldsymbol{x}}_k^{\mathrm{T}}{\mathbf{1}}_N/N\right){\mathbf{1}}_N\right]}^{\mathrm{T}}\left[\sqrt{\lambda_k}{\boldsymbol{x}}_k-\left(\sqrt{\lambda_k}{\boldsymbol{x}}_k^{\mathrm{T}}{\mathbf{1}}_N/N\right){\mathbf{1}}_N\right] $$
$$ ={\sum}_{k=1}^K{\lambda}_k-\frac{1}{N}{\mathbf{1}}_N^{\mathrm{T}}Z{\mathbf{1}}_N $$
$$ ={\sum}_{k=1}^K{\lambda}_k-\frac{1}{N}{\sum}_{k=1}^K{N}_k\left({z}^k-{z}^{kk}\right) $$
(15)

which measures the population divergence as the sum of squared distances among populations. The proof is shown in Supplemental Text S5. Here, the second term in (15) is an average intra-population variance. As explained earlier that when allele frequencies become smaller, the K eigenvalues decrease. Hence, the population distance d2 is smaller when using rare SNPs compared to common ones.

The 1000 genomes project data

We used genotype data from the 1000 Genomes Project to validate our theoretical results. Genotype data used in this work was obtained from Phase 3 version 5a of the 1000 Genomes Project [16, 17], which contains 84.4 million genetic markers and 2504 individuals from EUR, EAS, SAS, AFR and AMR. We extracted biallelic SNPs that are polymorphic in the total population. In summary, there are 77,279,863 SNPs; 5,261,820 are common (0.1 < MAF ≤ 0.5), 6,770,457 are low-frequency (0.01 < MAF ≤ 0.1), and 65,247,586 are rare (0.0001 < MAF ≤ 0.01). Genotype data were converted to PLINK format with VCFtools [24]. The SNPs were divided into six frequency bins according to their MAFs, as shown in Table 1. For each bin, we randomly sampled approximately 100,000 SNPs using PLINK for the population stratification analyses. Here, MAFs of the SNPs in the total population were used, hence their frequencies in the five populations may be different and may not be in the same bins as in the total population. For SNPs in bin 5 and 6, they are polymorphic in the total population and may not be polymorphic in all of the five populations. PCAs were carried out, with GRMs computed by EIGENSOFT and PCAs on EGRMs conducted using GCTA. Default parameters were used when analyzing with EIGENSOFT, which excluded 68 and 116 outliers in the analyses of the data from frequency bin 5 and 6, respectively.

Table 1 Summary of SNPs from the 1000 Genomes Project data

Results

Theoretical and empirical EGRMs

To calculate the theoretical results (5)–(10), we computed MAFs of the SNPs with PLINK. Values of variance zk and covariance zkk, zkl, k, l = 1, 2, K, were calculated as in (7), (8), and (10), respectively, in which μm was computed with (3) and \( {\sigma}_m^2=2{f}_m\left(1-{f}_m\right) \), m = 1, 2, M. Values of zk and zkk for the five populations with SNPs from the six frequency bins are presented in Table 2. Absolute values of the inter-population covariance zkl are much smaller and the results are shown in Supplemental Tables S16.

Table 2 Theoretical and empirical values of the variance and covariance elements of EGRMs

To obtain the empirical values of variance zk, as well as covariance zkk and zkl, we first computed GRMs with SNPs from the six bins using EIGENSOFT. Each GRM included N(N + 1)/2 variance and covariance terms of N individuals based on the observed genotype data. Empirical value of zk was computed as the average variance of the Nk individuals from population k. The empirical value of zkk is the average covariance of Nk(Nk − 1)/2 pairs of individuals from population k. Lastly, the value of zkl is the average covariance of NkNl pairs of individuals, one from population k and one from population l. The results of zk and zkk are shown in Table 2, and those of zkl are presented in Supplemental Tables S16.

We can see that across the six frequency bins, theoretical values of zk, zkk, and zkl predicted by (7), (8), and (10), respectively, are close to their empirical values. When MAFs of the SNPs become smaller, intra-population covariance zkk decreases. For example, zkk was 0.2 for EAS with SNPs whose MAFs are between 0.4 and 0.5, which reduced to 0.003 in the sixth bin that included rare SNPs only. A similar pattern can be observed for the other four populations. FPC was estimated by (13) for the six bins, where empirical values of zk and zkk were used. The FPC decreases from 93.85 in bin 1 to 55.01 in bin 5, and further to 1.83 in bin 6. Thus the divergence among the populations is much larger when measured by common SNPs than by rare ones.

PCAs of the 1000 genomes project data

With genotypes of SNPs from each frequency bin, we carried out PCAs of population stratification by EIGENSOFT, which was essentially based on the eigen-analysis of the observed GRMs. Scatter plots of the largest three PCs are shown in Figs. 1 and 2, where eigenvectors were scaled by square roots of their corresponding eigenvalues.

Fig. 1
figure1

Scatter plots and representative points with SNPs from six MAF bins, PC 1 vs. PC 2. (a) 0.4 < MAF ≤ 0.5 (b) 0.3 < MAF ≤ 0.4 (c) 0.2 < MAF ≤ 0.3 (d) 0.1 < MAF ≤ 0.2 (e) 0.01 < MAF ≤ 0.1 (f) 0.0001 < MAF ≤ 0.01. EUR: European, EAS: East Asian, AMR: American, SAS: South Asian, AFR: African. The first values in brackets are the percentages of variance explained from the PCAs of GRMs; and the second values are from the PCAs of EGRMs. Large symbols in black are the representative points of the five populations

Fig. 2
figure2

Scatter plots and representative points with SNPs from six MAF bins, PC 1 vs. PC 3. (a) 0.4 < MAF ≤ 0.5 (b) 0.3 < MAF ≤ 0.4 (c) 0.2 < MAF ≤ 0.3 (d) 0.1 < MAF ≤ 0.2 (e) 0.01 < MAF ≤ 0.1 (f) 0.0001 < MAF ≤ 0.01. EUR: European, EAS: East Asian, AMR: American, SAS: South Asian, AFR: African. The first values in brackets are the percentages of variance explained from the PCAs of GRMs; and the second values are from the PCAs of EGRMs. Large symbols in black are the representative points of the five populations

From Figs. 1 and 2, we can see patterns of population structure computed with common and less-frequency SNPs. For example, Figs. 1a-e and 2a-e displayed similar patterns, whereas the scatter plots based on rare SNPs differed significantly. For example, AMR and SAS are separated mostly by the third PC with common SNPs, while they are distinguished by the second PC with rare ones. The third PC from rare SNPs reveals mostly substructure of AFR, likely because more rare SNPs are polymorphic in AFR than in other populations. Portions of variance explained by the largest five PCs decrease from 17.09% in bin 1 to 10.41% in bin 5, and it falls dramatically to 0.74% with rare SNPs only. As a result, the five populations are more closely distributed around the origin in Figs. 1f and 2f, compared with those in Figs. 1a-e and 2a-e. Clearly, common variants show much better performance in dissecting the population structure than rare variants do.

PCAs of EGRMs

For each frequency bin, we also constructed a EGRM with structure as described in (5), (6), and (9), whose variance and covariance elements were their theoretical values calculated by (7), (8), and (10), respectively. We conducted PCAs of the EGRMs using GCTA, and scatter plots of the largest three PCs shown in Figs. 1 and 2. Large symbols in black are representative points or centers of the five continental populations from eigen-analyses of the EGRMs. Similarly, coordinates were scaled by square roots of their eigenvalues.

Upon comparing the representative points in Figs. 1 and 2, we can see that distances between populations decrease as the SNPs change from common to rare. Sum of the squared distance d2 was calculated for the six frequency bins by (15), where λk, k = 1, 2, K were the eigenvalues of the EGRM Z and zk, zkk, k = 1, 2, K were their theoretical values. The d2 decreases from 444.38 in bin 1 to 254.10 in bin 5, and further to 17.83 in bin 6.

In addition, when portions of variance explained by the PCs become small, deviations between the representative points of the populations and true centers of the populations can be observed. This is particularly evident in the scatter plots with rare SNPs. In the PCAs of a single population, such deviations are more obvious when percents of variance explained by the largest PCs are much smaller.

Discussion

We showed that all information about the population structure is contained in K PCs. Genotypic variance explained by the K PCs can be further decomposed into the intra-population variance \( {\sigma}_{\mathrm{W}}^2 \) and inter-population variance \( {\sigma}_{\mathrm{B}}^2 \). Using more SNPs will improve convergence of the realized GRM to its mathematical expectation, i.e. the EGRM. As a result, individuals belonging to the same population will be more closely distributed around the representative point or center of the population on the PC-PC plots. On the other hand, note that \( {\sigma}_{\mathrm{B}}^2 \) is the average inter-population variance contributed from M SNPs. When rare variants are used, adding more SNPs will not increase the average level of \( {\sigma}_{\mathrm{B}}^2 \), hence neither the ratio FPC nor the sum of squared distances d2 will improve. For same reason, using a combination of common and rare SNPs will result in lower FPC and d2 compared with using common SNPs only and therefore result in worse performance.

In the case where there is one SNP, our FPC and \( {\mathrm{F}}_{\mathrm{PC}}^{\ast } \) can be further reduced to

$$ {\mathrm{F}}_{\mathrm{PC}}=\frac{{\mathrm{F}}_{\mathrm{PC}}^{\ast }}{1-{\mathrm{F}}_{\mathrm{PC}}^{\ast }}=2\frac{\sum_{k=1}^K{N}_k{\left({f}_k-f\right)}^2}{\sum_{k=1}^K{f}_k\left(1-{f}_k\right)}, $$

where fk is the allele frequency in the population k, and f the frequency in the total population. The classical Wright’s fixation index FST is widely used to gauge population stratification [25], which measures the deviation from Hardy-Weinberg equilibrium at the total population level. In this case, it can be shown that

$$ \frac{{\mathrm{F}}_{\mathrm{ST}}}{1-{\mathrm{F}}_{\mathrm{ST}}}=\frac{\sum_{k=1}^K{N}_k{\left({f}_k-f\right)}^2}{\sum_{k=1}^K{N}_k{f}_k\left(1-{f}_k\right)}. $$

Therefore, our \( {\mathrm{F}}_{\mathrm{PC}}^{\ast } \) is much larger than FST. It is worth pointing out that we assign numeric values to genotypes as numbers of the coded alleles, hence our results are dependent on such coding scheme. FST, however, does not involve in the numeric coding of genotypes. Note also that \( {\mathrm{F}}_{\mathrm{PC}}^{\ast } \) measures the portion of inter-population variance in the K population informative PCs. After the eigen-decomposition, most of the intra-population variance associated with the other N-K PCs was excluded. If our FPC were defined as the ratio of the inter-population variance to the intra-population variance in the N PCs, it would be related to FST as FPC = 2FST/(1 − FST).

In GWAS, it is a common practice to conduct population stratification analyses using a large number of random markers [26], which usually yields satisfactory results. As shown in this work, the capability of dissecting population structure depends on the allele frequencies of markers used in the analyses, and common variants perform much better than rare ones. This is not much of a concern for GWAS because SNP panels implemented in the genotyping platforms are mostly of common ones. In sequencing studies, however, the majority of the called variants are rare, and selecting SNPs randomly will yield a large portion of rare SNPs, which will deteriorate the analysis performance. Therefore, it is necessary to restrict the selection to only the common SNPs when analyzing population stratification with sequencing data. This would also be true for controlling population stratification based on the linear mixed models [11,12,13,14].

In this work, we assumed that μm and σm are known constants in (2) in order to simplify the theoretical derivations. Our results are approximations of those when estimates of the two quantities are used. When sample size N is large, variations associated with \( {\hat{\mu}}_m \) and \( {\hat{\sigma}}_m \) are much smaller than those with the genotype data. Therefore, the mathematical expectations are largely taken with respect to the genotypes and difference between the two sets of results would be small. As shown in Table 2, the predicted values of the EGRM are close to their empirical values in the 1000 Genome Project data. We carried out additional simulation studies to evaluate the effect of lacking knowledge on μm and σm. We randomly chose one SNP from each of the six frequency bins (Supplemental Table S7). Based on their MAFs observed in the five populations of the 1000 Genomes Project, we simulated genotypes of five populations each with 500 individuals. Values of μm and σm were computed with the assumption of known population structure and MAF information, and theoretical values of zk and zkk were then calculated. For comparison, we first estimated \( {\hat{\mu}}_m \) and \( {\hat{\sigma}}_m \) from the simulated genotype data. Y(n, m) were normalized with \( {\hat{\mu}}_m \) and \( {\hat{\sigma}}_m \), and zk and zkk were obtained as averages of sample variance and covariance from 1000 replicates. The two sets of results are presented in Supplemental Table S8 and the differences between the two sets of results are negligible except for small differences in the results with the rare SNP.

Inferring population structure based on a large number of genome-wide markers are likely to include markers in linkage disequilibrium (LD). Practical concerns on the LD and choice of markers were extensively discussed in [5]. It is worth noting that each marker contributes to the elements in GRM additively, see eqs. S1–3 in Supplemental Text S2. Because of the linearity of expectation, our EGRM formulae as well as the eigen-analysis on the EGRM still hold when LD exists among markers. When the number of markers goes large, convergence of the GRM to EGRM will be slower with LD among markers, compared with the case that independent markers are used. Since there are always limited number of markers in the PCA practice, our EGRM and the eigen-analysis on it represent asymptotic results of the real PCA analysis.

Despite the fact that the vast majority of rare variants are population-specific, we showed that performance of the PCA of population stratification is better when based on common SNPs rather than rare ones. On the other hand, the PCA results with rare SNPs do reveal a population structure that differs from that of common SNPs. Existing methods may not exploit ancestry information embedded in the rare variants efficiently, and different approaches from those applied to common variants should be developed [26].

Conclusions

To quantify population divergence as a function of allele frequencies of genetic markers used in the PCA analysis, we derived the expected genetic relationship matrix. We proposed FPC, ratio of the inter-population variance to the intra-population variance, and population distance d2 as measures of population divergence. Our theoretical results as well as the analyses of the 1000 Genomes Project data showed that employing rare variants yielded smaller FPC in the K population informative PCs, smaller d2, and smaller portion of variance explained by the K PCs than those using common variants. Therefore, the PCA of population stratification performs worse with rare variants than with common ones. When analyzing population stratification with sequencing data, it is necessary to restrict the selection to only the common variants.

Availability of data and materials

The datasets analyzed during the current study are available at https://www.internationalgenome.org. The accession number at https://www.ebi.ac.uk/ena is PRJNA262923.

Abbreviations

GWAS:

Genome-wide association study

LD:

Linkage disequilibrium

MAF:

Minor allele frequency

PC:

Principal component

PCA:

Principal component analysis

SNP:

Single nucleotide polymorphism

References

  1. 1.

    McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008;9(5):356–69.

    CAS  PubMed  Article  Google Scholar 

  2. 2.

    Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, Pato MT, Petryshen TL, Kolonel LN, Lander ES, Sklar P, Henderson B, Hirschhorn JN, Altshuler D. Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004;36(4):388–93.

    CAS  PubMed  Article  Google Scholar 

  3. 3.

    Tian C, Gregersen PK, Seldin MF. Accounting for ancestry: population substructure and genome-wide association studies. Hum Mol Genet. 2008;17(R2):R143–50.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  4. 4.

    Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nat Genet. 2012;44(3):243–6.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  5. 5.

    Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–9.

    CAS  PubMed  Article  Google Scholar 

  6. 6.

    Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):e190.

    PubMed  PubMed Central  Article  Google Scholar 

  7. 7.

    Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  8. 8.

    Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155(2):945–59.

    CAS  PubMed  PubMed Central  Google Scholar 

  9. 9.

    Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164(4):1567–87.

    CAS  PubMed  PubMed Central  Google Scholar 

  10. 10.

    Raj A, Stephens M, Pritchard JK. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics. 2014;197(2):573–89.

    PubMed  PubMed Central  Article  Google Scholar 

  11. 11.

    Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38(2):203–8.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  12. 12.

    Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, Gore MA, Bradbury PJ, Yu J, Arnett DK, Ordovas JM, Buckler ES. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 2010;42(4):355–60.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  13. 13.

    Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42(4):348–54.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  14. 14.

    Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods. 2014;11(4):407–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  15. 15.

    Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–51.

    CAS  PubMed  Article  Google Scholar 

  16. 16.

    1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–73.

    Article  Google Scholar 

  17. 17.

    1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65.

    Article  Google Scholar 

  18. 18.

    Baye TM, He H, Ding L, Kurowski BG, Zhang X, Martin LJ. Population structure analysis using rare and common functional variants. BMC Proc. 2011;5(Suppl 9):S8.

    PubMed  PubMed Central  Article  Google Scholar 

  19. 19.

    Siu H, Jin L, Xiong M. Manifold learning for human population structure studies. PLoS One. 2012;7(1):e29901.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  20. 20.

    Zhang Y, Guan W, Pan W. Adjustment for population stratification via principal components in association analysis of rare variants. Genet Epidemiol. 2013;37(1):99–109.

    PubMed  Article  Google Scholar 

  21. 21.

    Zhang Y, Shen X, Pan W. Adjusting for population stratification in a fine scale with principal components and sequencing data. Genet Epidemiol. 2013;37(8):787–801.

    PubMed  Article  Google Scholar 

  22. 22.

    Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  23. 23.

    Ma J, Amos CI. Theoretical formulation of principal components analysis to detect and correct for population stratification. PLoS One. 2010;5(9):e12510.

    PubMed  PubMed Central  Article  Google Scholar 

  24. 24.

    Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  25. 25.

    Wright S. The genetical structure of populations. Ann Eugenics. 1951;15:323–45.

    CAS  Article  Google Scholar 

  26. 26.

    Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010;11(7):459–63.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

Download references

Funding

This work was supported by the national Thousand Youth Talents Plan. The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Affiliations

Authors

Contributions

SM: conceived the concept, conducted the analyses, and drafted the manuscript. GS: conceived the concept, supervised the work, reviewed and revised the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Gang Shi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1 Text S1

. Proof of eq. (4). Text S2. Proofs of eqs. (5–10). Text S3. FPC as a function of allele frequencies. Text S4. Proofs of rare variants on the eigenvalues. Text S5. Proof of eq. (15). Table S1. Theoretical and empirical values of the inter-population covariance of EGRM (0.4 < MAF ≤ 0.5). Table S2. Theoretical and empirical values of the inter-population covariance of EGRM (0.3 < MAF ≤ 0.4). Table S3. Theoretical and empirical values of the inter-population covariance of EGRM (0.2 < MAF ≤ 0.3). Table S4. Theoretical and empirical values of the inter-population covariance of EGRM (0.1 < MAF ≤ 0.2). Table S5. Theoretical and empirical values of the inter-population covariance of EGRM (0.01 < MAF ≤ 0.1). Table S6. Theoretical and empirical values of the inter-population covariance of EGRM (0.0001 < MAF ≤ 0.01). Table S7. MAFs of the six SNPs used in the simulations. Table S8. Expected variance and covariance with and without the knowledge of μm and σm.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Ma, S., Shi, G. On rare variants in principal component analysis of population stratification. BMC Genet 21, 34 (2020). https://doi.org/10.1186/s12863-020-0833-x

Download citation

Keywords

  • Rare variant
  • Population stratification
  • Principal component analysis
  • Single nucleotide polymorphism