Impact of genetic similarity on imputation accuracy
BMC Genetics volume 16, Article number: 90 (2015)
Abstract
Background
Genotype imputation is a common technique in genetic research. Genetic similarity between the target population and the reference dataset is crucial for high-quality results. Although several reference panels are available, it is often not clear which is optimal for a particular target dataset to be imputed. Maximizing genetic similarity between the study sample and the candidate reference panels may be the most straightforward method for selecting the genetically best-matched reference. However, the impact of genetic similarity on imputation accuracy has not yet been studied in detail.
Results
We performed a simulation study in 20 ethnic groups obtained from POPRES. High-quality SNPs were masked and re-imputed with MaCH, MaCH-minimac and IMPUTE2 using four different HapMap reference panels (CEU, CHB-JPT, MEX and YRI). Imputation accuracy was assessed by different statistics. Genetic similarity between ethnic groups and reference populations was measured by F statistics (F _{ ST }), originally proposed by Wright, and G statistics (G _{ ST }), introduced by Nei and others. To assess the predictive power of these measures regarding imputation accuracy, we analysed the relations between them and the corresponding imputation accuracy scores. We found that population genetic distances between homogeneous reference and target populations were strongly linearly correlated with the resulting imputation accuracies, irrespective of the distance measure, imputation accuracy measure, degree of missingness and imputation software used. A possible exception was the African population.
Conclusion
Usage of G _{ ST } or F _{ ST }-related measures for predicting the optimal reference panel is highly recommended for imputation frameworks relying on a specific reference. A cutoff of G _{ ST } < 0.01 is recommended to achieve good imputation results for high-frequency variants and small data sets. The linear relationship is less pronounced for low-frequency variants, for which we also observed a dependence of imputation accuracy on the number of polymorphic sites in the reference. We also show that the software-specific measures MaCH-Rsq and IMPUTE-info must be interpreted with caution if the genetic distance between target and reference population is high.
Introduction
Genotype imputation is a common technique applied in the context of genome-wide association (GWA) analysis. Typically, a set of densely genotyped samples is used as a reference to infer a large set of untyped or missing markers in the target population. Although one has to deal with the uncertainty of genotypes derived by imputation, this procedure is nowadays standard since it makes large-scale genome-wide investigations feasible and cost-effective. Furthermore, it enables meta-analysis by combining datasets genotyped on different platforms (e.g. Illumina versus Affymetrix arrays) [1]. It is also believed that genotype imputation improves the statistical power of genome-wide association studies (GWAS) [2].
Moreover, imputation plays an essential role in the analysis of sequencing data [3]. Although a dramatic cost reduction of next-generation sequencing technology has been achieved, whole-genome sequencing of large study samples is still unaffordable. A way out might be to sequence a subset of individuals, which could then serve as an additional reference for imputation [4]. Strategies for selecting the individuals to be sequenced have been suggested recently [5]. These strategies consider genetic similarities between the study population, the subsets to be sequenced and the reference panel.
A number of different approaches have been suggested for building publicly available reference panels that can maximize imputation accuracy. Some imputation software such as IMPUTE2 [6] and MaCH-Admix [7] can exploit cosmopolitan references in order to optimize sequence similarity locally. However, other popular imputation frameworks (e.g. MaCH [8] and MaCH-minimac [9]) still rely on pre-selection of reference panels that are most closely matched with the ancestry of the study population. For example, CEU is frequently used as imputation reference panel for European and European American samples, while CHB and JPT were chosen to impute samples from East Asian populations [4].
Genetic distances such as different F _{ ST } measures [10–16] or principal component analysis [17] have been proposed to determine the genetic similarity between target and reference datasets. F-statistics were originally proposed by Wright to assess the genetic structure of populations [10, 12]. F _{ ST } measures were therefore constructed to evaluate the genetic distance between (homogeneous) populations, or in other words, the degree of genetic variance explained by ethnic sub-entities. Since the first introduction of Wright’s F _{ ST }, a large variety of other F _{ ST }-related measures and corresponding estimators have been proposed [11–18]. Nei [13, 14, 19] introduced the measure G _{ ST }, which is also frequently used for this purpose [15]. A few studies revealed that F _{ ST }-like measures calculated between target and reference populations correlate with imputation accuracy [20, 21].
However, it is still unclear how a reference panel with low genetic similarity affects the imputation accuracy. So far, no exact strategy (e.g. cutoffs for F _{ ST }-related measures) which could help to select a well-suited reference panel has been proposed. To the best of our knowledge, there is no research on the relation between Nei’s G _{ ST } and imputation accuracy. Therefore, in the present paper, we performed a simulation study to investigate the relationship between G _{ ST } and other F _{ ST }-like measures on the one hand and imputation accuracy on the other, as obtained by three imputation frameworks: MaCH, MaCH-minimac and IMPUTE2. All these frameworks can be run with specific rather than cosmopolitan reference panels. Finally, we investigated the impact of missingness and frequency of variants on this relationship. All analyses were performed on the basis of the publicly available POPRES dataset [22].
Materials and methods
POPRES project
POPRES is a project fostering large Population Reference Samples of different ethnic origins [22]. The original POPRES project contains nearly 5,000 individuals of African-American, East Asian, South Asian, Mexican and European origin. Individuals included in the POPRES study were collected from different study groups all over the world. POPRES performed genome-wide genotyping of these individuals on the Affymetrix (Mountain View, CA) GeneChip 500 K Array set with the published protocol for the 96-well plate format. Sample collection and methods for POPRES are described elsewhere [22]. The datasets used for the present analyses were obtained from dbGaP [23] through dbGaP accession number phs000145.v4.p2.
Datasets
We considered chromosome 22 from the POPRES dataset for our research. This dataset originally consisted of 5,637 SNPs measured in individuals from 35 different populations. To avoid biases due to different sample sizes and to include as many populations as possible in our analysis, we considered an equal number of individuals for each subpopulation (N = 40). If more than 40 individuals were available, a random sample of N = 40 was drawn. Population groups with fewer than 40 members were discarded, resulting in a total of 20 different ethnic subsets, namely 15 populations of Caucasian origin: Australian, Canadian, German, French, Swiss-French, Swiss-German, Swiss, Italian, Spanish, Irish, British, Belgian, Portuguese, former Yugoslavian and a mixed group of East European origin (a mixture of people from the Czech Republic, Hungary and Poland); two populations of South Asian origin: Indians and Punjabis; one East Asian population: Japanese; one Mexican population; and finally, a mixed population of African-Americans (AfAm). Study populations which do not match very closely with the available HapMap references CEU, JPT, CHB and YRI (see below) were included to indicate the impact of imperfect reference panels on the target populations. This might apply to the following populations: Indian, Punjabi, Yugoslavian, East European, Portuguese and African-American. Target populations are abbreviated as follows: Europeans by EU, East European populations (eastEU and Yugoslavia) by EEU, South Asians (Punjabi and Indian) by SASI, Japanese by Jap, Mexican by MEX, South Europeans (Italian, Portuguese) by SEU and African-Americans by AfAm.
Reference Panel
1000 Genomes datasets were based on low-depth whole-genome sequencing data and are generally considered to have lower accuracy than HapMap data. Thus, we considered HapMap3 [24] reference panels (NCBI Build 36) to impute the above-mentioned populations. Four different pre-formatted reference panels, CEU, YRI, MEX and JPT + CHB, provided by the MaCH developers [25] and IMPUTE developers [26], were considered. In a full-factorial design, we imputed our target populations with these reference panels.
Strand verification of SNPs
Genomic assembly of the original POPRES data was identical to Affymetrix release 25 NSP25 and STY25 and the corresponding rsIDs were identified by NCBI build “b36” with UCSC version “hg18”. Strand alignment between study sample and reference data was performed using fcGENE [27] and PLINK [28]. SNPs with ambiguous strands and SNPs which could not be found in the HapMap3 reference panel were removed. In total 1,014 SNPs could not be matched to HapMap3 reference panels and were excluded. 4,623 SNPs overlapping with HapMap reference panel remained for further analyses.
Selection of good-quality (GQ) SNPs
Good-quality (GQ) SNPs were selected with stringent filtering criteria for genotype quality. These GQ SNPs were then assumed to express true genotypes for our experimental study. In analogy to our previous research [29], we masked these SNPs and re-imputed them to evaluate the imputation accuracy as explained in the next section. More precisely, we compared the posterior genotype probability distributions produced by the imputation software with the corresponding true genotypes. To select GQ SNPs, we applied the following quality criteria: average call rate (CR averaged over all populations ≥ 95 %), average minor allele frequency (MAF averaged over all populations ≥ 0.1) and p-value of the stratified Hardy-Weinberg equilibrium test (p (HWE) ≥ 10e-2). Since the samples were from multiple ethnic groups, we used the exact stratified test of HWE [30]. A total of 457 SNPs passed these quality criteria.
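The three filters can be written down compactly. The following is a minimal sketch and not the authors' pipeline: the `SnpStats` record, the function names and the example values are hypothetical, and the HWE cutoff is passed in explicitly (the value used in the example is an assumed reading of the threshold stated above).

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Per-SNP summary statistics (hypothetical record for illustration)."""
    snp_id: str
    call_rate: float  # call rate averaged over all populations, in [0, 1]
    maf: float        # minor allele frequency averaged over all populations
    hwe_p: float      # p-value of the stratified exact HWE test


def is_good_quality(snp, min_hwe_p, min_call_rate=0.95, min_maf=0.10):
    """Return True if the SNP passes all three quality criteria."""
    return (snp.call_rate >= min_call_rate
            and snp.maf >= min_maf
            and snp.hwe_p >= min_hwe_p)


def select_gq_snps(snps, min_hwe_p, min_call_rate=0.95, min_maf=0.10):
    """Keep only the SNPs that satisfy every filter, preserving input order."""
    return [s for s in snps
            if is_good_quality(s, min_hwe_p, min_call_rate, min_maf)]


if __name__ == "__main__":
    HWE_CUTOFF = 1e-2  # assumption: threshold read as 10^-2
    example = [
        SnpStats("rs_a", 0.99, 0.25, 0.50),   # passes all filters
        SnpStats("rs_b", 0.90, 0.25, 0.50),   # fails call rate
        SnpStats("rs_c", 0.99, 0.05, 0.50),   # fails MAF
        SnpStats("rs_d", 0.99, 0.25, 0.001),  # fails HWE
    ]
    print([s.snp_id for s in select_gq_snps(example, HWE_CUTOFF)])
```

Keeping the thresholds as parameters makes it easy to check how sensitive the number of retained SNPs is to each criterion.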
Masking Process
We performed the masking process in two phases. First, we masked all good-quality SNPs and imputed them with the imputation frameworks MaCH, MaCH-minimac and IMPUTE2. We also considered additional scenarios in which we masked only 70 % and 50 % of the previously selected good-quality SNPs. This masking was nested, i.e. all SNPs masked at a lower percentage of missingness were also masked at the higher percentages. The first type of masking (100 %) was used to investigate the relationships between G _{ ST } and F _{ ST }-related scores and the corresponding imputation accuracy. The second type of masking was used to study the impact of different degrees of missingness on these relations. For this purpose, we only compared the 50 % of GQ SNPs which were masked in all three missingness scenarios to avoid bias introduced by SNP selection.
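The nested masking scheme described above can be sketched as follows. This is an illustrative sketch under our own assumptions (the function name, the random seed and the SNP identifiers are hypothetical): nesting is obtained by shuffling the GQ SNPs once and taking prefixes of increasing length.

```python
import random


def nested_masks(snp_ids, fractions=(0.5, 0.7, 1.0), seed=0):
    """Return {fraction: set of masked SNP ids} such that every SNP masked
    at a lower fraction is also masked at all higher fractions."""
    rng = random.Random(seed)
    order = list(snp_ids)
    rng.shuffle(order)  # one fixed random order shared by all scenarios
    masks = {}
    for fraction in sorted(fractions):
        n_masked = int(round(fraction * len(order)))
        masks[fraction] = set(order[:n_masked])  # prefix => nested subsets
    return masks


if __name__ == "__main__":
    ids = [f"snp{i}" for i in range(10)]  # hypothetical SNP identifiers
    masks = nested_masks(ids)
    for fraction, masked in sorted(masks.items()):
        print(fraction, sorted(masked))
```

Comparing accuracy only on the SNPs in the smallest set (`masks[0.5]` here) corresponds to the restriction to SNPs masked in all three scenarios.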
Imputation
Imputations were performed separately for each of the previously mentioned populations combined with each of the four reference panels. As suggested by the MaCH developers, imputation with this software was performed in two steps. In the first step, the imputation error rate and recombination rate were estimated. These two model parameters were determined by running the “greedy” algorithm for 100 iterations and were used in the second step to determine the transition probabilities of the underlying Hidden Markov Model [8]. In the second step, the most likely genotype probability distribution of each genotype of each individual and the imputation quality measured by the software-specific Rsq score were determined. Commands used for MaCH imputation are provided in Additional file 1. The relative performance of imputation methods differs greatly as a function of sample sizes, marker densities and parameters of the algorithm such as the number of EM iterations. Therefore, the same standard parameter settings were used for each imputation process.
Imputation with MaCH-minimac was also performed in two steps. In the first step, MaCH was used to predict the haplotypes of the study data sets, and then minimac was used to calculate the posterior probabilities of the genotypes using these haplotypes.
As suggested by the software developers, imputation with IMPUTE2 was performed in a segmented way by defining genomic intervals of approximately 5 Mb in size. An internal buffer region of 250 kb on both sides of the analysis interval was used to avoid margin effects of chromosome segmentation.
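The segmentation can be illustrated with a short helper that splits a chromosomal region into analysis intervals and adds the flanking buffer. This is a sketch under our own assumptions: the function name and the example coordinates are hypothetical, and it only computes coordinates rather than calling the imputation software.

```python
def impute_chunks(start, end, chunk_size=5_000_000, buffer=250_000):
    """Split [start, end] (1-based, inclusive, in base pairs) into
    consecutive analysis intervals of at most `chunk_size` bp and report
    each interval together with its buffered extent."""
    if start > end:
        raise ValueError("start must not exceed end")
    chunks = []
    pos = start
    while pos <= end:
        stop = min(pos + chunk_size - 1, end)
        chunks.append({
            "analysis": (pos, stop),
            # Buffer on both sides, clipped to the region boundaries.
            "with_buffer": (max(start, pos - buffer), min(end, stop + buffer)),
        })
        pos = stop + 1
    return chunks


if __name__ == "__main__":
    # Hypothetical 12 Mb region, for illustration only.
    for chunk in impute_chunks(1, 12_000_000):
        print(chunk)
```

Only genotypes inside the analysis interval would be kept; markers in the buffer merely provide haplotype context at the interval margins.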
After imputation, we compared the estimated posterior distribution with the measured genotypes as explained below. Considering four reference panels, 20 target populations, two missing scenarios and three software packages, a total of 480 imputations were performed.
Assessment of imputation accuracy
A common strategy for determining imputation performance is to compare true genotypes (genotypes measured by a technique with high confidence or consensus genotypes) with the corresponding imputation results. Here, we directly compared the posterior distributions of our re-imputed GQ SNPs with the corresponding measured genotypes by applying our recently proposed Hellinger and SEN scores [31]. Both measures are platform-independent. While the Hellinger score measures the distance between the imputed and measured genotype probability distributions, the SEN score is maximal if their expectations are identical. Thus, SEN essentially compares gene doses. Cutoffs of 0.95 for the SEN score and 0.45 for the Hellinger score were considered as indicators of good imputation accuracy (for motivation, see also Fig. 2 below). We also analysed imputation accuracy using the software-specific measures MaCH-Rsq and IMPUTE-info determined during the imputation process. The MaCH-Rsq measure is basically defined as the ratio of the empirically observed variance of the allele dosage to the expected binomial variance under Hardy-Weinberg equilibrium [32]. Similarly, the IMPUTE-info score is the relative statistical information about the SNP allele frequency derived from the imputed data [33]. These two software-specific measures are defined at the SNP level and are useful to assess imputation quality of SNPs for which no measurements are available. These scores are widely applied to remove SNPs with low imputation accuracy during post-imputation quality control.
While Hellinger and SEN scores assess the agreement of imputed and observed genotypes individually, the software-specific measures assess imputation quality for entire SNPs, i.e. they cannot be interpreted for single genotypes.
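To make the genotype-level comparison concrete, the sketch below scores one imputed genotype against its measured value. It rests on an assumption that should be checked against [31]: the score is taken here as one minus the Hellinger distance between the posterior genotype distribution and a point mass at the measured genotype. The function names are hypothetical, and the SEN score is not reproduced; only the expected allele dosage that it compares is computed.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions on the same
    support, written via the Bhattacharyya coefficient."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))


def hellinger_score(posterior, measured_genotype):
    """Assumed score: 1 - Hellinger distance between the posterior
    (P(AA), P(AB), P(BB)) and the measured genotype coded 0, 1 or 2."""
    observed = [0.0, 0.0, 0.0]
    observed[measured_genotype] = 1.0
    return 1.0 - hellinger_distance(posterior, observed)


def expected_dosage(posterior):
    """Expected number of B alleles under the posterior distribution."""
    return posterior[1] + 2.0 * posterior[2]


if __name__ == "__main__":
    posterior = [0.05, 0.90, 0.05]  # hypothetical imputation output
    print(hellinger_score(posterior, 1), expected_dosage(posterior))
```

Under this assumed definition, a posterior that puts probability 0.5 on the measured genotype scores about 0.46, so the 0.45 cutoff roughly corresponds to assigning about half of the posterior mass to the true genotype.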
Estimation of G _{ ST } and other F _{ ST }-related measures
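Nei’s G _{ ST } is defined as the ratio of the average gene diversity between subpopulations and the gene diversity of the total pooled population:
$${G}_{ST}=\frac{D{\prime}_{ST}}{H_T}$$
where H _{ T } is the heterozygosity expected under Hardy-Weinberg equilibrium for the total pooled population and D’_{ ST } is the average gene diversity between the subpopulations [13, 14, 19]. For bi-allelic markers, Bhatia et al. [16] recommended the estimator of G _{ ST } at any particular k ^{th} marker as:
$${\widehat{G}}_{ST}^k=\frac{{\left({p}_1^k-{p}_2^k\right)}^2}{2{\overline{p}}^k\left(1-{\overline{p}}^k\right)},\kern1em {\overline{p}}^k=\frac{p_1^k+{p}_2^k}{2}$$
where p _{1} ^{k} and p _{2} ^{k} are the allele frequencies of the reference allele at the k ^{th} marker in the two populations. To calculate G _{ ST } between two population groups genotyped at N markers, one can use the formula:
$${\widehat{G}}_{ST}=\frac{{\displaystyle \sum_{k=1}^N{\left({p}_1^k-{p}_2^k\right)}^2}}{{\displaystyle \sum_{k=1}^N2{\overline{p}}^k\left(1-{\overline{p}}^k\right)}}$$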
Computation of pairwise G _{ ST } between any two populations is implemented in the most recent version of fcGENE [27]. Small values of G _{ ST } indicate that allele frequencies in the two populations are similar, i.e. the genetic distance between them is small.
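As an illustration of the pairwise computation, the sketch below assumes the Nei-type estimator recommended by Bhatia et al. [16], with per-marker numerator (p1 - p2)^2 and denominator 2·p̄·(1 - p̄), where p̄ is the mean of the two allele frequencies, combined over markers as a ratio of sums. It is not the fcGENE implementation; function names and example frequencies are hypothetical.

```python
def gst_components(p1, p2):
    """Per-marker numerator and denominator of the assumed Nei-type
    estimator for two populations with reference-allele frequencies p1, p2."""
    p_bar = (p1 + p2) / 2.0
    numerator = (p1 - p2) ** 2
    denominator = 2.0 * p_bar * (1.0 - p_bar)
    return numerator, denominator


def pairwise_gst(freqs1, freqs2):
    """Pairwise G_ST over N markers: sum numerators and denominators over
    all markers first, then take the ratio."""
    if len(freqs1) != len(freqs2):
        raise ValueError("both populations need the same markers")
    num_sum = 0.0
    den_sum = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        num, den = gst_components(p1, p2)
        num_sum += num
        den_sum += den
    if den_sum == 0.0:
        raise ValueError("all markers are monomorphic in the pooled sample")
    return num_sum / den_sum


if __name__ == "__main__":
    target = [0.20, 0.50, 0.35]     # hypothetical allele frequencies
    reference = [0.40, 0.50, 0.30]  # hypothetical allele frequencies
    print(round(pairwise_gst(target, reference), 4))
```

Summing before dividing means that markers which are nearly monomorphic in the pooled sample contribute little, instead of producing unstable per-marker ratios.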
Regarding F _{ ST }-related measures, we considered F _{ ST } ^{R} described in the work of Reich et al. [17] and implemented in the program EIGENSOFT, in which a block jackknife procedure is used to estimate the standard error of F _{ ST } ^{R} . For the k ^{th} marker, F _{ ST } ^{R} is calculated as
$${F}_{ST}^{R,k}=\frac{N^k}{D^k},\kern1em {N}^k={\left(\frac{{\mathrm{a}}_1}{{\mathrm{m}}_1}-\frac{{\mathrm{a}}_2}{{\mathrm{m}}_2}\right)}^2-\frac{{\mathrm{h}}_1}{{\mathrm{m}}_1}-\frac{{\mathrm{h}}_2}{{\mathrm{m}}_2},\kern1em {D}^k={N}^k+{\mathrm{h}}_1+{\mathrm{h}}_2$$
where a_{1} and a_{2} are the specific allele counts and m_{1} and m_{2} are the total allele counts of the marker in the two populations. Heterozygosities of the markers are 2 h_{1} and 2 h_{2}, with $${\mathrm{h}}_1=\frac{{\mathrm{a}}_1\left({\mathrm{m}}_1-{\mathrm{a}}_1\right)}{{\mathrm{m}}_1\left({\mathrm{m}}_1-1\right)}$$ , $${\mathrm{h}}_2=\frac{{\mathrm{a}}_2\left({\mathrm{m}}_2-{\mathrm{a}}_2\right)}{{\mathrm{m}}_2\left({\mathrm{m}}_2-1\right)}$$ respectively. Let n_{1} and n_{2} be the numbers of individuals genotyped in the two populations at the k ^{th} marker. The allele counts a_{1} and a_{2} and the total allele counts m_{1} and m_{2} can be determined as a_{1} = 2u_{1} + v_{1}, a_{2} = 2u_{2} + v_{2}, m_{1} = 2n_{1}, m_{2} = 2n_{2}, where u_{1} and v_{1}, and u_{2} and v_{2}, are the counts of homozygotes and heterozygotes in the first and in the second population, respectively. Now if there are N markers genotyped in each population, an unbiased estimator of F_{ST} can be defined as
$${\widehat{F}}_{ST}^R=\frac{{\displaystyle \sum_{k=1}^N{N}^k}}{{\displaystyle \sum_{k=1}^N{D}^k}}$$
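The count-based estimator can be sketched as follows. The sketch assumes the form of the Reich et al. estimator as we understand it: per marker, N = (a1/m1 - a2/m2)^2 - h1/m1 - h2/m2 and D = N + h1 + h2, with h_i = a_i(m_i - a_i)/(m_i(m_i - 1)) computed from total allele counts m_i = 2n_i, and the final estimate taken as the ratio of the sums of N and D over markers. It is not the EIGENSOFT code, omits the block jackknife, and uses hypothetical function names and genotype counts.

```python
def reich_marker_terms(pop1, pop2):
    """Numerator and denominator for one marker.

    Each population is given as (u, v, n): homozygotes for the counted
    allele, heterozygotes, and genotyped individuals (n >= 1)."""
    terms = []
    for u, v, n in (pop1, pop2):
        a = 2 * u + v  # count of the specific allele
        m = 2 * n      # total allele count
        h = a * (m - a) / (m * (m - 1))  # 2h estimates heterozygosity
        terms.append((a, m, h))
    (a1, m1, h1), (a2, m2, h2) = terms
    numerator = (a1 / m1 - a2 / m2) ** 2 - h1 / m1 - h2 / m2
    denominator = numerator + h1 + h2
    return numerator, denominator


def reich_fst(markers):
    """Ratio-of-sums estimate over a list of (pop1, pop2) genotype counts."""
    num_sum = 0.0
    den_sum = 0.0
    for pop1, pop2 in markers:
        num, den = reich_marker_terms(pop1, pop2)
        num_sum += num
        den_sum += den
    if den_sum == 0.0:
        raise ValueError("no polymorphic marker available")
    return num_sum / den_sum


if __name__ == "__main__":
    # Hypothetical genotype counts (u, v, n) for two markers.
    markers = [((3, 4, 10), (1, 2, 10)), ((5, 5, 12), (4, 6, 12))]
    print(round(reich_fst(markers), 4))
```

Because of the sampling corrections h1/m1 and h2/m2, the estimate can become slightly negative for populations with nearly identical allele frequencies.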
In order to compare the relative performance of different F _{ ST }-related measures in predicting imputation accuracy, we also computed the original and modified estimators of F _{ ST } (denoted by F _{ ST } ^{WC} and F _{ ST } ^{mWC} ) proposed by Weir and Cockerham [11]. F _{ ST } ^{WC} between two populations was calculated as follows
where
A modified estimator F _{ ST } ^{mWC} of Weir and Cockerham’s F _{ ST } is defined as follows [16]:
These formulae for F _{ ST } ^{WC} and F _{ ST } ^{mWC} are also implemented in fcGENE. While computing Weir and Cockerham’s F _{ ST } and Nei’s G _{ ST }, we used the same reference alleles throughout all populations.
In previous studies [20], F _{ ST } was computed for individual SNPs and then averaged across SNPs. However, such F _{ ST } estimators do not account for haplotype diversity very well [34]. Therefore, in our formulae all quantities were averaged over all SNPs first, and F _{ ST } was calculated afterwards. This estimate is more precise, i.e. it results in smaller standard errors, as pointed out in [17, 35].
Correlation statistics
Calculation of imputation quality scores is based on GQ SNPs masked prior to imputation. After calculation of G _{ ST } and other F _{ ST }-related measures between POPRES populations and the four reference panels considered, we compared population distances with the corresponding imputation accuracy scores. Scatter plots were generated using the R package ‘ggplot2’ [36], which allows smoothed curves of non-linear relationships to be constructed. To determine the correlation among G _{ ST } and other F _{ ST }-related measures, we used Kendall’s rank correlation coefficient (Kendall’s tau coefficient) [37], which measures the similarity of the ordering of the data to be compared.
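For readers unfamiliar with the rank correlation, a minimal pure-Python sketch is given below. It implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; whether the original analysis used this or a tie-corrected variant is not stated, and the function name and example values are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both measures order the pair the same way
        elif sign < 0:
            discordant += 1   # the two measures disagree on the order
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical distances of four population pairs under two measures.
    gst = [0.004, 0.020, 0.110, 0.150]
    fst = [0.003, 0.025, 0.100, 0.160]
    print(kendall_tau_a(gst, fst))
```

A value of 1 means that two distance measures rank all population pairs identically, which is the property of interest when asking whether they are interchangeable.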
Results
Comparison of measures of genetic distance between populations
First, we compared our different measures of population distance derived from all pairwise comparisons of POPRES data subsets and reference populations. Results can be found in Fig. 1. The figure shows that Nei’s G _{ ST } is almost equivalent to the other F _{ ST }-related measures, i.e. a strong linear relationship is observed. F _{ ST } ^{WC} and F _{ ST } ^{mWC} are better correlated with G _{ ST } than F _{ ST } ^{R} . In the scatter plot of G _{ ST } and F _{ ST } ^{R} , the strongest deviation from the linear trend is caused by the population group AfAm with reference panel CHB.JPT. We investigated the cause of this deviation and found that low-frequency variants (SNPs with MAF ≤ 0.05) strongly influence F _{ ST } ^{R} while G _{ ST } is robust (see Additional file 1: Table S1).
Characterization of measures of imputation accuracy
Next, we analysed the accuracy scores obtained from imputing each of the 20 target populations with each of the four reference panels using the three imputation frameworks. As an example, the distributions of imputation accuracy scores of the target population from Germany imputed with the four different reference panels using MaCH are displayed in Fig. 2.
As expected, the target population from Germany is best imputed with the genetically closest reference panel “CEU” followed by “MEX”, “CHB.JPT” and “YRI”. A cutoff of 0.45 for the Hellinger score almost perfectly separates correctly and incorrectly imputed genotypes. We obtained similar trends for other target populations; for example, “AfAm” was best imputed with the ethnically closest reference panel “YRI” (see Additional file 1: Figure S1), although the overall imputation yield is substantially reduced in this population.
MaCH-Rsq and IMPUTE-info scores are typically used for post-imputation quality control. These scores are essentially based on the ratio of the sample variance of the allele frequency during imputation and its expected variance under Hardy-Weinberg equilibrium. The expected variance depends on the allelic frequency of the corresponding SNP in the reference panel considered. Thus, in contrast to the Hellinger or SEN score, imputation accuracy determined by MaCH-Rsq and IMPUTE-info scores depends on the reference sample used. We studied the relationship between the MaCH-Rsq/IMPUTE-info score and the average Hellinger score of GQ SNPs. As an example, results for AfAm imputed with the four different reference panels are shown in Fig. 3. We observed a monotonic relationship between the average Hellinger score and MaCH-Rsq/IMPUTE-info irrespective of the target dataset or reference used. However, it turned out that for given MaCH-Rsq/IMPUTE-info values, the corresponding average Hellinger scores were higher for genetically matching reference panels than for mismatching reference panels. This behaviour is especially pronounced for the AfAm population, where reference panels other than YRI result in particularly low average Hellinger scores even if the corresponding MaCH-Rsq/IMPUTE-info values are high (Fig. 3). This indicates that MaCH-Rsq/IMPUTE-info values measure imputation quality accurately only if a genetically matching reference is used.
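The variance ratio behind such SNP-level scores can be illustrated with a few lines of code. This is a sketch of the ratio as described in the text (observed variance of the imputed allele dosage divided by the variance 2p(1 - p) expected under Hardy-Weinberg equilibrium), with the allele frequency estimated from the dosages themselves; it is not the exact computation performed by MaCH or IMPUTE2, and the function name and dosages are hypothetical.

```python
def dosage_variance_ratio(dosages):
    """Observed dosage variance divided by the HWE-expected variance.

    `dosages` are the expected allele counts (0..2) of one SNP across
    individuals; the allele frequency is estimated as mean dosage / 2."""
    n = len(dosages)
    if n == 0:
        raise ValueError("no dosages given")
    mean = sum(dosages) / n
    p = mean / 2.0
    expected = 2.0 * p * (1.0 - p)
    if expected == 0.0:
        raise ValueError("SNP is monomorphic; ratio undefined")
    observed = sum((d - mean) ** 2 for d in dosages) / n
    return observed / expected


if __name__ == "__main__":
    confident = [0.0, 1.0, 2.0, 1.0]  # dosages close to hard genotype calls
    shrunken = [0.9, 1.0, 1.1, 1.0]   # dosages pulled towards the mean
    print(dosage_variance_ratio(confident), dosage_variance_ratio(shrunken))
```

Since the ratio uses no measured genotypes, it rewards confident dosages whether or not they are correct, which fits the observation that high values do not guarantee accuracy under a mismatched reference.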
Correlation of Nei’s G _{ ST } and F _{ ST }-related scores with imputation accuracy
We investigated the relationship between G _{ ST } and F _{ ST }-related scores and imputation accuracy. In view of the good correlation of G _{ ST } and F _{ ST }-related scores, we focus on G _{ ST } in the following. Since good Hellinger scores (≥0.45) represent correctly imputed genotypes in most cases, the percentage of genotypes with Hellinger score ≥0.45 as a function of G _{ ST } serves as the primary outcome of our analyses. Results can be found in Fig. 4, which shows the scatter plot of pairwise Nei’s G _{ ST } against the percentage of genotypes with Hellinger score ≥0.45 for all target populations imputed with the four reference panels.
We observed an almost linear relationship between G _{ ST } and this measure of imputation quality for all three software packages. Pearson’s correlation coefficients between G _{ ST } and imputation quality are -0.95, -0.93 and -0.91 for MaCH, MaCH-minimac and IMPUTE2, respectively. We conclude that G _{ ST } is a good predictor of imputation accuracy for all types of imputation frameworks used under the best-matching policy for selecting a reference panel. Small values of G _{ ST } imply high imputation accuracies and vice versa. Only AfAm is an outlier of this relationship, showing particularly low imputation quality even if YRI was used as the best-matching reference panel.
This outlying behaviour of AfAm was consistently observed for all three software packages considered. To analyse whether the sample size has an impact, we additionally considered the complete AfAm sample of POPRES with N = 252. For the reference sample YRI, the results of all software packages were slightly improved, but we also observed a small reduction in G _{ ST }. For the other reference panels, we observed no difference from the results of MaCH-minimac and IMPUTE2 obtained for the original sample (N = 40). However, for MaCH we observed a small deterioration of the Hellinger score for the larger sample. Results are shown in Additional file 1: Figures S10 and S11. We conclude that sample size alone does not explain the observed outlying behaviour of AfAm.
The correlation between the MaCH-Rsq/IMPUTE-info score and G _{ ST } showed similar behaviour (Additional file 1: Figure S2). Moreover, G _{ ST } was also highly correlated with the percentage of genotypes with SEN ≥0.95, as shown in Additional file 1: Figure S3. Analysing the relationship between G _{ ST } and imputation accuracy in more detail, we recommend that G _{ ST } between the target and reference population be smaller than 0.04 to achieve a yield of at least 87 % well-imputed common SNPs. AfAm is an exception to this rule. In our data, good imputation results with about 90 % correctly imputed GQ SNPs are obtained if the value of G _{ ST } is less than 0.01. Since the largest set of POPRES populations is from Europe, we performed a more detailed analysis of this subgroup (Fig. 5). Interestingly, using CEU as reference, we again obtained a trend towards lower imputation accuracy for larger values of G _{ ST }. Notably, the populations from East and South Europe show a somewhat lower yield of well-imputed genotypes than those from Central and Western Europe.
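The two cutoffs suggest a simple selection rule, sketched below under stated assumptions: the function name, the quality labels and the example G _{ ST } values are hypothetical, and the rule ignores the AfAm exception noted above, where accuracy was low even for the closest panel.

```python
def choose_reference(gst_by_panel, good_cutoff=0.01, acceptable_cutoff=0.04):
    """Pick the reference panel with the smallest pairwise G_ST to the
    target population and grade it with the two cutoffs from the text."""
    if not gst_by_panel:
        raise ValueError("no reference panels given")
    panel = min(gst_by_panel, key=gst_by_panel.get)
    gst = gst_by_panel[panel]
    if gst < good_cutoff:
        label = "good"
    elif gst < acceptable_cutoff:
        label = "acceptable"
    else:
        label = "poor"
    return panel, gst, label


if __name__ == "__main__":
    # Hypothetical G_ST values of one target population versus four panels.
    distances = {"CEU": 0.004, "MEX": 0.020, "JPT+CHB": 0.110, "YRI": 0.150}
    print(choose_reference(distances))
```

A "poor" label would indicate that none of the available panels is close enough and that results for that population should be interpreted with caution.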
Scatter plots of other measures of population distance (e.g. F _{ ST } ^{R} ) and imputation accuracy are similar. Results for F_{ST} ^{R} can be found in Additional file 1: Figures S4 and S5.
We also computed correlation coefficients between G _{ ST } and imputation accuracy of the 20 POPRES samples for each combination of reference, software and measure of imputation quality (Table 1). A strong linear trend was observed for all of these scenarios.
Dependence on degree of missingness
In order to study the impact of different degrees of missingness on the relation between imputation accuracy and F _{ ST }-related measures, we compared G _{ ST }, F _{ ST } ^{R} , F _{ ST } ^{WC} and F _{ ST } ^{mWC} with imputation accuracies at different degrees of missingness. Although the degree of missingness has a clear impact on overall imputation accuracy, it turned out that it has only a marginal impact on the observed linear relationship between population distance and imputation accuracy. Fig. 6 shows the results for Nei’s G _{ ST }. Results for other accuracy endpoints and measures of genetic distance are similar and can be found in the supplementary material (Additional file 1: Figures S6, S7 and S8).
Impact on low-frequency variants
Finally, we analysed how G _{ ST } and other F _{ ST }-related scores correlate with imputation accuracies of low-frequency variants (MAF ≤ 5 %). Fig. 7 shows the results for G _{ ST } and the software-specific measures of imputation accuracy. As expected, the overall yield of well-imputed low-frequency variants is lower than for the common variants. Moreover, the correlation of G _{ ST } and imputation accuracy is also markedly reduced compared to the common variants. The correlation between F _{ ST } and the software-specific measures for low-frequency and common variants is displayed in Additional file 1: Figure S9, showing similar results.
Discussion
Imputation of untyped SNPs or missing genotypes is a common technique in genome-wide analyses. However, the accuracy of imputation is difficult to predict as it depends on a variety of factors including pre-imputation quality control, genetic similarity of reference and target population, and their haplotype structure. We recently performed a comprehensive simulation study analysing the effect of pre-imputation quality control on the accuracy of imputation [29]. In the present paper, we studied the impact of reference panels on imputation accuracy. For this purpose, we considered the three software packages MaCH, MaCH-minimac and IMPUTE2, which can be run with a population-specific reference panel. Other approaches relying on mixed reference panels were proposed recently [21, 38], circumventing the issue of selecting appropriate references (e.g. IMPUTE2 [39], MaCH-Admix [7]). However, previous research [29, 46] and our own results (submitted) showed that such algorithms could reduce imputation quality compared to frameworks relying on specific references. Thus, software packages like MaCH or MaCH-minimac are still frequently in use [4, 40–44]. In this case, the reference panel should match the target population ethnically as closely as possible so that it can represent the haplotype structures of the individuals in the target population. Consequently, for these imputation frameworks, it is recommended to choose a reference panel best matched with the ancestry of the target population. This can be achieved, for example, by analysing measures of genetic distance between target and reference populations [20, 45]. However, the relation between genetic distance and imputation accuracy is not completely understood and requires further research.
In order to analyse this issue in more detail, we performed a simulation study on the basis of ethnic subsamples of the publicly available POPRES panel [22]. A total of 20 target datasets were considered. However, samples were small regarding both the number of SNPs and the number of individuals. This implies that our results may be valid only for small or medium-sized data sets.
Four ethnic reference data sets derived from HapMap3 (NCBI Build 36) were considered, namely CEU, YRI, MEX and JPT + CHB. Reference data are provided through the home pages of the MaCH and IMPUTE software developers. These four reference data sets allowed us to investigate the dependence of imputation accuracy on genetic similarity between target and reference panels for a larger number of combinations. In our paper, we focused on imputation of high-frequency variants. Although relying on the HapMap3 reference might be a limitation of our study, we expect that the results for these variants are similar when switching to 1000 Genomes reference panels. This expectation is based on our experience that the yield of well-imputed high-frequency variants is comparable (not shown).
We investigated imputation accuracy by comparing genotypes of masked SNPs with their posteriori distributions after different imputation scenarios. We only masked SNPs of good quality to ensure that error of measured genotypes is as small as possible. Several measures of comparisons of measured and imputed genotypes were considered, namely bestguess genotypes, SEN score and Hellinger score. While Hellinger score measures the agreement of measured and posterior genotype distribution, SEN score is maximal if their expectations coincide [29]. We also studied the software specific quality measure MaCHRsq and IMPUTEinfo score which however are only defined for entire SNPs rather than single genotypes. An important result of our study is that these measures critically depends on the reference panel used. As a consequence, these scores can predict the imputation accuracy only if the reference panel is genetically similar to the target population. Otherwise even high MaCHRsq/IMPUTEinfo scores do not guarantee that the estimated genotypes are correct.
To evaluate genetic similarity between different target and reference populations, we computed pairwise G _{ ST }, F _{ ST }related scores using our newly developed software fcGENE [27] and SMARTPCA [17] which calculates pairwise F _{ ST } ^{R} between any two populations. The measure G _{ ST } and all of the F _{ ST }related measures were strongly correlated. Relationships are almost linear except for the AfAm population which is a clear outlier. A detailed analysis revealed that G _{ ST } was slightly better correlated with F _{ ST } ^{WC} and F _{ ST } ^{mWC} than with F _{ ST } ^{R} . In previous research [20, 45], G _{ ST } and F _{ ST }related scores were estimated for SNPs first and then averaged across all SNPs. However, such type of estimates may not reflect haplotype diversity among populations of different ethnicities [15, 34, 35]. Therefore, we decided to estimate these measures in a haplotypewise manner averaging their components (i.e. numerator and denominator of the formula) over all SNPs first. Then, the measure is calculated as the ratio of these estimates.
Independent of the type of measure considered, we observed an almost linear relationship between genetic distance and resulting imputation accuracy. The only exception is AfAm, which shows particularly low imputation accuracy even when YRI is used as reference. Moreover, even though the degree of missingness was shown to be a strong determinant of imputation accuracy [29], the linearity of the above-mentioned relationship is preserved for different degrees of missingness. In view of this linear relationship, one can estimate imputation accuracy for a given pair of target and reference populations. Relying on Nei's G_{ST}, we observed satisfactory imputation results for a cutoff of 0.04. Excellent results are achieved if G_{ST} is less than 0.01. We recommend this threshold for selecting a reference panel, at least for medium-sized or small datasets such as those considered in our study. Larger samples of genetically different groups are required to generalize our result.
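As a minimal sketch of how these thresholds might be applied when choosing among candidate panels, the following picks the panel with the smallest G_{ST} to the target and labels it. Reading the 0.04 cutoff as "G_{ST} below 0.04" is an assumption, the function names and labels are hypothetical, and the distances in the example are invented for illustration rather than taken from the study.

```python
EXCELLENT_GST = 0.01     # threshold for excellent results stated in the text
SATISFACTORY_GST = 0.04  # cutoff for satisfactory results (assumed to mean "below 0.04")


def rate_panel(gst):
    """Translate a G_ST value into a coarse, illustrative quality label."""
    if gst < EXCELLENT_GST:
        return "excellent"
    if gst < SATISFACTORY_GST:
        return "satisfactory"
    return "use with caution"


def choose_reference_panel(gst_by_panel):
    """Return (panel, G_ST, label) for the panel genetically closest to the target."""
    panel = min(gst_by_panel, key=gst_by_panel.get)
    gst = gst_by_panel[panel]
    return panel, gst, rate_panel(gst)


# Made-up G_ST values for a hypothetical target population.
example = {"CEU": 0.004, "CHBJPT": 0.11, "MEX": 0.03, "YRI": 0.15}
print(choose_reference_panel(example))
```

Such a rule only ranks panels by genetic distance. As the AfAm results show, a small distance does not by itself guarantee high accuracy when linkage disequilibrium in the target population is low.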
Finally, we analysed the performance of genotype imputation for low-frequency variants. Although the imputation of low-frequency variants is known to be particularly difficult [46, 47], it has become important in the context of next-generation sequencing. The imputation quality of these variants is much lower than that of high-frequency variants. Still, we found a negative trend between genetic distance and imputation quality, which, however, is less pronounced than for high-frequency variants. Interestingly, besides ethnic similarity, the number of polymorphic sites in the reference panel influences the imputation accuracy of low-frequency variants.
As mentioned previously, imputation accuracy is not solely determined by the genetic similarity between the reference and the target population. An example is the AfAm population, which shows lower accuracy than expected on the basis of its genetic distance. The reason is the more complex haplotype structure and the generally reduced level of linkage disequilibrium in African populations, which are not captured by the genetic distance [20]. Additional populations of African ancestry are required to analyse this issue and its impact on the relation between genetic similarity and imputation accuracy in more detail.
Conclusion
We conclude that G_{ST} and other measures of genetic similarity between homogeneous target and reference populations are good predictors of imputation accuracy for imputation frameworks relying on best-matched reference panels. An almost linear relationship between G_{ST} and various measures of imputation accuracy was observed, with the exception of the African-American population considered. In our data, excellent imputation results are achieved if G_{ST} is less than 0.01. However, this threshold might not hold for African populations, for which reduced linkage disequilibrium is a stronger determinant of imputation accuracy. For low-frequency variants, the same trend between G_{ST} and imputation quality was observed, but here panels with a higher number of monomorphic sites (i.e. CHBJPT) perform below average. The software-specific measures MaCH-Rsq and IMPUTE-info must be interpreted with caution if the genetic distance between target and reference population is high.
Abbreviations
GWA: Genome-wide association
GWAS: Genome-wide association studies
F_{ST}: F-statistics
G_{ST}: G-statistics
GQ: Good-quality
HQ: High-quality
MAF: Minor allele frequency
CR: Call rate
EEU: East European populations
SASI: South Asians
EU: Europeans
Jap: Japanese
MEX: Mexican
SEU: South European
AfAm: African-Americans
References
1. Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet. 2008;40:638–45.
2. Clark AG, Li J. Conjuring SNPs to detect associations. Nat Genet. 2007;39:815–6.
3. 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.
4. Pistis G, Porcu E, Vrieze SI, Sidore C, Steri M, Danjou F, et al. Rare variant genotype imputation with thousands of study-specific whole-genome sequences: implications for cost-effective study designs. Eur J Hum Genet. 2014;23(7):975–83.
5. Peil B, Kabisch M, Fischer C, Hamann U, Bermejo JL. Tailored Selection of Study Individuals to be Sequenced in Order to Improve the Accuracy of Genotype Imputation: Choosing Individuals for Sequencing to Impute. Genet Epidemiol. 2015;39:114–21.
6. Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906–13.
7. Liu EY, Li M, Wang W, Li Y. MaCH-Admix: Genotype Imputation for Admixed Populations: MaCH-Admix: Imputation for Admixed Populations. Genet Epidemiol. 2013;37:25–37.
8. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34:816–34.
9. Fuchsberger C, Abecasis GR, Hinds DA. minimac2: faster genotype imputation. Bioinformatics. 2015;31:782–4.
10. Wright S. Genetical Structure of Populations. Nature. 1950;166:247–9.
11. Weir BS, Cockerham CC. Estimating F-Statistics for the Analysis of Population Structure. Evolution. 1984;38:1358–70.
12. Wright S. The Interpretation of Population Structure by F-Statistics with Special Regard to Systems of Mating. Evolution. 1965;19:395.
13. Nei M. Definition and estimation of fixation indices. Evolution. 1986;40:643–5.
14. Nei M, Chesser RK. Estimation of fixation indices and gene diversities. Ann Hum Genet. 1983;47:253–9.
15. Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat Rev Genet. 2009;10:639–50.
16. Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting FST: The impact of rare variants. Genome Res. 2013;23:1514–21.
17. Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461:489–94.
18. Hudson RR, Slatkin M, Maddison WP. Estimation of levels of gene flow from DNA sequence data. Genetics. 1992;132:583–9.
19. Nei M. Molecular Evolutionary Genetics. New York: Columbia University Press; 1987.
20. Huang L, Jakobsson M, Pemberton TJ, Ibrahim M, Nyambo T, Omar S, et al. Haplotype variation and genotype imputation in African populations. Genet Epidemiol. 2011;35:766–80.
21. Howie B, Marchini J, Stephens M, Chakravarti A. Genotype Imputation with Thousands of Genomes. G3 Genes Genomes Genetics. 2011;1:457–70.
22. Nelson MR, Bryc K, King KS, Indap A, Boyko AR, Novembre J, et al. The Population Reference Sample, POPRES: A Resource for Population, Disease, and Pharmacological Genetics Research. Am J Hum Genet. 2008;83:347–58.
23. POPRES: Population Reference Sample [http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/collection.cgi?study_id=phs000145.v2.p2]
24. International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–8.
25. Abecasis GR. Homepage of Imputation software MaCH1.0. 2014.
26. Marchini J. Homepage of IMPUTE2. 2009.
27. Roshyara NR, Scholz M. fcGENE: A Versatile Tool for Processing and Transforming SNP Datasets. PLoS One. 2014;9:e97589.
28. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.
29. Roshyara NR, Kirsten H, Horn K, Ahnert P, Scholz M. Impact of pre-imputation SNP-filtering on genotype imputation results. BMC Genet. 2014;15.
30. Troendle JF, Yu KF. A note on testing the Hardy-Weinberg law across strata. Ann Hum Genet. 1994;58(Pt 4):397–402.
31. Roshyara NR, Kirsten H, Horn K, Ahnert P, Scholz M. Impact of Pre-imputation SNP-filtering on Genotype Imputation Results. PLoS One. 2012;7(11):e50610.
32. De Bakker PIW, Ferreira MAR, Jia X, Neale BM, Raychaudhuri S, Voight BF. Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum Mol Genet. 2008;17:R122–8.
33. Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11:499–511.
34. Browning SR, Weir BS. Population Structure With Localized Haplotype Clusters. Genetics. 2010;185:1337–44.
35. Akey JM. Interrogating a High-Density SNP Map for Signatures of Natural Selection. Genome Res. 2002;12:1805–14.
36. Wickham H. ggplot2: Elegant Graphics for Data Analysis (Use R!). New York NY: Springer; 2009.
37. Kendall MG. A new measure of rank correlation. Biometrika. 1938;30:81–93.
38. Huang GH, Tseng YC. Genotype imputation accuracy with different reference panels in admixed populations. BMC Proc. 2014;8 Suppl 1:S64.
39. Howie BN, Donnelly P, Marchini J. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genet. 2009;5:e1000529.
40. Al Olama AA, Kote-Jarai Z, Berndt SI, Conti DV, Schumacher F, Han Y, et al. A meta-analysis of 87,040 individuals identifies 23 new susceptibility loci for prostate cancer. Nat Genet. 2014;46:1103–9.
41. Chang ALS, Raber I, Xu J, Li R, Spitale R, Chen J, et al. Assessment of the Genetic Basis of Rosacea by Genome-Wide Association Study. J Invest Dermatol. 2015;135(6):1548–55.
42. Kreiner-Møller E, Medina-Gomez C, Uitterlinden AG, Rivadeneira F, Estrada K. Improving accuracy of rare variant imputation with a two-step imputation approach. Eur J Hum Genet. 2015;23:395–400.
43. Van Leeuwen EM, Karssen LC, Deelen J, Isaacs A, Medina-Gomez C, Mbarek H, et al. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels. Nat Commun. 2015;6:6065.
44. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, et al. Bayesian Test for Colocalisation between Pairs of Genetic Association Studies Using Summary Statistics. PLoS Genet. 2014;10:e1004383.
45. Huang L, Li Y, Singleton AB, Hardy JA, Abecasis G, Rosenberg NA, et al. Genotype-Imputation Accuracy across Worldwide Human Populations. Am J Hum Genet. 2009;84:235–50.
46. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet. 2012;44:955–9.
47. Zheng HF, Rong JJ, Liu M, Han F, Zhang XW, Richards JB, et al. Performance of Genotype Imputation for Low Frequency and Rare Variants from the 1000 Genomes. PLoS One. 2015;10:e0116487.
Acknowledgements
MS and NRR were funded by the Leipzig Interdisciplinary Research Cluster of Genetic Factors, Clinical Phenotypes and Environment (LIFE Center, Universitaet Leipzig). LIFE is funded by the European Union, by the European Regional Development Fund (ERDF), by the European Social Fund and by the Free State of Saxony within the framework of the excellence initiative.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Study design: NRR, MS. Data analysis and simulation: NRR. Writing the manuscript: NRR, MS. Both authors read and approved the final manuscript.
Additional file
Additional file 1:
Supplementary Table S1 and Supplementary Figures S1–S11.
Rights and permissions
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Roshyara, N.R., Scholz, M. Impact of genetic similarity on imputation accuracy. BMC Genet 16, 90 (2015). doi:10.1186/s12863-015-0248-2
Keywords
 Genotype imputation
 Reference panel
 Genetic similarity
 F_ST
 G_ST
 SNP data
 Imputation quality