Population-genetic comparison of the Sorbian isolate population in Germany with the German KORA population using genome-wide SNP arrays

Gross, Arnd; Tönjes, Anke; Kovacs, Peter; Veeramah, Krishna R; Ahnert, Peter; Roshyara, Nab R; Gieger, Christian; Rueckert, Ina-Maria; Loeffler, Markus; Stoneking, Mark; Wichmann, Heinz-Erich; Novembre, John; Stumvoll, Michael; Scholz, Markus

doi:10.1186/1471-2156-12-67

Research article
Open access
Published: 28 July 2011

BMC Genetics volume 12, Article number: 67 (2011)

Abstract

Background

The Sorbs are an ethnic minority in Germany with putative genetic isolation, making the population interesting for disease mapping. A sample of N = 977 Sorbs is currently analysed in several genome-wide meta-analyses. Since genetic differences between populations are a major confounding factor in genetic meta-analyses, we compare the Sorbs with the German outbred population of the KORA F3 study (N = 1644) and other publicly available European HapMap populations by population genetic means. We also aim to separate effects of over-sampling of families in the Sorbs sample from effects of genetic isolation and compare the power of genetic association studies between the samples.

Results

The degree of relatedness was significantly higher in the Sorbs. Principal components analysis revealed a west to east clustering of KORA individuals born in Germany, KORA individuals born in Poland or Czech Republic, Half-Sorbs (less than four Sorbian grandparents) and Full-Sorbs. The Sorbs cluster is nearest to the cluster of KORA individuals born in Poland. The number of rare SNPs is significantly higher in the Sorbs sample. FST between KORA and Sorbs is an order of magnitude higher than between different regions in Germany. Compared to the other populations, Sorbs show a higher proportion of individuals with runs of homozygosity between 2.5 Mb and 5 Mb. Linkage disequilibrium (LD) at longer range is also slightly increased but this has no effect on the power of association studies.

Oversampling of families in the Sorbs sample causes detectable bias regarding higher FST values and higher LD but the effect is an order of magnitude smaller than the observed differences between KORA and Sorbs. Relatedness in the Sorbs also influenced the power of uncorrected association analyses.

Conclusions

Sorbs show signs of genetic isolation which cannot be explained by over-sampling of relatives, but the effects are moderate in size. The Slavonic origin of the Sorbs is still genetically detectable.

Regarding LD structure, a clear advantage for genome-wide association studies cannot be deduced. The significant amount of cryptic relatedness in the Sorbs sample results in inflated variances of Beta-estimators which should be considered in genetic association analyses.

Background

The Sorbs living in the Upper Lusatia region of Eastern Saxony are one of the few historic ethnic minorities in Germany. They are of Slavonic origin and speak a West Slavic language (Sorbian), and it is assumed that they have lived in ethnic isolation among the German majority during the past 1100 years [1]. Therefore, this population may be of special interest for genetic studies of complex traits.

The value of isolated populations for the discovery of genetic modifiers of diseases or quantitative traits is a matter of controversy [2–6]. On the one hand, reduced genetic and environmental variability of isolated populations could increase genotypic relative risks [7, 8]. In combination with the generally higher degree of linkage disequilibrium (LD) in isolated populations, this could improve the power of genetic association studies [5, 6, 9–11]. On the other hand, studies in isolated populations are often limited in size and, therefore, cannot match modern genome-wide association studies and meta-analyses comprising several tens of thousands of individuals.

Nowadays, it is common practice to combine all available genotyped and phenotyped populations in large-scale, whole genome meta-analyses or pooled analyses in order to identify even very small genetic effects as commonly observed for complex traits. Spurious associations caused by the genetic sub-structures of combined populations are the most serious concern of this approach [12–15], implying the need for appropriate adjustment strategies [16, 17]. This is especially true if evidence from isolated and outbred populations is combined, as this approach necessitates a thorough comparison of populations by population genetic means in order to determine their "degree of isolation" [6]. For this purpose, different methods have been proposed in the literature. For example, length and number of runs of homozygosity (ROHs) are discussed as an appropriate measure of isolation since they measure the degree of parental consanguinity [18]. LD is expected to be higher in isolated populations because of lower generation numbers resulting in fewer recombination events [5, 6]. Due to the smaller size of the founder population, it can also be expected that there is a lower number of polymorphisms in isolated populations [6, 19, 20]. Other markers of population structure such as F-statistics [21] are related to the measures mentioned above. Furthermore, genetic distances between populations can be determined by principal components analysis (PCA), which makes it possible to quantify how closely populations are related [22]. With this technique, genetic information can be mapped to topographic maps [14], which provides a further indicator of isolation: an isolated population may be genetically distant from its geographic location. So far there appears to be no single measure sufficient to characterize the isolation of a population.

Another characteristic feature of isolated populations is the putatively higher degree of cryptic relatedness in randomly drawn samples. This is a serious concern in genetic association analysis and needs to be addressed with appropriate statistical methods [17, 23–25]. Relatedness of individuals could also interact with the above mentioned measures of isolation of populations. Thus, when comparing two populations with different degrees of cryptic relatedness, it is not easy to decide whether differences in these measures can be traced back to different degrees of isolation or simply to over-sampling of related subjects.

The degree of isolation of the Sorbs has been studied in the past by the analysis of Y-chromosomal markers [26]. Recently, we compared a subset of about 200 Sorbs with other European isolates using 30,000 SNPs measured by microarrays [1]. In this analysis, the Sorbs expressed only moderate signs of isolation. Here, we analyse a sample of N = 977 Sorbs, which is currently included in several genome-wide association studies e.g. [27, 28], and compare the Sorbs with the German outbred population of the KORA study [29]. Using the KORA study (N = 1644) and a larger sample of Sorbs (N = 977) provides more power than previous studies for comparing population genetic patterns between Sorbs and their neighbours. For this purpose, we assess the above mentioned population genetic characteristics: PCA, number of rare SNPs, F-statistics, ROHs, and LD. All analyses are based on genome-wide SNP array data. We also aim to separate effects of cryptic relatedness from effects of genetic isolation.

Furthermore, we analyse how differences between populations can be translated to differences in power of genetic association studies within these samples. We analyse the influence of genetic effect size, LD structure, heritability, and relatedness on power.

Methods

Study Populations

Sorbs

The Sorbs are of Slavonic origin, and lived in ethnic isolation among the Germanic majority during the past 1100 years [1]. Today, the Sorbian-speaking, Catholic minority comprises 15,000 full-blooded Sorbs resident in about 10 villages in rural Upper Lusatia (Oberlausitz), Eastern Saxony. A convenience sample of this population was collected including unrelated subjects as well as families. Details of the study population can be found elsewhere [28, 30]. Genotyping and metabolic phenotyping of this sample was approved by the ethics committee of the University of Leipzig and is in accordance with the declaration of Helsinki. All subjects gave written informed consent before taking part in the study. A subset of individuals were genotyped with either Affymetrix Human Mapping 500 K Array Set (N = 483) or Affymetrix Genome-Wide Human SNP Array 6.0 (N = 494). Details on genotyping are described in [28]. A total of 977 subjects were available after quality control.

KORA

The study population was recruited from the KORA/MONICA S3 survey, a population-based sample from the general population living in the region of Augsburg, Southern Germany, which was carried out in 1994/95. In a follow-up examination of S3 in 2004/05 (KORA F3), 3006 subjects participated. Recruitment and study procedures of KORA have been described elsewhere [29, 31]. For KORA F3 500 K, we selected 1644 of these participants, who were then aged 35 to 79 years. Informed consent was given, and the study was approved by the local ethics committee. All KORA participants have a German passport. Genotyping of these individuals was performed with the Affymetrix Gene Chip Human Mapping 500 K Array Set as described in [32].

HapMap

174 CEU (CEPH (Centre d'Etude du Polymorphisme Humain) from Utah) and 88 TSI (Toscans in Italy) samples were taken from a recent HapMap Collection (Public Release 27, NCBI build 36, The International HapMap Project). From the CEU sample, we removed 58 children, five individuals with call rate < 90% and one individual because of cryptic relatedness (NA07045 because of lower call-rate compared to NA12813 [33]). In summary, we analysed 110 CEU and 88 TSI samples.

Data Analysis

Genotype Imputation and Quality Control

Missing genotypes of the KORA and Sorbs samples were imputed separately using MACH Imputation Software with standard settings [34].

After imputation, we checked 471,012 autosomal SNPs in the overlap of the Affymetrix Human Mapping 500 K Array Set and Affymetrix Genome-Wide Human SNP Array 6.0 for quality.

SNPs with a call rate less than 95% in all four study populations combined, prior to imputation, were filtered (34,711 SNPs). Hardy-Weinberg equilibrium (HWE) was tested across populations using the stratified test proposed in [35]. 10,712 SNPs with p-values less than 10^-6 were eliminated. Finally, 14,508 SNPs showing unexpectedly high differences of allelic frequencies between genotyping platforms in the Sorbs sample were eliminated (p-value < 10^-7, see [1] for further details).

Since several SNPs violated more than one of our criteria, we discarded a total of 46,536 SNPs and analysed 424,476 remaining SNPs.

For estimation of ROHs (see below) the number of analysed SNPs is reduced to 306,081 by matching SNPs on Affymetrix chips with available SNPs in the HapMap CEU and TSI samples. Due to the high sensitivity of the PCA (see below) we decided to tighten our quality criteria for this kind of analysis. Only SNPs with a call rate of at least 99% were included for PCA, which reduced the number of SNPs to 199,702.

An overview of the data pre-processing workflow can be found in Additional file 1.

Estimation of Relatedness

Pair-wise relatedness between all individuals of KORA and Sorbs was estimated by the method described in [36]. For first degree relatives one would expect a value of r = 0.5, for second degree relatives a value of r = 0.25, and so on. Two individuals were considered as unrelated if the pair-wise relatedness estimate was not greater than 0.2, which approximately corresponds to the exclusion of first and second degree relatives.

For analyses of dependence of measures of population genetic comparison on relatedness, we define two subsamples used for all subsequent analyses: For the first subsample, the complete Sorbs sample (Sorbs₉₇₇, N = 977) was matched with a randomly selected subset of N = 977 unrelated KORA subjects born in Germany (KORA₉₇₇). For the second subsample, a subset of N = 532 unrelated Sorbs (Sorbs₅₃₂) was matched with a subset of N = 532 KORA subjects (KORA₅₃₂) randomly selected from KORA₉₇₇.

Unrelated subjects were selected by an algorithm which implements a step-by-step removal of individuals showing the highest number of relationships to other members of the population until no pair of individuals with relatedness > 0.2 remained.
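The step-by-step removal described above can be sketched as a greedy procedure on the graph of related pairs. This is a minimal illustration only: the function name, the input format (a dictionary of pair-wise estimates), and the alphabetical tie-breaking rule are assumptions made here and are not taken from the authors' implementation.

```python
from collections import defaultdict

def select_unrelated(ids, relatedness, threshold=0.2):
    """Greedily remove the individual with the most relationships above
    `threshold` until no related pair remains.

    ids         -- iterable of individual identifiers
    relatedness -- dict mapping (id_a, id_b) to a pair-wise relatedness estimate
    (Input format and tie-breaking are illustrative assumptions.)
    """
    neighbours = defaultdict(set)
    for (a, b), r in relatedness.items():
        if r > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept = set(ids)
    while kept:
        # Individual with the highest number of remaining relationships;
        # ties are broken by identifier so the result is deterministic.
        worst = max(kept, key=lambda i: (len(neighbours[i]), str(i)))
        if not neighbours[worst]:
            break  # no pair above the threshold is left
        kept.remove(worst)
        for other in neighbours[worst]:
            neighbours[other].discard(worst)
        neighbours[worst].clear()
    return kept

if __name__ == "__main__":
    pairs = {("A", "B"): 0.5, ("A", "C"): 0.5, ("B", "C"): 0.5,
             ("D", "E"): 0.25, ("A", "F"): 0.1}
    print(sorted(select_unrelated("ABCDEF", pairs)))
```

A greedy rule of this kind does not guarantee the largest possible unrelated subset, but it is simple and removes highly connected family members first.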

Principal components analysis

PCA is suitable to map genetic variance to a few dimensions expressing the highest degree of variance [16, 22]. It has been shown recently that the application of this technique to genome-wide genetic data is powerful enough to mirror even small geographic distances in Europe [14, 37].

Since PCA results are biased in case of unequal population sizes [38], it was necessary to analyse subsamples of our populations. We performed PCA of 350 individuals from 7 subsamples of size N = 50, generated from the most unrelated individuals of our four study populations. The subsamples were defined as follows. Three subsamples were created from N = 1336, N = 140, and N = 80 individuals from KORA, who were born in Germany, in the Czech Republic, and in Poland, respectively. Two subsamples were generated from the Sorbs grouped by their degree of Sorbian ancestry. We identified 786 "Full"-Sorbs who stated that all four grandparents are Sorbs and 160 "Half"-Sorbs where at least one grandparent was not Sorbian. Another two subsamples were built from 110 CEU and 88 TSI samples.

PCA was done with iterative removal of outliers (default 5 iterations) and LD correction in consecutive SNPs (involving two previous SNPs as recommended in the manual of the EIGENSOFT package).

Rare SNPs

Isolated populations are supposed to have reduced genetic variability resulting in a higher number of rare SNPs. By definition, a SNP has a minor allelic frequency (MAF) of at least 1%. To account for sampling variance, we calculated the exact 95% confidence interval of the MAF and considered a SNP as rare if the entire interval was below one percent. This is equivalent to less than 11 observed alleles in Sorbs₉₇₇ or KORA₉₇₇ and less than five observed alleles in Sorbs₅₃₂ or KORA₅₃₂, respectively. The odds of finding rare SNPs were compared between KORA and Sorbs using Fisher's exact test.
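The allele-count thresholds quoted above can be checked with a short calculation. The sketch below assumes that the "exact" interval is the Clopper-Pearson interval and that allele counts refer to 2N chromosomes per sample; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for k in range(x + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_term)
    return total

def exact_upper_limit(x, n, alpha=0.05):
    """Upper limit of the two-sided Clopper-Pearson interval, found by
    bisection on the p that solves P(X <= x | p) = alpha / 2."""
    if x >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(x, n, mid) > alpha / 2:
            lo = mid  # CDF still too large, so the limit lies above mid
        else:
            hi = mid
    return (lo + hi) / 2

def max_rare_allele_count(n_individuals, maf_limit=0.01):
    """Largest minor-allele count whose whole interval lies below maf_limit
    (assumes diploid genotypes, i.e. 2 * n_individuals alleles)."""
    n_alleles = 2 * n_individuals
    x = 0
    while exact_upper_limit(x + 1, n_alleles) < maf_limit:
        x += 1
    return x

if __name__ == "__main__":
    for n in (977, 532):
        print(n, max_rare_allele_count(n))
```

Under these assumptions the largest admissible counts are 10 alleles for N = 977 and 4 alleles for N = 532, which agrees with "less than 11" and "less than five" in the text.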

F-statistics

To characterize the variance of allelic frequencies within and between populations, we calculated F-statistics.

The inbreeding coefficient F_IS measures the correlation of alleles within an individual relative to the corresponding population. It is calculated by estimating the deviation of the observed number of heterozygous genotypes from the number expected under HWE. For every SNP, we calculated unbiased estimates as presented in [21], computed their weighted average, and determined the standard error of the estimates by jack-knifing over individuals.
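To make the idea of heterozygote deficit or excess concrete, the following sketch computes a simple moment estimate of F_IS as one minus the ratio of summed observed to summed expected heterozygosity. It is a simplified stand-in, not the unbiased estimator of [21], and it omits the jack-knife; the function name and input format are assumptions.

```python
def fis_ratio_of_sums(genotype_counts):
    """Simple F_IS estimate: 1 - sum(H_obs) / sum(H_exp) over SNPs.

    genotype_counts -- list of (n_AA, n_AB, n_BB) tuples, one per SNP.
    This is a plain moment estimator for illustration, not the unbiased
    estimator of the cited reference.
    """
    h_obs_sum = 0.0
    h_exp_sum = 0.0
    for n_aa, n_ab, n_bb in genotype_counts:
        n = n_aa + n_ab + n_bb
        p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
        h_obs_sum += n_ab / n             # observed heterozygosity
        h_exp_sum += 2 * p * (1 - p)      # expected under HWE
    return 1.0 - h_obs_sum / h_exp_sum

if __name__ == "__main__":
    print(fis_ratio_of_sums([(25, 50, 25)]))   # genotypes in exact HWE proportions
    print(fis_ratio_of_sums([(40, 20, 40)]))   # heterozygote deficit
    print(fis_ratio_of_sums([(20, 60, 20)]))   # heterozygote excess
```

Positive values indicate fewer heterozygotes than expected under HWE, negative values indicate an excess.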

Correlation of alleles of individuals in the same population was estimated by the co-ancestry coefficient F_ST. Since F_ST quantifies the amount of genetic variation between populations, it is used to define genetic distances between populations. We assessed F_ST for pairs of populations using a combined estimate of all SNPs [21] and calculated the standard error of estimates again by jack-knifing over individuals.

Runs of homozygosity

Counting ROHs is useful to detect inbreeding [18]. ROHs were determined in all individuals from KORA, Sorbs, CEU, and TSI using the PLINK Package (Version 1.07) with standard settings except for two parameters as noted below. PLINK estimates ROHs by searching for contiguous runs of homozygote genotypes. For this purpose, a window (default length 5000 kb, minimum 50 SNPs) is moved along the genome. To account for possible genotyping errors, at each SNP the homozygosity of the window is assessed allowing one (default) heterozygous genotype and five (default) missing calls. For each SNP the proportion of overlapping homozygous windows is calculated. If this proportion is high enough (default 5%) the SNP is considered to be part of a homozygous segment. Only homozygous segments longer than a given threshold (500 kb, default 1000 kb), consisting of a minimum number of 100 SNPs (default) and comprising a minimum SNP density of one SNP per 50 kb (default) were denoted as ROH. A homozygous segment can be split in two if two SNPs are at least 100 kb apart (default 1000 kb). Details on the algorithm can be found on the PLINK Homepage (see URLs).
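The parameter settings listed above map onto PLINK 1.07 command-line options. The sketch below only assembles an argument list; the fileset prefix and output name are hypothetical, and the authors' actual invocation is not given in the text.

```python
def plink_roh_args(bfile, out):
    """Build a PLINK 1.07 argument list for the ROH settings described above.

    bfile -- binary fileset prefix (hypothetical name supplied by the caller)
    out   -- output prefix (hypothetical name supplied by the caller)
    """
    return [
        "plink",
        "--bfile", bfile,
        "--homozyg",
        "--homozyg-window-kb", "5000",         # sliding window length (default)
        "--homozyg-window-snp", "50",          # SNPs per window (default)
        "--homozyg-window-het", "1",           # heterozygous calls allowed (default)
        "--homozyg-window-missing", "5",       # missing calls allowed (default)
        "--homozyg-window-threshold", "0.05",  # proportion of homozygous windows (default)
        "--homozyg-kb", "500",                 # minimum segment length (PLINK default 1000)
        "--homozyg-snp", "100",                # minimum SNPs per segment (default)
        "--homozyg-density", "50",             # at least one SNP per 50 kb (default)
        "--homozyg-gap", "100",                # split if neighbouring SNPs are this far apart (PLINK default 1000)
        "--out", out,
    ]

if __name__ == "__main__":
    print(" ".join(plink_roh_args("sorbs", "sorbs_roh")))
```

The list can be passed to `subprocess.run(args, check=True)` on a system where PLINK is installed.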

Linkage disequilibrium

In the Sorbs and KORA samples, we calculated pair-wise LD for all SNPs on Chromosome 22 (5382 markers) using robust estimators [39]. We used the widely accepted measures r [40] and |D'| [41] to quantify LD. Since both measures depend on allelic frequencies, we also used the newly proposed measure |η₁|, which is independent of allelic frequencies and hence especially useful when comparing populations [42]. The measure η₁ is a monotone function of the odds ratio λ [43] and ranges between -1 and 1; its defining formula is given in [42].

Its absolute value is the percentage of SNP pairs under the non-informative uniform distribution with less extreme LD than the one observed (see [42] for details). Measures of LD were averaged using bins of 5 kb length as proposed by Olshen et al. [44]. Resulting means were smoothed by a LOWESS estimator [45].
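The averaging of LD over distance bins can be illustrated as follows. The sketch assumes that pair-wise LD values have already been computed and that SNP positions are given in base pairs; the function name and input format are hypothetical, and the subsequent LOWESS smoothing is not shown.

```python
from collections import defaultdict

def mean_ld_by_distance(pairs, bin_kb=5):
    """Average LD values within bins of inter-SNP distance.

    pairs  -- iterable of (position_i_bp, position_j_bp, ld_value)
    bin_kb -- bin width in kb (5 kb as in the text)
    Returns a dict mapping the lower bin edge in kb to the mean LD in that bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    width_bp = bin_kb * 1000
    for pos_i, pos_j, ld in pairs:
        b = abs(pos_j - pos_i) // width_bp
        sums[b] += ld
        counts[b] += 1
    return {b * bin_kb: sums[b] / counts[b] for b in sorted(sums)}

if __name__ == "__main__":
    example = [(1000, 3000, 0.8), (1000, 4500, 0.6), (1000, 8000, 0.2)]
    print(mean_ld_by_distance(example))
```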

Comparison of power assuming uncorrelated phenotypes

We analysed how the observed differences in LD structure between KORA and Sorbs can be translated into differences in power of genetic association studies. For this purpose, we assumed a linear regression model y = β₁s₁ + ε₁ of a random phenotype y which is influenced by a genotype s₁ of a causative SNP, and ε₁ is the residual Gaussian error of the model.

The SNP is assumed to explain a pre-specified proportion of the total variance of the phenotype, which is referred to as the explained variance in the following. In consequence, we can assume β₁ = 1 without loss of generality. Within a distance of ± 2 Mb, we then analysed the model y = β₂s₂ + ε₂ for a second SNP that is in maximum LD (measured by r) with the causative SNP. That is, we analysed the best proxy of the causative SNP rather than the causative SNP itself, modelling the marker principle of genetic association studies. The estimator of β₂ is normally distributed; its distribution depends on s₁, s₂, and the explained variance, and is expressed in terms of the number of individuals n, the genotype s_2i of the i-th individual, and the average genotype of the second SNP. The formula is derived in Additional file 2.

We calculated the power of the regression analysis, i.e. the probability that the observed p-value is smaller than a given significance level (p-value threshold) when testing against the null hypothesis β₂ = 0, using this formula. This was done for all SNPs on Chromosome 22 in KORA₉₇₇, KORA₅₃₂, Sorbs₉₇₇, and Sorbs₅₃₂. The distribution of power was derived using the results of all SNPs on Chromosome 22. Results were compared between the KORA and Sorbs samples of equal size.
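For orientation, the sampling distribution of the proxy estimator can be sketched from the two regression models by ordinary least squares with an intercept, conditioning on the genotypes. The symbols \hat\beta_2, \bar s_1, \sigma^2_{\varepsilon_1} and h^2 (the pre-specified explained variance) are notation assumed here for illustration; the authors' own expression is the one derived in Additional file 2 and may be parameterised differently.

```latex
% Least-squares estimator for the proxy SNP s_2 when the phenotype is
% generated by the causative SNP s_1 with \beta_1 = 1:
\hat\beta_2
  = \frac{\sum_{i=1}^{n} (s_{2i}-\bar s_2)(y_i-\bar y)}
         {\sum_{i=1}^{n} (s_{2i}-\bar s_2)^2},
\qquad
y_i = s_{1i} + \varepsilon_{1i},
\quad \varepsilon_{1i} \sim N\!\left(0, \sigma^2_{\varepsilon_1}\right).

% Conditional on the genotypes, \hat\beta_2 is a linear function of the
% Gaussian errors and therefore normally distributed:
\hat\beta_2 \sim
N\!\left(
  \frac{\sum_{i=1}^{n} (s_{2i}-\bar s_2)(s_{1i}-\bar s_1)}
       {\sum_{i=1}^{n} (s_{2i}-\bar s_2)^2},
  \;
  \frac{\sigma^2_{\varepsilon_1}}
       {\sum_{i=1}^{n} (s_{2i}-\bar s_2)^2}
\right).

% The residual variance follows from the explained variance h^2:
h^2 = \frac{\operatorname{Var}(s_1)}{\operatorname{Var}(s_1) + \sigma^2_{\varepsilon_1}}
\;\Longrightarrow\;
\sigma^2_{\varepsilon_1} = \frac{1-h^2}{h^2}\,\operatorname{Var}(s_1).
```

If the proxy coincides with the causative SNP, the mean reduces to 1; for an imperfect proxy it shrinks with the covariance between s₁ and s₂, which is how LD structure enters the power calculation in this sketch.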

Comparison of power assuming correlated phenotypes

In the previous section, we derived formulae for the estimation of power under the assumption of uncorrelated phenotypes. This approach applies for either a negligible relatedness structure of the individuals or a weak correlation of phenotypes of related individuals. Applying a GRAMMAR approach [17], deviations from this situation can be corrected resulting again in the situation considered in the previous section.

However, to our knowledge, it is still not common practice in genome-wide association studies to use this approach to correct for relatedness. Therefore, we aim to study the situation in which the phenotypes are correlated but in which the corresponding individuals were analysed as independent even though they are not.

Following Amin et al. [17], we simulated phenotypes y on the basis of the mixed model y = β₁s₁ + g + ε₁, comprising a fixed effect of genotypes s₁, a random effect g representing the residual polygenic effects, and non-genetic residuals ε₁. Here, G represents the pair-wise relatedness matrix, which governs the covariance of the polygenic effects between individuals. The model results in non-trivial covariance of phenotypes of different individuals. For each SNP we drew 1000 samples from the model and analysed the linear model y = β₂s₂ + ε₂ for a second SNP which is in maximum LD with the first SNP, in complete analogy to the procedure developed for uncorrelated phenotypes (see previous section). Different degrees of heritability were simulated, defined in terms of the variance explained by genotypes s₁ and the variance explained by the polygenic effects g. Providing values for these two quantities yields the variance components of the polygenic effects and of the residuals, which follow after some calculations.

Statistical Software and Web-Resources

HapMap data were downloaded from [46]. Estimation of Eigenvectors for comparison of all subsamples was done with the EIGENSOFT package (Version 3.0, [47]). ROHs were determined by the PLINK Package (Version 1.07, [48]) [49].

All other calculations were performed using the Statistical Software package R (Version 2.8.0, [50]) [51].

Results

For population genetic comparison of the Sorbian minority in Germany with the German KORA population, several measures of genetic isolation were applied to genome-wide SNP array data.

Relatedness

We analysed the relatedness of all 476,776 pairs of individuals in the Sorbs and all 1,350,546 pairs in the KORA samples. Results are shown in Figure 1. Frequencies of relationships differ remarkably between the two samples. As the different scales of the histograms emphasize, first and second degree relationships are clearly more numerous in the Sorbs than in KORA. Numbers of pairs with estimates over a given threshold are shown in Table 1 for both populations. We also provide odds ratios for encountering a related pair.

Table 1 Distribution of pair-wise relatedness estimates

To achieve samples without pairs of individuals with relatedness-estimates greater than 0.2, it was necessary to exclude 445 Sorbs and 33 KORA individuals, resulting in subsamples of 532 Sorbs and 1,611 KORA individuals.

Principal components analysis

Results of PCA after removal of outliers and LD correction are shown in Figure 2. The figure comprises all 150 individuals from KORA, 97 Sorbs, 49 HapMap CEU and 48 HapMap TSI after outlier removal.

A plot of the genetic variance represented by the first two principal components clearly reflects the geographic origin of these populations. TSI samples are relatively far away from the other clusters, which orients a north to south axis. The KORA population is very close to the CEU HapMap population. In contrast, the Sorbian population clusters distinctly further east. There is a clear trend of west to east clustering of KORA individuals born in Germany, KORA individuals born in Poland or Czech Republic, Half-Sorbs, and finally, Full-Sorbs. The Sorbs clusters are nearest to the cluster of KORA individuals born in Poland.

Rare SNPs

When analysing 424,476 quality-controlled SNPs in 977 Sorbs (Sorbs₉₇₇) and the random sample of 977 individuals from KORA (KORA₉₇₇), we counted 51,204 rare SNPs in Sorbs₉₇₇ and 49,721 rare SNPs in KORA₉₇₇ (p-value 6.7 × 10^-7). In the subset of 532 unrelated Sorbs (Sorbs₅₃₂) and the random sample of 532 unrelated individuals from KORA (KORA₅₃₂), we again counted more rare SNPs in Sorbs₅₃₂ than in KORA₅₃₂, i.e. 49,257 and 47,913 (p-value 4.7 × 10^-6), respectively.
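The size of the effect behind these p-values can be illustrated with the reported counts. The sketch below computes the odds ratio of a SNP being rare and a two-sided p-value from a normal approximation for the difference of two proportions. This is a stand-in for Fisher's exact test, not a reimplementation of it, and it assumes the comparison is a 2 × 2 table of rare versus non-rare SNPs per sample.

```python
import math

def compare_rare_counts(rare_a, rare_b, total):
    """Odds ratio and approximate two-sided p-value for rare-SNP counts.

    rare_a, rare_b -- number of rare SNPs in sample A and sample B
    total          -- number of SNPs analysed in each sample
    Uses a pooled two-proportion z-test as an approximation (assumption);
    the published p-values come from Fisher's exact test.
    """
    odds_ratio = (rare_a / (total - rare_a)) / (rare_b / (total - rare_b))
    p_a, p_b = rare_a / total, rare_b / total
    pooled = (rare_a + rare_b) / (2 * total)
    se = math.sqrt(2 * pooled * (1 - pooled) / total)
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return odds_ratio, z, p_value

if __name__ == "__main__":
    for label, a, b in (("977", 51204, 49721), ("532", 49257, 47913)):
        oratio, z, p = compare_rare_counts(a, b, 424476)
        print(label, round(oratio, 4), round(z, 2), f"{p:.1e}")
```

The odds ratios are only slightly above one, which illustrates that the difference is statistically detectable mainly because of the large number of SNPs.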

F-Statistics

Estimating F_IS in the samples KORA₉₇₇ and KORA₅₃₂ resulted in slightly positive values, with the smaller value in KORA₉₇₇. In contrast, in the samples Sorbs₉₇₇ and Sorbs₅₃₂, we find slightly negative values, with the smaller value in Sorbs₉₇₇.

F_ST estimates are somewhat higher between KORA₉₇₇ and Sorbs₉₇₇ than between KORA₅₃₂ and Sorbs₅₃₂. F_ST estimates are higher than corresponding F_IS estimates, indicating a clear genetic distance between the two populations. All statistics can be found in Table 2.

Table 2 Inbreeding and co-ancestry coefficients

Runs of Homozygosity

ROHs were determined for the populations KORA, Sorbs₉₇₇, Sorbs₅₃₂, CEU, and TSI. Percentages of individuals in these populations containing at least one ROH in a specified length interval were calculated (Figure 3). Compared to the other populations, Sorbs show a higher proportion of individuals with ROHs between 2.5 Mb and 5 Mb.

In a second step, mean total length of ROHs with a given minimum length was estimated averaged over the individuals of each population (Figure 4). Again, Sorbs differ from the other populations and are characterized by higher mean total length of ROHs. However, the effect is less pronounced if only long ROHs are considered. The mean total length of ROHs is shorter for Sorbs₅₃₂ than for Sorbs₉₇₇ but the difference is small.

Linkage Disequilibrium

Three measures of LD were calculated for KORA₉₇₇, KORA₅₃₂, Sorbs₉₇₇, and Sorbs₅₃₂. Results for η₁ are shown in Figure 5. Other measures such as r and D' behave similarly (data not shown). LD in the KORA sample is markedly lower at long ranges compared to Sorbs. This result is robust against dropping related individuals in the Sorbs sample.

As expected for KORA₉₇₇ and KORA₅₃₂ a small sample size bias can be observed. In contrast the estimators for Sorbs₉₇₇ and Sorbs₅₃₂ are virtually identical.

Comparison of power assuming uncorrelated phenotypes

The power to detect causal SNPs was calculated for KORA₉₇₇, KORA₅₃₂, Sorbs₉₇₇, and Sorbs₅₃₂. Results for SNP effects with explained variances of 2% or 5% can be found in Figure 6. Since the results are virtually identical for KORA and Sorbs, we present the quartiles of the power distribution in Table 3 for p-value thresholds of 1 × 10^-5 and 1 × 10^-7.

Table 3 Quartiles of power distribution assuming uncorrelated phenotypes

Comparison of power assuming correlated phenotypes

In Table 4 we present the power estimates assuming a heritability of 100%, which results in the greatest differences compared to Table 3. However, except for Sorbs₉₇₇, there are only very small differences between Tables 3 and 4, and even for Sorbs₉₇₇ the differences do not appear to be substantial. For an explained variance of 2%, the power in Sorbs₉₇₇ increases, but it decreases for an explained variance of 5%. This is due to dependence on the significance threshold. Independent of the explained variance of the SNPs, the power under maximum heritability (100%) is greater than under minimal heritability for small p-value thresholds. For large p-value thresholds, the opposite is true (see Additional file 3).

Table 4 Quartiles of power distribution assuming correlated phenotypes

The explanation for this behaviour is the inflation of the variance of the β-estimator caused by high levels of relatedness in the Sorbs₉₇₇ sample (see Additional file 4).

Results for other degrees of heritability are presented in Additional file 5. As expected, in the case of minimal heritability the results of our simulations under the mixed model coincide with the results obtained with the analytical formula used in the previous section.

Discussion

The Sorbs, resident in Lusatia, Germany, are an ethnic minority of Slavonic origin. Using genome-wide SNP array techniques, we aimed to compare this putatively isolated population with a German mixed population (KORA study) by various population genetic means. The Sorbs were compared recently with other European populations or isolates on the basis of a limited set of genetic markers and a limited set of unrelated individuals [1, 52]. In the present analysis, we studied the Sorbs from the perspective of ongoing genome-wide association studies. That is, we compared the population with a German mixed population on the basis of complete sets of genotyped individuals, and a large number of genotyped SNPs. We also aimed to separate the effect of isolation from potential effects caused by over-sampling of relatives in the Sorbs. Finally, we studied the implications of observed differences between KORA and Sorbs for the analysis, and especially, the power of genome-wide association studies.

Genotype data from a sample of 977 Sorbs were available from genotyping with 500 k and 1000 k Affymetrix SNP chips. While SNP markers come with certain drawbacks (ascertainment bias, need for careful QC), they have proven useful for detecting subtle population structures.

For comparison with a German mixed population, we used the KORA F3 sample (N = 1644) and corresponding genotypes from 500 k Affymetrix SNP chips. Observed differences between regions of Germany are typically an order of magnitude lower than differences observed between Sorbs and KORA [53]. Publicly available European-American HapMap samples were also included in the analysis.

A major goal of our study was to distinguish effects of genetic isolation from simple over-sampling of families in the Sorbs. Since most of the population genetic measures used to compare populations assume independence of individuals, over-sampling of families in certain samples may introduce a source of bias which is difficult to control. Indeed, we discovered a large number of closely related individuals within the Sorbs sample. Therefore, we repeated all analyses for a sub-group of Sorbs from which all relationships with relatedness estimates greater than 0.2 were removed. This does not completely resolve the problem of increased relatedness within the Sorbs sample but indicates the direction of potential biases introduced by over-sampling of families. Indeed, such biases could be detected in our data, but they are not substantial, at least for the population genetic measures studied.

Since relatedness cannot be completely removed from the samples, a cut-off of 0.2 for the relatedness estimate seems to be feasible to study the effect of relatedness and to keep the sample size at an acceptable level. We also studied a cut-off of 0.1 reducing the sample size to N = 414. Results can be found in Additional file 6. Although tending slightly towards zero, results are essentially the same as those obtained for the cut-off of 0.2.

For some analyses such as determination of rare SNPs and LD it is known that sample size can introduce bias [39, 44, 54]. Therefore, for most comparisons we used randomly drawn subsamples of KORA which are of the same size as the Sorbs samples.

PCA is a proven means to detect even very small genetic differences between populations with high power. For European populations, it was demonstrated that the first two appropriately scaled principal components can map individuals to their geographic origin on the European continent with high precision, when all four grandparents are from the same location [14]. Our PCA results showed clear distances between KORA, Sorbs, and individuals from Tuscany. Using individuals from KORA and Tuscany to roughly orient the PCA graph on a map of Europe, Sorbs are positioned towards the East. KORA individuals are very close to the CEU HapMap population, while the distance to Tuscan/TSI individuals is much larger.

We conclude that the Slavonic origin of the Sorbs is still clearly genetically detectable. The analysis revealed that there is a west to east sequence of the clusters of KORA individuals born in Germany, KORA individuals born in Poland or Czech Republic, Half-Sorbs, and finally, Full-Sorbs. Although birthplace is not a stringent indicator of ethnicity, it is a commonly used surrogate in genetic epidemiologic studies if more detailed information cannot be ascertained. On the other hand, most of the KORA individuals born in Poland or Czech Republic are descendants of German minorities in these countries. Hence, on the basis of our data we cannot conclude that the Sorbs are genetically more distant from Germany than a random sample from Poland or Czech Republic. Half-Sorbs can be assumed to be closer to the German population than Full-Sorbs due to mating with German neighbours. This is clearly reflected by the localization of Half-Sorbs between KORA individuals and Full-Sorbs. There is a trend that the Sorbs are closer to the KORA individuals born in Poland than to the KORA individuals born in Czech Republic, which is in agreement with a recently stated hypothesis that the Sorbs are genetically closer to Poles than to Czechs [1].

Since it has been suggested that genetic diversity is lower in isolated populations [6], we analysed the number of rare SNPs. Indeed, we found a higher number of rare SNPs in the Sorbs sample compared to the KORA sample. Although significant, the difference is small in size.

The F_ST statistics between KORA and Sorbs were an order of magnitude higher than usually observed between different regions of Germany [53]. Thus, the variance between KORA and Sorbs is much higher than expected for different regions in Germany. Surprisingly, the F_IS statistic was positive for KORA but negative for Sorbs. Such a phenomenon has also been observed for other isolated populations, suggesting that there may be signs of a recent breakdown of isolation in the Sorbs [44]. Another indicator of this breakdown is the relatively high number of Half-Sorbs (N = 160) in the present sample, i.e. subjects who report fewer than four Sorbian grandparents. Note that the F_IS statistic is a population-based measure, in contrast to the individual-based measure of inbreeding studied in [1].
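To make the sign of F_IS concrete, the following minimal sketch computes a single-locus value as 1 - H_obs / H_exp from genotype counts. It is a textbook moment estimate, not necessarily the estimator used for the results above, and the genotype counts are invented for illustration.

```python
def f_is(n_aa, n_ab, n_bb):
    """Single-locus F_IS = 1 - H_obs / H_exp from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    h_exp = 2 * p * (1 - p)           # heterozygosity expected under Hardy-Weinberg
    if h_exp == 0:
        raise ValueError("locus is monomorphic; F_IS is undefined")
    h_obs = n_ab / n                  # observed heterozygosity
    return 1 - h_obs / h_exp

# Invented counts (AA, AB, BB) with allele frequency 0.5 in both cases.
deficit = f_is(30, 40, 30)   # fewer heterozygotes than expected: positive F_IS
excess = f_is(20, 60, 20)    # more heterozygotes than expected: negative F_IS
```

A positive value indicates a heterozygote deficit relative to Hardy-Weinberg expectation, a negative value an excess.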

ROH analysis was proposed to detect signs of isolation by estimation of inbreeding [18]. Despite the simplicity of this concept, the calculation of ROH depends on many parameter settings, such as SNP density or the allowed numbers of missing genotypes or heterozygous markers, which heavily influence the results. Parameter settings are extensively discussed in McQuillan et al. [18]. For our analysis, we used the default settings of PLINK except for two parameters: the minimum length of homozygous segments was 500 kb (PLINK default is 1000 kb), and a homozygous segment is split if two neighbouring SNPs are more than 100 kb apart (PLINK default is 1000 kb). Hence, we used the same settings as McQuillan et al. except for the minimum number of contiguous homozygous SNPs constituting a ROH, for which we kept the PLINK default (N = 100). The results of ROH analysis also depend on the allelic frequencies of the populations and on the SNP selections used by different genotyping technologies. Since McQuillan et al. [18] used a different genotyping platform (Illumina Infinium HumanHap300v2), the latter modification was necessary to obtain similar results.
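The role of the three thresholds named above (minimum segment length, maximum gap between neighbouring SNPs, minimum SNP count) can be illustrated with a deliberately simplified scanner. It is not PLINK's algorithm: PLINK uses a sliding window and tolerates a limited number of heterozygous or missing calls, whereas this sketch ends a run at any such call. The function name and the genotype coding are assumptions made for the example.

```python
def find_roh(positions_bp, genotypes, min_kb=500, max_gap_kb=100, min_snps=100):
    """Return (start_bp, end_bp, n_snps) for each run of homozygosity.

    genotypes: 0 or 2 = homozygous, 1 = heterozygous, None = missing.
    Simplification: any heterozygous or missing call ends the current run.
    """
    runs, current = [], []

    def close_run():
        # Keep the run only if it meets both the SNP-count and length thresholds.
        if len(current) >= min_snps:
            start, end = current[0], current[-1]
            if (end - start) / 1000 >= min_kb:
                runs.append((start, end, len(current)))
        current.clear()

    for pos, g in zip(positions_bp, genotypes):
        if g not in (0, 2):
            close_run()
            continue
        if current and (pos - current[-1]) / 1000 > max_gap_kb:
            close_run()  # neighbouring SNPs too far apart: split the segment
        current.append(pos)
    close_run()
    return runs
```

With SNPs spaced 10 kb apart, for example, a stretch of 110 consecutive homozygous calls spans 1090 kb and is reported, while a stretch of 9 calls is discarded by the SNP-count threshold.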

We found that Sorbs are enriched for ROHs of intermediate length (between 2.5 Mb and 5 Mb) compared to KORA, CEU, and TSI. This effect is much less pronounced for longer ROHs. Accordingly, the coverage of the genome by ROHs is higher in the Sorbian population. Following the argument of McQuillan et al., we conclude that there is a lack of recent parental relatedness in the Sorbs (no differences for long-range ROHs) but that there are signs of ancient parental relatedness or of autozygous segments from older pedigree structures (differences for ROHs of intermediate range). The lack of direct parental relatedness is in accordance with our estimates of F_IS.

Furthermore, we compared the LD structure of chromosome 22 between the KORA and the Sorbs population. We used the newly proposed LD measure η₁ for the comparison of KORA and Sorbs. In contrast to the more popular measures r and D', the measure η₁ is independent of allelic frequencies [42]. In our opinion, this property is desirable when comparing LD structure between populations of potentially differing allelic frequencies. However, the results obtained by the three measures are very similar (data not shown).
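As a small illustration of how the conventional measures react to allelic frequencies, the sketch below computes the squared correlation r² and |D'| from haplotype and allele frequencies using their standard definitions. The measure η₁ is not shown, since its definition is given in [42] rather than in this section; the frequencies are invented.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (r2, abs_d_prime) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b                       # raw disequilibrium coefficient D
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max

# Allele B (frequency 0.1) occurs only on haplotypes carrying allele A (frequency 0.5).
r2_unequal, dprime_unequal = ld_measures(0.1, 0.5, 0.1)
# Same complete association, but both alleles have frequency 0.5.
r2_equal, dprime_equal = ld_measures(0.5, 0.5, 0.5)
```

In both invented cases |D'| equals 1, but r² reaches 1 only when the allele frequencies match; with frequencies 0.5 and 0.1 it is 1/9. This is the kind of frequency dependence that complicates comparisons between populations.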

An expected small upward bias caused by smaller sample size in KORA₅₃₂ compared to KORA₉₇₇ could be clearly detected. In contrast, the results for Sorbs₉₇₇ and Sorbs₅₃₂ are virtually identical. We conclude that the expected upward bias of the reduced Sorbs₅₃₂ sample is nullified by the elimination of relationships. This interpretation is supported by the fact that a random sample of N = 532 individuals from Sorbs₉₇₇ resulted in the same sample size bias as observed for KORA (data not shown). That is, LD is upwardly biased by the relatedness structure in the Sorbs. Nevertheless, even if relationships are eliminated to a reasonable degree (first and second degree relationships), Sorbs show generally higher LD at longer distances than is observed in KORA. It has already been shown in the literature that LD excess at longer ranges is a characteristic of isolated populations [5, 9–11]. However, the effect is moderate in size, which also agrees with findings in several other populations considered as isolated [44, 55–57].

Since LD structure directly influences the coverage of a SNP technology, and with it, the power of genome-wide association studies, we performed power analyses in the Sorbs and KORA samples. For this purpose, we defined a fixed genetic effect of an arbitrary SNP at chromosome 22. Explained variance was used as a measure of effect in order to adjust for differences in allelic frequencies. For this SNP, we analysed the best proxy SNP available on chromosome 22 in order to mimic a situation in which an unobserved causative variant is detected via a marker in LD. We derived an analytical formula for our model for the case of negligible heritability for which individuals can be considered as independent. This formula also applies to situations where correction for relatedness effects has been performed, for instance with a GRAMMAR approach [17]. Power was calculated for all SNPs on chromosome 22 and the resulting distribution was compared between the Sorbs and KORA samples with and without relatives. No differences regarding power were detected. We conclude that there is no gain in power due to higher LD in the Sorbs.
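The dependence of power on LD between a causal variant and its best proxy can be sketched with a standard large-sample approximation. This is not the analytical formula derived by the authors. It assumes independent individuals, a two-sided test of the regression slope, and that a proxy in LD r² with the causal variant captures the fraction r² of its explained variance. The function name is hypothetical; the example values (N = 977, explained variance 2%, threshold 10^-5) are taken from the description of Additional file 5.

```python
from math import sqrt
from statistics import NormalDist

def proxy_power(n, explained_var, r2_proxy, alpha):
    """Approximate power to detect a causal SNP through a proxy marker.

    n: number of independent individuals
    explained_var: fraction of trait variance explained by the causal SNP
    r2_proxy: squared correlation (LD) between proxy and causal SNP
    alpha: two-sided significance threshold
    """
    q = explained_var * r2_proxy          # variance explained by the proxy
    if not 0 <= q < 1:
        raise ValueError("explained variance of the proxy must lie in [0, 1)")
    ncp = sqrt(n * q / (1 - q))           # mean of the approximate z-statistic
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    return (1 - std.cdf(z_crit - ncp)) + std.cdf(-z_crit - ncp)

power_perfect_proxy = proxy_power(977, 0.02, 1.0, 1e-5)
power_weak_proxy = proxy_power(977, 0.02, 0.5, 1e-5)
```

Under these assumptions, power falls as the LD between proxy and causal variant decreases, which is why differences in LD structure between samples could in principle translate into differences in power.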

Since relatedness structure is often neglected in genetic association studies, we also analysed the influence of the relatedness structure present in our samples on the power of an uncorrected analysis. This analysis was done via simulations of a linear mixed model comprising a fixed effect of a SNP and random polygenic and non-genetic effects. We showed that the variance of the β-estimator is inflated under relatedness and high heritability. This results in a gain in power for higher p-value thresholds and a loss of power for lower p-value thresholds in the Sorbs₉₇₇, irrespective of the size of the genetic effect considered. The explanation is that normal distributions with different variances overlap.
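The variance inflation can be made explicit with a toy calculation instead of a simulation. For an ordinary least squares slope with centred genotypes x and phenotypes whose correlation matrix is R, the true variance is proportional to x'Rx / (x'x)², whereas an analysis assuming independence uses 1 / (x'x); the ratio x'Rx / x'x is the inflation factor. The example below is a sketch with invented data: two sib pairs that share their genotype, with a phenotypic correlation of 0.5 within pairs, as would be expected for full sibs under a purely additive polygenic model with 100% heritability.

```python
def variance_inflation(x, corr):
    """Ratio of the true to the naive variance of the OLS slope estimate.

    x: genotype codes (e.g. 0, 1, 2), one per individual
    corr: phenotypic correlation matrix between individuals (list of lists)
    """
    n = len(x)
    mean_x = sum(x) / n
    xc = [xi - mean_x for xi in x]                 # centred genotypes
    xtx = sum(v * v for v in xc)
    if xtx == 0:
        raise ValueError("genotype vector has no variation")
    xrx = sum(xc[i] * corr[i][j] * xc[j] for i in range(n) for j in range(n))
    return xrx / xtx

genotypes = [2, 2, 0, 0]          # individuals 1-2 and 3-4 are sib pairs
sib_corr = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

inflation_related = variance_inflation(genotypes, sib_corr)
inflation_unrelated = variance_inflation(genotypes, identity)
```

In this toy case the variance is inflated by a factor of 1.5 when relatedness is ignored, and the factor is 1 for uncorrelated phenotypes.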

We conclude that relatedness in the Sorbs₉₇₇ sample influences the power of uncorrected genetic association studies. Influence of relatedness on power is highest under maximum heritability of the phenotype. However, directions of power differences depend on the size of the genetic effect in combination with the significance threshold chosen.

In our simulations we did not observe a scenario resulting in a clear power benefit in the Sorbs₉₇₇ sample. However, this does not rule out that there might be a higher power in the Sorbs due to increased effect sizes caused, e.g., by higher environmental homogeneity or lower number of causative variants [7, 8].

Conclusions

We showed that there are signs of genetic isolation within the Sorbs which cannot be explained by over-sampling of relatives. The effects are moderate in size. The Slavonic origin of the Sorbs is still genetically detectable. Although LD is higher in the Sorbs, the difference from KORA is small. Power analysis showed that a clear advantage of the Sorbs for genome-wide association studies with respect to coverage cannot be expected.

The significant amount of cryptic relatedness in the Sorbs sample results in inflated variances of β-estimators which should be considered in genetic association analyses.

References

1. Veeramah KR, Tonjes A, Kovacs P, Gross A, Wegmann D, Geary P, Gasperikova D, Klimes I, Scholz M, Novembre J: Genetic variation in the Sorbs of eastern Germany in the context of broader European genetic diversity. European Journal of Human Genetics. 2011.
2. Abbott A: Manhattan versus Reykjavik. Nature. 2000, 406 (6794): 340-342. 10.1038/35019167.
3. Eaves IA, Merriman TR, Barber RA, Nutland S, Tuomilehto-Wolf E, Tuomilehto J, Cucca F, Todd JA: The genetically isolated populations of Finland and Sardinia may not be a panacea for linkage disequilibrium mapping of common disease genes. Nat Genet. 2000, 25 (3): 320-323. 10.1038/77091.
4. Taillon-Miller P, Bauer-Sardina I, Saccone NL, Putzel J, Laitinen T, Cao A, Kere J, Pilia G, Rice JP, Kwok PY: Juxtaposed regions of extensive and minimal linkage disequilibrium in human Xq25 and Xq28. Nat Genet. 2000, 25 (3): 324-328. 10.1038/77100.
5. Shifman S, Darvasi A: The value of isolated populations. Nat Genet. 2001, 28 (4): 309-310. 10.1038/91060.
6. Kristiansson K, Naukkarinen J, Peltonen L: Isolated populations and complex disease gene identification. Genome Biol. 2008, 9 (8): 109. 10.1186/gb-2008-9-8-109.
7. Sheffield VC, Stone EM, Carmi R: Use of isolated inbred human populations for identification of disease genes. Trends Genet. 1998, 14 (10): 391-396. 10.1016/S0168-9525(98)01556-X.
8. Arcos-Burgos M, Muenke M: Genetics of population isolates. Clin Genet. 2002, 61 (4): 233-247. 10.1034/j.1399-0004.2002.610401.x.
9. Tenesa A, Wright AF, Knott SA, Carothers AD, Hayward C, Angius A, Persico I, Maestrale G, Hastie ND, Pirastu M: Extent of linkage disequilibrium in a Sardinian sub-isolate: sampling and methodological considerations. Hum Mol Genet. 2004, 13 (1): 25-33.
10. Service S, DeYoung J, Karayiorgou M, Roos JL, Pretorious H, Bedoya G, Ospina J, Ruiz-Linares A, Macedo A, Palha JA: Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nat Genet. 2006, 38 (5): 556-560. 10.1038/ng1770.
11. Angius A, Hyland FC, Persico I, Pirastu N, Woodage T, Pirastu M, De la Vega FM: Patterns of linkage disequilibrium between SNPs in a Sardinian population isolate and the selection of markers for association studies. Hum Hered. 2008, 65 (1): 9-22. 10.1159/000106058.
12. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW: Genetic structure of human populations. Science. 2002, 298 (5602): 2381-2385. 10.1126/science.1078311.
13. Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R: Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008, 451 (7181): 998-1003. 10.1038/nature06742.
14. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR: Genes mirror geography within Europe. Nature. 2008, 456 (7218): 98-101. 10.1038/nature07331.
15. Lopez Herraez D, Bauchet M, Tang K, Theunert C, Pugach I, Li J, Nandineni MR, Gross A, Scholz M, Stoneking M: Genetic variation and recent positive selection in worldwide human populations: evidence from nearly 1 million SNPs. PLoS One. 2009, 4 (11): e7888. 10.1371/journal.pone.0007888.
16. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38 (8): 904-909. 10.1038/ng1847.
17. Amin N, van Duijn CM, Aulchenko YS: A genomic background based method for association analysis in related individuals. PLoS One. 2007, 2 (12): e1274. 10.1371/journal.pone.0001274.
18. McQuillan R, Leutenegger AL, Abdel-Rahman R, Franklin CS, Pericic M, Barac-Lauc L, Smolej-Narancic N, Janicijevic B, Polasek O, Tenesa A: Runs of homozygosity in European populations. Am J Hum Genet. 2008, 83 (3): 359-372. 10.1016/j.ajhg.2008.08.007.
19. Peltonen L, Jalanko A, Varilo T: Molecular genetics of the Finnish disease heritage. Hum Mol Genet. 1999, 8 (10): 1913-1923. 10.1093/hmg/8.10.1913.
20. Peltonen L: Positional cloning of disease genes: advantages of genetic isolates. Hum Hered. 2000, 50 (1): 66-75. 10.1159/000022892.
21. Weir BS: Genetic Data Analysis II. 1996, Sunderland, MA: Sinauer Associates, Inc.
22. Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2 (12): e190. 10.1371/journal.pgen.0020190.
23. Choi Y, Wijsman EM, Weir BS: Case-control association testing in the presence of unknown relationships. Genet Epidemiol. 2009, 33 (8): 668-678. 10.1002/gepi.20418.
24. Zhang F, Deng HW: Correcting for cryptic relatedness in population-based association studies of continuous traits. Hum Hered. 2010, 69 (1): 28-33. 10.1159/000243151.
25. Thornton T, McPeek MS: ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. Am J Hum Genet. 2010, 86 (2): 172-184. 10.1016/j.ajhg.2010.01.001.
26. Krawczak M, Lu TT, Willuweit S, Roewer L: Genetic diversity in the German population. Handbook of Human Molecular Evolution. 2008, John Wiley & Sons.
27. Kottgen A, Pattaro C, Boger CA, Fuchsberger C, Olden M, Glazer NL, Parsa A, Gao X, Yang Q, Smith AV: New loci associated with kidney function and chronic kidney disease. Nat Genet. 2010.
28. Tonjes A, Koriath M, Schleinitz D, Dietrich K, Bottcher Y, Rayner NW, Almgren P, Enigk B, Richter O, Rohm S: Genetic variation in GPR133 is associated with height: genome wide association study in the self-contained population of Sorbs. Hum Mol Genet. 2009, 18 (23): 4662-4668. 10.1093/hmg/ddp423.
29. Wichmann HE, Gieger C, Illig T: KORA-gen--resource for population genetics, controls and a broad spectrum of disease phenotypes. Gesundheitswesen. 2005, 67 (Suppl 1): S26-30.
30. Tonjes A, Zeggini E, Kovacs P, Bottcher Y, Schleinitz D, Dietrich K, Morris AP, Enigk B, Rayner NW, Koriath M: Association of FTO variants with BMI and fat mass in the self-contained population of Sorbs in Germany. Eur J Hum Genet. 2010, 18 (1): 104-110. 10.1038/ejhg.2009.107.
31. Holle R, Happich M, Lowel H, Wichmann HE: KORA--a research platform for population based health research. Gesundheitswesen. 2005, 67 (Suppl 1): S19-25.
32. Doring A, Gieger C, Mehta D, Gohlke H, Prokisch H, Coassin S, Fischer G, Henke K, Klopp N, Kronenberg F: SLC2A9 influences uric acid concentrations with pronounced sex-specific effects. Nat Genet. 2008, 40 (4): 430-436. 10.1038/ng.107.
33. Pemberton TJ, Wang C, Li JZ, Rosenberg NA: Inference of unexpected genetic relatedness among individuals in HapMap Phase III. Am J Hum Genet. 2010, 87 (4): 457-464. 10.1016/j.ajhg.2010.08.014.
34. Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Annu Rev Genomics Hum Genet. 2009, 10: 387-406. 10.1146/annurev.genom.9.081307.164242.
35. Troendle JF, Yu KF: A note on testing the Hardy-Weinberg law across strata. Ann Hum Genet. 1994, 58 (Pt 4): 397-402.
36. Wang J: An estimator for pairwise relatedness using molecular markers. Genetics. 2002, 160 (3): 1203-1215.
37. Lao O, Lu TT, Nothnagel M, Junge O, Freitag-Wolf S, Caliebe A, Balascakova M, Bertranpetit J, Bindoff LA, Comas D: Correlation between genetic and geographic structure in Europe. Curr Biol. 2008, 18 (16): 1241-1248. 10.1016/j.cub.2008.07.049.
38. McVean G: A genealogical interpretation of principal components analysis. PLoS Genet. 2009, 5 (10): e1000686. 10.1371/journal.pgen.1000686.
39. Scholz M, Hasenclever D: Comparison of Estimators for Measures of Linkage Disequilibrium. The International Journal of Biostatistics. 2010, 6 (1).
40. Hill WG, Robertson A: Linkage Disequilibrium in Finite Populations. Theoretical and Applied Genetics. 1968, 38: 226-231. 10.1007/BF01245622.
41. Lewontin RC: The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics. 1964, 49 (1): 49-67.
42. A Canonical Measure of Allelic Association. [http://arxiv.org/PS_cache/arxiv/pdf/0903/0903.3886v1.pdf]
43. Edwards AWF: The Measure of Association in a 2 × 2 Table. Journal of the Royal Statistical Society, Series A. 1963, 126: 108-114.
44. Olshen AB, Gold B, Lohmueller KE, Struewing JP, Satagopan J, Stefanov SA, Eskin E, Kirchhoff T, Lautenberger JA, Klein RJ: Analysis of genetic variation in Ashkenazi Jews by high density SNP genotyping. BMC Genet. 2008, 9: 14.
45. Cleveland WS: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association. 1979, 74: 829-836. 10.2307/2286407.
46. International HapMap Project. [http://hapmap.ncbi.nlm.nih.gov/]
47. EIGENSOFT Package. [http://genepath.med.harvard.edu/~reich/Software.htm]
48. PLINK Package. [http://pngu.mgh.harvard.edu/purcell/plink/]
49. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81 (3): 559-575. 10.1086/519795.
50. The R Project. [http://www.r-project.org/]
51. R: A Language and Environment for Statistical Computing. [http://www.R-project.org]
52. Rodig H, Grum M, Grimmecke HD: Population study and evaluation of 20 Y-chromosome STR loci in Germans. Int J Legal Med. 2007, 121 (1): 24-27.
53. Steffens M, Lamina C, Illig T, Bettecken T, Vogler R, Entz P, Suk EK, Toliat MR, Klopp N, Caliebe A: SNP-based analysis of genetic substructure in the German population. Hum Hered. 2006, 62 (1): 20-29. 10.1159/000095850.
54. Chen Y, Lin CHL, Sabatti C: Volume Measures for Linkage Disequilibrium. BMC Genetics. 2006, 7 (54).
55. Kruglyak L: Genetic isolates: separate but equal?. Proc Natl Acad Sci USA. 1999, 96 (4): 1170-1172. 10.1073/pnas.96.4.1170.
56. Shifman S, Kuypers J, Kokoris M, Yakir B, Darvasi A: Linkage disequilibrium patterns of the human genome across populations. Hum Mol Genet. 2003, 12 (7): 771-776. 10.1093/hmg/ddg088.
57. Bosch E, Laayouni H, Morcillo-Suarez C, Casals F, Moreno-Estrada A, Ferrer-Admetlla A, Gardner M, Rosa A, Navarro A, Comas D: Decay of linkage disequilibrium within genes across HGDP-CEPH human samples: most population isolates do not show increased LD. BMC Genomics. 2009, 10: 338. 10.1186/1471-2164-10-338.

Acknowledgements

We thank Knut Krohn and Beate Enigk for conducting microarray experiments of the Sorbs sample at the IZKF Leipzig at the Faculty of Medicine of the University of Leipzig (Projekt Z03).

We gratefully acknowledge the contributions of P. Lichtner, G. Eckstein, Guido Fischer, T. Strom and all other members of the Helmholtz Centre Munich genotyping staff in generating the SNP dataset as well as the contribution of all members of field staffs who were involved in the planning and conduct of the MONICA/KORA Augsburg studies. The KORA group consists of H.E. Wichmann (speaker), A. Peters, C. Meisinger, T. Illig, R. Holle, J. John and their co-workers who are responsible for the design and conduct of the KORA studies.

We thank Maelle Salmon for helping with data quality control. We thank Karsten Krug and Lars Thielecke for their technical assistance.

Finally, we express our appreciation to all participants of the Sorb and the KORA study for donating their blood and time.

Funding

The KORA research platform (KORA: Cooperative Research in the Region of Augsburg) and the MONICA Augsburg studies (Monitoring trends and determinants on cardiovascular diseases) were initiated and financed by the Helmholtz Zentrum München-National Research Center for Environmental Health, which is funded by the German Federal Ministry of Education, Science, Research and Technology and by the State of Bavaria. Part of this work was financed by the German National Genome Research Network (NGFN). Our research was supported within the Munich Center of Health Sciences (MC Health) as part of LMUinnovativ. AT, PK and MStu received financial support from the German Research Council (KFO-152), IZKF (B27), and the German Diabetes Association. MSto is funded by the Max Planck Society. AG and PA are funded by the German Federal Ministry for Education and Research (01KN0702). AG, PA, NRR, and MSch were funded by the Leipzig Interdisciplinary Research Cluster of Genetic Factors, Clinical Phenotypes, and Environment (LIFE Center, Universität Leipzig). LIFE is funded by means of the European Union, by the European Regional Development Fund (ERDF), the European Social Fund (ESF), and by means of the Free State of Saxony within the framework of its excellence initiative.

Author information

Authors and Affiliations

Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Haertelstrasse 16-18, 04107, Leipzig, Germany
Arnd Gross, Peter Ahnert, Nab R Roshyara, Markus Loeffler & Markus Scholz
LIFE Center (Leipzig Interdisciplinary Research Cluster of Genetic Factors, Phenotypes and Environment), University of Leipzig, Philipp-Rosenthal Strasse 27, 04103, Leipzig, Germany
Arnd Gross, Peter Ahnert, Nab R Roshyara, Markus Loeffler & Markus Scholz
Department of Medicine, University of Leipzig, Liebigstrasse 18, 04103, Leipzig, Germany
Anke Tönjes & Michael Stumvoll
IFB Adiposity Diseases, University of Leipzig, Stephanstrasse 9c, 04103, Leipzig, Germany
Anke Tönjes & Michael Stumvoll
Interdisciplinary Center for Clinical Research, University of Leipzig, Liebigstrasse 21, 04103, Leipzig, Germany
Peter Kovacs
Dept Eco & Evo Biol, Interdepartmental Program in Bioinformatics, University of California, 621 Charles E. Young Dr South, Box 951606, Los Angeles, Los Angeles, CA, 90095-1606, USA
Krishna R Veeramah & John Novembre
Center for Society and Genetics, University of California, 1323 Rolfe Hall, Box 957221, Los Angeles, Los Angeles, CA, 90095-7221, USA
Krishna R Veeramah
Dept of History, University of California, 6265 Bunche Hall, Box 951473, Los Angeles, Los Angeles, CA, 90095-1473, USA
Krishna R Veeramah
Helmholtz Centre Munich, German Research Center for Environmental Health, Institute of Epidemiology, Ingolstaedter Landstraße 1, 85764, Neuherberg, Germany
Christian Gieger, Ina-Maria Rueckert & Heinz-Erich Wichmann
Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103, Leipzig, Germany
Mark Stoneking
Institute of Medical Informatics, Biometry and Epidemiology, Chair of Epidemiology, Ludwig-Maximilians-University, Marchioninistraße 15, 81377, Munich, Germany
Heinz-Erich Wichmann
Klinikum Grosshadern, Ludwig Maximilians University, Marchioninistraße 15, 81377, Munich, Germany
Heinz-Erich Wichmann

Authors

Arnd Gross
Anke Tönjes
Peter Kovacs
Krishna R Veeramah
Peter Ahnert
Nab R Roshyara
Christian Gieger
Ina-Maria Rueckert
Markus Loeffler
Mark Stoneking
Heinz-Erich Wichmann
John Novembre
Michael Stumvoll
Markus Scholz

Corresponding author

Correspondence to Markus Scholz.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Design of the Study: MSch. Design of the Sorbs study and data collection: AT, PK, MStu. Design of the KORA data collection: CG, IR, HW. Data analysis: AG, NRR, MSch. Writing: AG, MSch. Contribution to writing and discussion: KRV, PA, ML, MSto, AT, PK, MStu, JN.

All authors read and approved the final manuscript.

Electronic supplementary material

12863_2011_928_MOESM1_ESM.PDF

Additional file 1: Workflow of data pre-processing. The workflow of data pre-processing is presented. We start with the autosomal SNP data of four different populations (KORA, Sorbs, HapMap CEU, HapMap TSI). Numbers of remaining markers at each step of pre-processing are presented in bold. (PDF 14 KB)

Additional file 2: Derivation of the formula for . (PDF 55 KB)

12863_2011_928_MOESM3_ESM.PDF

Additional file 3: Comparisons of power for Sorbs₉₇₇ for minimal and maximal heritability of phenotypes. Simulation results of the power for minimal () and maximal (100%) heritability. For the minimal heritability, we present the results of our analytical formula. The values presented in Tables 3 and 4 are displayed in bold. (PDF 7 KB)

12863_2011_928_MOESM4_ESM.PDF

Additional file 4: Variance inflation under relatedness. Comparison of the theoretical variance of the β₁-estimator assuming uncorrelated phenotypes (analytical formula ) with the averaged variances over all SNPs of chromosome 22 under a heritability of 100% assuming correlated phenotypes. The standard error of this estimate and the inflation factor are also provided. Sorbs₉₇₇ are presented in bold due to high inflation of variances of β₁-estimates. (PDF 12 KB)

12863_2011_928_MOESM5_ESM.PDF

Additional file 5: Simulation results for power under assumption of correlated phenotypes. Heritability was modified between and 100%. Explained variances of the SNP are 2% or 5% with corresponding p-value thresholds of 10^-5 and 10^-7, respectively. All simulations were performed for KORA₉₇₇, Sorbs₉₇₇, KORA₅₃₂, and Sorbs₅₃₂. Power distribution is derived using the results of all SNPs of chromosome 22. (PDF 10 KB)

12863_2011_928_MOESM6_ESM.PDF

Additional file 6: Additional inbreeding and co-ancestry coefficients. Estimates and standard errors (SE) of inbreeding coefficients F_IS and co-ancestry coefficients F_ST for KORA and Sorbs and different levels of relatedness: without filtering for relatedness (KORA₉₇₇, Sorbs₉₇₇), filtering for relatedness > 0.2 (KORA₅₃₂, Sorbs₅₃₂), filtering for relatedness > 0.1 (KORA₄₁₄, Sorbs₄₁₄). Indices refer to resulting numbers of cases. (PDF 8 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Gross, A., Tönjes, A., Kovacs, P. et al. Population-genetic comparison of the Sorbian isolate population in Germany with the German KORA population using genome-wide SNP arrays. BMC Genet 12, 67 (2011). https://doi.org/10.1186/1471-2156-12-67

Received: 14 March 2011
Accepted: 28 July 2011
Published: 28 July 2011
DOI: https://doi.org/10.1186/1471-2156-12-67
