Identification of population substructure among Jews using STR markers and dependence on reference populations included

Background Detecting population substructure is a critical issue for association studies of health behaviors and other traits. Whether inherent in the population or an artifact of marker choice, determining aspects of a population's genetic history as potential sources of substructure can aid in design of future genetic studies. Jewish populations, among which association studies are often conducted, have a known history of migrations. As a necessary step in understanding population structure to conduct valid association studies of health behaviors among Israeli Jews, we investigated genetic signatures of this history and quantified substructure to facilitate future investigations of these phenotypes in this population. Results Using 32 autosomal STR markers and the program STRUCTURE, we differentiated between Ashkenazi (AJ, N = 135) and non-Ashkenazi (NAJ, N = 226) Jewish populations in the form of Northern and Southern geographic genetic components (AJ north 73%, south 23%, NAJ north 33%, south 60%). The ability to detect substructure within these closely related populations using a small STR panel was contingent on including additional samples representing major continental populations in the analyses. Conclusions Although clustering programs such as STRUCTURE are designed to assign proportions of ancestry to individuals without reference population information, when Jewish samples were analyzed in the absence of proxy parental populations, substructure within Jews was not detected. Generally, for samples with a given grandparental country of birth, STRUCTURE assignment values to Northern, Southern, African and Asian clusters agreed with mitochondrial DNA and Y-chromosomal data from previous studies as well as historical records of migration and intermarriage.


Background
The genetics of Jewish populations, particularly that of Ashkenazi Jews, has been studied extensively to answer questions of human evolutionary, historical, and medical significance [1][2][3][4][5][6][7][8][9][10][11]. Human evolutionary or anthropological studies have typically focused on mitochondrial DNA (mtDNA) or Y-chromosomal data, because the absence of recombination in these regions of the genome allows researchers to infer past human behaviors and evolutionary events such as migrations, founder events, population bottlenecks or expansions, relative male and female contributions to an admixed population, marriage practices, and mode of transmission of languages [12][13][14][15]. However, medical research necessitates the use of autosomal data. The depth of data collection and the necessary characterization of subpopulations to control for population stratification during case-control association studies provide a unique resource to augment mtDNA and Y-chromosomal studies and to facilitate the investigation of selection events. For population groups in which group identification is based on cultural practices rather than geographic origin (such as religion for the Jews or Spanish language for Hispanics), the hazard in neglecting such structure may be particularly great in medical genetics studies [16][17][18][19]. Y-chromosomal and mtDNA studies of Jewish populations and their local host populations have, at times, provided conflicting results, but can be summarized as supporting the following: 1. Almost all Jewish populations are derived from Middle Eastern ancestral populations [3,8,11,20,21,22,23]; 2. Bottleneck events have had an effect on the gene pools of Jewish populations [2,4,5,6,21]; 3. Local female contribution was significant in the establishment of Yemenite, Ethiopian, and Indian Jewish populations [6]; 4. Local male contribution has been less significant for the establishment of most Jewish populations [23], but may have contributed more to Ashkenazi than to non-Ashkenazi populations [3,8,20,22].
Several large-scale studies using autosomal markers demonstrated substructure among European populations, specifically non-Jewish Northern European, non-Jewish Southern European, and Ashkenazi Jews [24][25][26]. Additionally, based on haplotype analysis, recent mtDNA surveys of Ashkenazi and non-Ashkenazi Jewish populations and non-Jewish host populations demonstrated substructure among Jewish populations [6,27]. Although Jewish populations other than Yemenite, Ethiopian, and Indian have not been entirely endogamous, local admixture from host populations, the amount of which varies among populations, has generally occurred at low levels. These historical events may contribute to population structure and stratification that should be taken into consideration in the analysis of data from association studies.
Using thousands of SNPs and principal components analysis (PCA), Seldin et al [25], Price et al [24], and Tian et al [26] found "Northern" and "Southern" components in non-Jewish European populations, which followed a gradient from Northwest Europe to Southeast Europe or North to South, depending on the SNPs used. However, they also reported that both Ashkenazi and Sephardic Jewish samples showed, on average, more than 85% ancestry from the "Southern" component, regardless of grandparental country of birth. They concluded that this reflects a Middle Eastern origin of both Southeast Europeans and Ashkenazi Jews, which both admixed subsequently to varying extents with populations already occupying Europe. A recent study analyzing a large set of autosomal SNPs [10] using PCA demonstrated that not only is it possible to cluster Ashkenazi Jews separately from non-Jewish Europeans but also that the number of Ashkenazi Jewish grandparents determined where a sample fell on the PCA plot relative to non-Jewish Europeans. Recently, using a large number of STRs and several clustering methods, Kopelman et al [28] showed that four Jewish populations (Tunisian, Moroccan, Turkish, and Ashkenazi) clustered together and intermediate to other European and Middle Eastern populations. In all cases, the authors attributed these clustering patterns to the partial and shared Middle Eastern ancestry of Jews.
Middle Eastern ancestry may be a common factor among Jewish populations; however, the majority of Jewish populations have been located outside of the Middle East for up to 2000 years. As is the case with other highly mobile human populations there has been historically documented gene flow between Jewish populations and local host populations. In addition, because these are populations defined, in part, by religion, gene flow into Jewish populations is a product of conversion as well as marriage. Thus, there should be genetic admixture in Jewish subpopulations that reflects, in part, their migratory histories and may contribute to current genetic differences among Jewish populations. It is known that detecting and quantifying recent admixture is dependent on the time since divergence of the putative parental populations as well as the number and information content of markers. Because clustering algorithms are also dependent on the relative differences between populations, the context of a sample in a given analysis (i.e., the extent of its difference from samples of other populations included in the analysis) can affect clustering patterns. This aspect of the process of population substructure detection may be overlooked in case control association studies and may affect results if not taken into consideration. Based on this, we hypothesized that the presence or absence of putative parental populations in a STRUCTURE analysis would affect the ability to detect substructure in Jewish populations and differences between Jewish populations.
To address this question thoroughly prior to conducting association studies of health behaviors among Israeli Jews, we examined population structure in Jewish populations of European, African, Middle Eastern, Central and South Asian origin. We genotyped 526 subjects, recruited in Israel, with 32 genome-wide unlinked microsatellite markers (STRs). To identify potential population structure in the Jewish population being studied, we also genotyped 254 individuals from self-identified Chinese, Thai, Ethiopian Jewish, African American, and European American samples using the same markers. The Jewish populations sampled here are not composed of various percentages of discrete ancestral populations. Our premise is that Jewish populations originated in the Middle East but, subsequent to and in the course of long-range migrations, accumulated input from local host populations, each with its own migratory history. We include in our analysis genotypic data from present-day populations whose ancestry serves as a proxy for those populations that might have contributed once Jewish populations migrated out of the Middle East. Our results are of interest both to infer unknown and correlate with known aspects of Jewish history and for their theoretical implications for detecting substructure in seemingly homogeneous populations. They are also of important applied interest for studying health-related phenotypes in our sample of Israeli Jews. To our knowledge, this is the first study to incorporate proxy parental groups into analysis of structure of a Jewish sample, as well as the first to investigate variation among and ancestry of world-wide Jewish populations with autosomal markers.
Each of the Israeli subjects provided self-reported country of birth, country of birth of parents and grandparents, world region of family origin (not necessarily the same as country of birth of grandparents), whether they considered themselves to be Ashkenazi (as defined by respondents), Sephardic (similarly self-defined), mixed, other or none, and whether they, their parents, and grandparents had been born Jewish (also self-defined). A common practice in the medical and non-medical literatures is to subsume Jews of Spanish, Balkan, Middle Eastern, African, and Asian descent under the term "Sephardic", but since this term implies Spanish origin, it is imprecise and unclear. Further, due to continuous changes in the acceptability and applicability of the term "Sephardic" among Israelis [29,30], medical and genetic studies involving Israeli participants increasingly refer to subjects as either "Ashkenazi" (AJ) or "non-Ashkenazi" (NAJ) [31][32][33][34]. Below, we also follow that nomenclature. This expands on work we first presented in 2008 [35].

Population Differentiation: group affiliation
When a sample has no detectable substructure, the program STRUCTURE 2.2 [36,37] gives each individual nearly equal assignment values to each assumed population, so that every individual appears to be entirely and nearly equally admixed; when this pattern is observed, the result is not meaningful in terms of actual detection of structure [38]. When the mixed Jewish sample was analyzed alone using STRUCTURE, the assignment values for K = 2 through K = 4 yielded no detectable substructure (Fig. 1a). When EA, AA and Asian samples were added to the analysis (with the effect of establishing parental populations for clustering), AJ was assigned to Southern 0.23, Northern 0.73, Asian 0.02, and African 0.01; NAJ was assigned to Southern 0.60, Northern 0.33, Asian 0.03, and African 0.03; and ANAJ was assigned to Southern 0.34, Northern 0.62, Asian 0.02, and African 0.02 (Fig. 1b). These values are average assignment proportions, so that, for example, 0.73 corresponds to 73% Northern ancestry. In this case, the best K for the data based on the StructureSum algorithm [39] was 4. Two-sided two-sample t-tests via Monte Carlo permutation with 10,000 repetitions showed significant differences between AJ and NAJ for individual Northern and Southern assignment values (p < 2.2e-16 in both cases) but not for Asian (p = 0.6387) or African (p = 0.1182) assignment values (table 1).

Population Differentiation: grandparental country of birth
Within the Jewish-Israeli sample, for sets of individuals reporting four grandparents from the same country of birth, we averaged percent ancestry from Northern, Southern, Asian, and African components to evaluate possible geographic influences on ancestry components. We found that, in many cases, evidence of admixture with host populations based on our autosomal data confirmed the results of previous mtDNA or Y-chromosome studies [6,20,22,23,40] (Fig. 2 and table 2). However, the variance for Northern and Southern components within each country of birth was high with the exception of individuals with all four grandparents from Germany or all four grandparents from the Ukraine.

Hardy Weinberg Equilibrium (HWE)
No population showed significant deviation from HWE expectations over all loci. Following application of a Bonferroni correction for multiple testing (requiring a p-value below 0.05/32 = 0.00156 for significance), no p-values for individual loci were significant.
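The Bonferroni threshold quoted above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names and the locus labels in the usage note are hypothetical and are not taken from the study's analysis pipeline.

```python
# Sketch of the Bonferroni adjustment described in the text.
# ALPHA and N_LOCI come from the text; everything else is illustrative.
ALPHA = 0.05   # family-wise significance level
N_LOCI = 32    # number of STR loci tested


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def significant_loci(p_values: dict, alpha: float = ALPHA) -> list:
    """Return loci whose p-value falls below alpha / (number of loci supplied)."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [locus for locus, p in p_values.items() if p < threshold]


threshold = bonferroni_threshold(ALPHA, N_LOCI)
```

Calling `significant_loci` with a dictionary of per-locus p-values (for instance, placeholder labels such as `"locus_A"` mapped to made-up p-values) returns the loci that would remain significant after correction; an empty list corresponds to the outcome reported here.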

Marker Information Content
The tetranucleotide and dinucleotide markers had similar average non-Ashkenazi/Ashkenazi delta values (0.133 and 0.130, respectively). Overall, the delta values for this marker panel (0.131) would not indicate a robust ability to differentiate between these two populations but the results of this study (considering the consistency of the observed ancestry coefficients with known geography and previous studies) show their utility for this purpose nonetheless (table 3).

Discussion
Using 32 autosomal STR markers and the program STRUCTURE, we differentiated between Ashkenazi and non-Ashkenazi Jewish populations in the form of Northern and Southern genetic components. We also demonstrated the utility of including reference populations when attempting to detect population substructure within closely related populations. Notably, we revealed substructure among Jews using a small STR panel, but only when additional samples with ancestry from African, Asian, and European continental populations were included in analyses. The recent study by Kopelman et al [28] used genotypic data from considerably more STRs than our study; however, we found that inclusion of additional populations and high information content per marker apparently compensated, in part, for the relatively low number of markers we used. We also suggest that the clustering patterns in their study may have been somewhat altered if they had not, in effect, assumed that Jewish populations were a product of Middle Eastern and European ancestry, only. Our results indicate that only with the inclusion of world-wide samples is it possible to infer proportions of world-wide ancestry in a highly migratory sample such as Jews with grandparents born on all continents.
Xu and Jin [41] demonstrated both European and Asian contributions to the Uyghur population of Western China when STRUCTURE and PCA were used to analyze European, East Asian and Uyghur sample data. They noted, however, that when world-wide HGDP-CEPH samples including Uyghur were analyzed in other studies, there appeared to be three parental populations for Uyghur: European, East Asian, and Central Asian [42,43]. Consistent with our results, this is an example of the value of including additional reference samples or parental populations in detecting subtle substructure and admixture in the populations to which they contributed.
Our finding that there is little Sub-Saharan African admixture in North African Jewish populations (average African ancestry component based on STRUCTURE results for samples with four grandparents from a given country: Libya (0.07), Morocco (0.03), Tunisia (0.09), Egypt (0.03)) is consistent with findings from the Behar mtDNA study, which detected low rates of Sub-Saharan African, and no North African, maternal contribution to Moroccan, Tunisian, and Libyan Jewish populations. Our findings of significant Sub-Saharan African ancestry in Ethiopian Jews (0.55) in contrast to Yemenite Jews (0.03) were also consistent with mtDNA [26] and Y-chromosomal [23] studies. Our analyses based on autosomal data cannot rule out local North African contribution to the North African Jewish populations studied here. However, given the finding that non-Jewish North African populations have approximately 25% [44] Sub-Saharan African mtDNA contribution, we would have expected a significant Sub-Saharan component in the North African Jewish populations that we included. That we did not find such a component may reflect relative reproductive isolation among the North African Jewish communities and their host populations.
Our results on population substructure reflect the influence of numerous factors, including the recent founding of Ashkenazi vs. non-Ashkenazi Jewry, gene flow between these groups and between Jewish and non-Jewish populations, a highly complex migration history, and the characteristics and limitations of the marker set   [41] as well as the small set of markers.
Although within each group there is a high degree of variability among individual assignment values, geographic patterns are seen in the average North/South percent assignment values between groups as defined by AJ or NAJ, grandparental world region of birth, or grandparental country of birth. For AJ and NAJ these differences were found to be statistically significant (table 1) (significance was not tested for differences between regions or grandparental country of birth because sample sizes varied greatly). Thus, even based on data from a small marker set, AJ are not a homogeneous population. For non-Ashkenazi Jews, the small measured Sub-Saharan and small inferred Northern African contribution in all Jewish communities of African origin other than Ethiopian may be due to a greater degree of endogamy within those communities.
After demonstrating the feasibility of distinguishing Ashkenazi Jews from non-Jewish Europeans using autosomal SNPs, Need et al [10] analyzed these samples in conjunction with a number of Middle-Eastern populations and concluded, in contrast to Behar et al [6], that the differentiation of Ashkenazi Jews from non-Jewish Europeans was due to their Middle-Eastern ancestry rather than a bottleneck event because the Ashkenazi Jewish sample had high heterozygosity. Although we    [45][46][47][48]. Jews settled in India as early as the 7th century CE, possibly from Iran or Yemen, and incorporated local residents as well as slaves into the population [48].

Theoretical implications
The large Jewish-Israeli sample in this study was collected as part of a greater study on health-related phenotypes; it contained 526 individuals with grandparents from a broad geographic range: Northern and Southern Europe, Russia, North Africa, Ethiopia, the Middle East, Central Asia, and India. While it is not surprising that this sample (based on historical accounts, mtDNA and Y-chromosome studies, and geographic range) has components of all three major continental populations that vary based on recent ancestral origin, these differences in ancestry were not detected without the addition of putative parental population or reference samples in STRUCTURE analysis. The significance of the resultant ancestry components could not have been evaluated in the absence of self-reported information on family history and identification. Such detail is not always available for population samples; however, it proved to be highly valuable in this case. Others have shown that the ability to detect population substructure is dependent, in part, on sample size or the inclusion of reference populations [41,49,50]. For a sample in which populations have diverged recently or have low levels of genetic differentiation (such as Ashkenazi and non-Ashkenazi Jews), the ability to detect substructure increases with the amount of data available, with the total data being a result of information derived both from the number and informativeness of samples as well as the number and informativeness of markers [49,51]. This issue has great practical relevance for the substructure testing phase of association mapping studies in which cases and controls are from the same self-identified population group, particularly when increasing the number of AIMs is not an immediate option.
Numerous other genetic studies have shown that Jewish populations, while sharing ancient Middle-Eastern ancestry, have practiced exogamy or incorporated members of local populations to some extent [3-6, 8, 11, 20, 22, 23]. It is this gene flow from host populations, combined with genetic drift and possible local selection pressures, that has led to detectable substructure among Jewish populations, perhaps more so than would be expected based solely on time since population divergence.

Historical Implications
In contrast to Seldin et al [25], we showed varying Northern and Southern components among Ashkenazi Jewish populations. Non-Jewish population samples in the Seldin et al study were European or European American in origin, while our study included African-American and Asian samples as well as European Americans. In addition, the Jewish sample in Seldin et al included only three Sephardic (based on their nomenclature) Jews, too few to provide reliable information about this population. The Kopelman et al study [28], in addition to European Jews and non-Jews, included two North African Jewish populations as well as Middle Eastern non-Jewish populations. Our study included a large number of non-Ashkenazi Jews including those from North Africa, the Near East, Ethiopia, and Central and South Asia. We believe that our study demonstrates the effect of relative population genetic differences and total information from markers and individuals on clustering patterns. As we have described previously, our STR panel was chosen specifically for its high information content [50]. Although the major consideration in marker selection for this panel was the ability to differentiate between major American populations, we previously demonstrated that the same panel was, somewhat unexpectedly, also very useful at distinguishing different Asian populations [52].
The migratory history and origins of Ashkenazi Jews are less clear than those of non-Ashkenazi Jews. During the early Middle Ages in Europe, Jews lived in close proximity with their non-Jewish neighbors in small villages with constant interaction. Intermarriage, although periodically outlawed by host-country governments, occurred with some regularity [53]. In addition, Jewish Europe was never solely inhabited by Ashkenazi Jews. Some Jews expelled from Spain during the Inquisition settled in parts of the Ottoman Empire, which included the Balkans (present-day Turkey, Bulgaria, Greece, Bosnia, and Serbia), while others went to Italy, Holland, and France [54]. In fact, all subjects in this study who identified their grandparents as having come from Balkan countries also identified themselves as non-Ashkenazi, and those with two grandparents from Balkan countries identified the parent on that side as non-Ashkenazi.
We believe that the apparent Southern genetic component of those of European descent (Jewish or not), as well as that seen in Jews, is originally Middle Eastern. This is consistent to various degrees with previous results from Y-chromosomal [3,8], mtDNA [3,5,21] and autosomal evidence [25,26] as well as historical evidence. The large Northern component in all Ashkenazi populations included in this study indicates significant local contribution to these populations, which either occurred early in their histories (Germany and Ukraine) or in small increments over time (other Ashkenazi populations, as evidenced by high variance of Northern/Southern components). Among non-Ashkenazi Jewish populations sampled, although all have exhibited porous membership, Ethiopian and Indian Jewish communities have particularly significant local contributions to their gene pools.
The high variance of the Southern and Northern components for group affiliation, region of birth, and grandparental country of birth indicates that admixture and migratory events are recent [41]. This reflects both the complex migration histories of Jewish populations and the limitations of the marker set used here, including the possibility of homoplasious alleles interfering with accurate ancestral population assignment. Despite small sample sizes, the variance was small for individuals with four German (N = 6) (Southern σ² = 0.005, Northern σ² = 0.005) or four Ukrainian (N = 6) (Southern σ² = 0.003, Northern σ² = 0.005) grandparents, which may indicate that admixture events in these populations are older than those of other Jewish populations in our study.

Conclusions
Our results reinforce conclusions of previous characterizations of Jewish samples based on uniparentally-inherited segments of the genome. Jewish populations are not necessarily genetically homogeneous, either as a whole, within the Ashkenazi or non-Ashkenazi affiliations, or within a continent. Geographic gradients of genetic heterogeneity such as that observed here within what is seemingly one population have been shown empirically to confound association studies [26], but in the absence of a very large AIM panel, are correctable when information such as grandparental country or region of birth is used to create subsets of matched cases and controls [55,56].
Although clustering programs such as STRUCTURE are designed to assign proportions of ancestry to individuals without the necessity of including parental population information, when our mixed Jewish sample was analyzed without the EA, AA, Thai, and Chinese samples, the substructure within Jews was not apparent. While it is true that Jewish samples would be shown to contain substructure if analyzed with thousands of SNP markers or hundreds of STR markers, it is unlikely that subtleties contributed by Asian and African admixture would be detected without inclusion of world-wide reference samples. For example, Kopelman et al [28] used data from 678 STRs for four Jewish populations (Moroccan, Tunisian, Turkish, and Ashkenazi) combined with that of Middle Eastern and European populations and it was found that the Jewish populations had ancestry to varying degrees from both European and Middle Eastern populations. They do not find information regarding Asian or Sub-Saharan African admixture because that is not possible without the inclusion of samples from those regions. When they used STRUCTURE to analyze their Jewish samples, alone, the best fit for the data was two parental populations.
We demonstrated empirically the effect of reference population inclusion on the ability to cluster individuals in an admixed population. Studies commonly control for population stratification by genotyping only the study subjects with a panel of non-coding markers. However, when cases and controls have been matched (non-genetically) for ancestry and no other populations that could potentially contribute to admixture are included in the analysis, any existing substructure is unlikely to be detected. We suggest that samples of reference or proxy parental populations be included in the substructure testing phase of case-control association studies when the participants are sampled from potentially admixed populations such as populations residing in or originating from major human migratory pathways, urban populations, or American populations.
The total number of markers used in this study is quite small in comparison to many other available studies, but due to higher mutation rates and a greater number of alleles per locus, STRs provide much more information, on average, than SNPs for population assignment and population stratification [19]. The high variation and high mutation rates of STRs may backfire, however, when attempting to distinguish between populations that diverged long ago, as homoplasic alleles can accumulate under those circumstances. We found previously [52] that the tetranucleotide CODIS loci were not useful in distinguishing between AA and EA populations, while they were highly informative when distinguishing among more recently diverged Asian minority populations [57]. This marker set may be more useful for detecting recent admixture or founding events, such as those which formed the Jewish populations in question here.
These sub-samples do not add up to the total mixed Jewish sample due to missing self-reported parental or grandparental information. The Asian populations in this study were collected as part of an ongoing gene mapping study. Samples of individuals self-identified as being of Thai and Chinese ancestry were obtained from a blood drive in Bangkok, Thailand. The Thai and Chinese samples used in this study were selected to include only subjects for whom all four grandparents were reported to have the same self-identified ethnicity as the subject. For analyses, the Thai and Chinese samples were combined into one sample labeled Asian. The dataset also included samples of unrelated AAs and EAs, a subset of a sample described elsewhere, for which population group self-identifications were previously confirmed via Bayesian marker clustering [50]. Note that the EA sample was not screened to exclude AJ or NAJ subjects, so it is likely to contain small numbers of them.
The EJ sample was obtained from the National Laboratory for the Genetics of Israeli Populations, Sackler Faculty of Medicine, Tel Aviv University, Israel. This work was approved by the Yale University School of Medicine Human Investigation Committee HIC#12183, New York State Psychiatric Institute Institutional Review Board protocol#4753, Israel Board of the Ministry of Health Helsinki Committee for Genetic Trials #920050036, and the Department of Veterans Affairs Subcommittee on Human Studies #0008. All subjects provided informed consent as approved by the appropriate institutional review boards.

Markers and Genotyping
All samples were genotyped for thirty-two unlinked autosomal STR markers (with the exception of EJ, for which data are missing for D8S272). The amelogenin locus, included in the AmpFlSTR Identifiler PCR Amplification kit for sex identification, was not included in any analyses. All STR markers were analyzed on an ABI PRISM 3100 semiautomated capillary fluorescence sequencer. Data were scored using GeneMapper (ABI). We previously used this marker panel (with the addition of D1S196, D2S319, D7S657, D12S352, D14S68, which were not used here either because they were replaced with D7S2469 and D1S2628 or because of a large number of failed genotypes) to determine and statistically correct for ancestry in case-control studies and genome-wide linkage studies [58][59][60][61] and in population genetics studies [52].

Statistical Analyses

Population Differentiation
The program STRUCTURE 2.2 [36,37] uses Bayesian clustering of multilocus genotypes to assign individuals to populations, estimate admixture proportions for individuals, and infer the number of parental populations (K) for a sample. Because variance of STRUCTURE results increases with small sample sizes each run was repeated 5 times with all 32 STR markers in the panel. For analyses that included AA, EA, Thai, Chinese, EJ and mixed Jewish populations, the parameters used were K = 2 through K = 9 and 50,000 burn-in and 50,000 Markov chain Monte Carlo (MCMC) iterations. For analyses that included only the EA and mixed Jewish samples in the absence of all other samples, the parameters used were K = 2 through K = 4 and 50,000 burn-in and 50,000 Markov chain Monte Carlo (MCMC) iterations. The self-reported population of origin was not used as additional data by STRUCTURE and the presence of admixture was assumed.
The authors of STRUCTURE recommend using the maximal value for lnP(D) to determine the best value of K for the data. However, it has been observed that lnP(D) will plateau while continuing to increase slightly as assumed K increases past the correct K. Therefore, identifying the K for which lnP(D) is greatest may not be sufficient to identify the correct (underlying) K. We employed StructureSum, an R script that uses the output from STRUCTURE to identify the K for which lnP(D) is maximized while both $|\ln P(D)_{K+1} - (\ln P(D)_{K} - \ln P(D)_{K-1})|$ and the variance of lnP(D) are minimized. This identifies the highest value of K prior to the plateau of lnP(D) [38].
STRUCTURE runs were unsupervised, using the admixture model and correlated allele frequencies. STRUCTURE assigns cluster labels arbitrarily in each run, so the correspondence between clusters from different runs is not obvious. To account for this cluster label switching, we used the fullsearch option and nonweighted alignment procedure in CLUMPP version 1.1.1 [62] to identify corresponding clusters across the set of five runs for a given K and to produce average membership coefficients for each individual for each cluster. These average assignment values were used with the program DISTRUCT [63] to produce graphs of STRUCTURE output.
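The label-switching problem can be illustrated with a short Python sketch. It is a simplified stand-in and not CLUMPP itself: it aligns every run to the first run by exhaustively trying all column permutations and minimizing squared differences, whereas CLUMPP optimizes a similarity measure across all runs jointly. All function names and the toy membership matrices are hypothetical.

```python
from itertools import permutations


def align_to_reference(reference, run):
    """Permute the cluster columns of `run` to best match `reference`.

    Both arguments are lists of rows (one per individual); each row holds
    K membership coefficients. Every permutation of the K labels is tried,
    which is only feasible for small K.
    """
    k = len(reference[0])

    def squared_distance(perm):
        return sum(
            (ref_row[c] - row[perm[c]]) ** 2
            for ref_row, row in zip(reference, run)
            for c in range(k)
        )

    best = min(permutations(range(k)), key=squared_distance)
    return [[row[best[c]] for c in range(k)] for row in run]


def average_runs(runs):
    """Align all runs to the first one, then average coefficients per individual."""
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, r) for r in runs[1:]]
    n_individuals, k = len(reference), len(reference[0])
    return [
        [sum(a[i][c] for a in aligned) / len(aligned) for c in range(k)]
        for i in range(n_individuals)
    ]


# Toy example: the second run carries the same information with labels swapped.
run_a = [[0.9, 0.1], [0.2, 0.8]]
run_b = [[0.1, 0.9], [0.8, 0.2]]
averaged = average_runs([run_a, run_b])
```

Without the alignment step, averaging `run_a` and `run_b` column by column would give 0.5 everywhere and hide the structure that both runs agree on.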
For Jewish-Israelis with four grandparents from the same country, for K = 4, individual assignment values produced by CLUMPP were averaged to arrive at values for Northern, Southern, Asian, and African ancestral components for the Jewish population of that geographic location. This was also done for AJ, NAJ, and ANAJ.
A two-sample t-test via Monte Carlo permutation was used to test for significant differences between the individual AJ and NAJ assignment values, with 10,000 samples simulated under H₀ (no significant difference between average assignment values for the two populations). The t statistic of the observed data was then compared to those of the simulated data to obtain a p-value for mean differences in assignment values between AJ and NAJ.
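A minimal Python sketch of such a permutation test follows. Several details are assumptions made for illustration rather than the authors' implementation: the unequal-variance (Welch) form of the t statistic, the two-sided comparison of absolute values, the add-one p-value estimate, and all function names.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(x, y):
    """Two-sample t statistic with unequal variances (Welch form, an assumption)."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    difference = mean(x) - mean(y)
    if standard_error == 0.0:
        # Degenerate case: no spread in either group.
        return 0.0 if difference == 0 else float("inf")
    return difference / standard_error


def permutation_t_test(x, y, n_permutations=10_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(t_statistic(x, y))
    pooled = list(x) + list(y)
    n_x = len(x)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random under H0
        permuted = abs(t_statistic(pooled[:n_x], pooled[n_x:]))
        if permuted >= observed:
            at_least_as_extreme += 1
    # Add-one estimate so the reported p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

For example, `permutation_t_test(aj_northern, naj_northern)` would compare two lists of individual Northern assignment values (hypothetical variable names); with this estimator the smallest reportable p-value is 1/(n_permutations + 1).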

Hardy Weinberg Equilibrium (HWE)
Tests for deviation from Hardy-Weinberg equilibrium expectations were conducted using GENEPOP 4.0 [64] globally for all loci using sub-option 5, the exact test for HWE in which H₁ = heterozygote excess, based on a Markov chain method. The parameters used were 5000 dememorizations, 1000 batches, and 5000 iterations per batch. The parameter values were increased from defaults until the observed standard error for p-values was less than 0.01. For the mixed Jewish sample this was performed for the sample as a whole, as well as for the AJ, NAJ, and ANAJ subsets. We used an exact test for multiallelic markers because Chi-squared tests are inappropriate for such analyses [65].

Marker Information Content
Markers were evaluated for delta (δ) [66], a measure of marker information content, reflecting the ability of a marker to differentiate statistically between populations. We have confirmed that this is a relevant measure for the markers we employed herein [50]. To arrive at δ, the absolute values of allelewise frequency differences between two populations are added and this sum is divided in half, i.e.,

$$\delta = \frac{1}{2} \sum_{i} \left| p_{A,i} - p_{B,i} \right|$$

where $p_{A,i}$ and $p_{B,i}$ are the allele frequencies for the i-th allele in populations A and B. The more effective the marker is at differentiating between populations, the higher the value for δ [50].
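The verbal definition of δ (half the sum of absolute allele-frequency differences between two populations) translates directly into code. The sketch below is illustrative only; the function name and the allele labels and frequencies in the example are made up and are not data from this study.

```python
def delta(freqs_a: dict, freqs_b: dict) -> float:
    """Half the summed absolute allele-frequency difference between two populations.

    Each argument maps an allele label to its frequency in one population.
    An allele missing from one population is treated as having frequency 0 there.
    """
    alleles = set(freqs_a) | set(freqs_b)
    return 0.5 * sum(
        abs(freqs_a.get(allele, 0.0) - freqs_b.get(allele, 0.0))
        for allele in alleles
    )


# Hypothetical allele frequencies at a single STR locus.
population_a = {"10": 0.5, "11": 0.5}
population_b = {"10": 0.25, "11": 0.25, "12": 0.5}
example_delta = delta(population_a, population_b)
```

Under this definition δ is 0 when two populations have identical allele frequencies and 1 when they share no alleles, so a panel average near 0.13 indicates modest per-marker differentiation.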

Authors' contributions
JBL designed the study, carried out statistical analyses and drafted the manuscript. DH participated in project coordination, sample collection, and the writing of the manuscript. AS and AM participated in sample collection in Thailand. HRK carried out sample collection in the United States. RTM participated in project coordination and sample collection. EA and BS participated in project coordination and sample collection in Israel. JG participated in study design and supervision, project coordination, sample collection, and the writing of the manuscript. All authors read and approved the final manuscript.