Developing a set of ancestry-sensitive DNA markers reflecting continental origins of humans
© Kersbergen et al. 2009
Received: 13 March 2009
Accepted: 27 October 2009
Published: 27 October 2009
Skip to main content
© Kersbergen et al. 2009
Received: 13 March 2009
Accepted: 27 October 2009
Published: 27 October 2009
The identification and use of Ancestry-Sensitive Markers (ASMs), i.e. genetic polymorphisms facilitating the genetic reconstruction of geographical origins of individuals, is far from straightforward.
Here we describe the ascertainment and application of five different sets of 47 single nucleotide polymorphisms (SNPs) allowing the inference of major human groups of different continental origin. For this, we first used 74 cell lines, representing human males from six different geographical areas and screened them with the Affymetrix Mapping 10K assay. In addition to using summary statistics estimating the genetic diversity among multiple groups of individuals defined by geography or language, we also used the program STRUCTURE to detect genetically distinct subgroups. Subsequently, we used a pairwise FST ranking procedure among all pairs of genetic subgroups in order to identify a single best performing set of ASMs. Our initial results were independently confirmed by genotyping this set of ASMs in 22 individuals from Somalia, Afghanistan and Sudan and in 919 samples from the CEPH Human Genome Diversity Panel (HGDP-CEPH)
By means of our pairwise population FST ranking approach we identified a set of 47 SNPs that could serve as a panel of ASMs at a continental level.
Forensic DNA profiling of biological crime scene samples of human origin is typically performed for identification of individuals, by matching the DNA profiles obtained to victims, suspects, or occasionally, missing persons. When a multilocus DNA profile from a crime scene sample fully matches the DNA profile of, say, a suspect, forensic laboratories often report a random match probability, commonly referred to as "match probability". The match probability reflects the probability of sampling a random individual with an identical multilocus genotype as found in the crime-scene stain (and matching the suspect) and is computed for different populations using population specific allele frequencies for which a priori assumptions about the genetic ancestry of the crime scene sample "donor" have to be made [1, 2]. Inferring the geographic origin of the (unknown) donor can give an extra dimension to criminal investigation, e.g. narrowing down or diverting the DNA dragnet for the police [1–3]. However, ascertainment of the genetic ancestry of sample donors can only be done reliably if two basic requirements are met: (i) the availability of a suitable set of ancestry-sensitive markers (ASMs), and (ii) a reference database with global frequencies of such ASMs. These two requirements and the potential importance of the ability to be able to identify the geographical origin of unknown suspects of crimes was recognized by Dutch politicians, and in 2003 the Dutch parliament passed a law providing the legal frame work that could not only facilitate research on ASMs but also provided the legal possibility to actually apply them in forensic case work .
Loci exhibiting significant allele frequency differences among groups of individuals stratified on the basis of geography are often called ancestry-informative markers (AIMs). However, as indicated above, we prefer to use the term ancestry-sensitive markers (ASMs) because in our opinion ancestry sensitivity better reflects the uncertainties related to such marker whereas ancestry informativity implies that they clearly reveal ancestry. ASMs are already utilised for genetic association studies, drug response testing, admixture mapping and reconstructing evolutionary histories [5–9]. In a strict sense, the first ASMs to be developed and used for forensic applications were mitochondrial DNA (mtDNA) and Y-chromosomal polymorphisms [10–12]. An obvious disadvantage of the use of mtDNA and Y-chromosomal polymorphisms is the fact that they only provide information concerning the strict maternal or paternal ancestry. Therefore, they can only detect genetic admixture, when contrasting information of both marker types is obtained, but such admixture cannot be quantified based on such markers. By means of mtDNA and Y-chromosomal polymorphisms it is also extremely difficult to differentiate recent from ancient admixture episodes in the history of an individual. Especially in a forensic application, where ancient admixture events are rarely relevant but where recent admixture events can leave misleading mtDNA and Y-chromosomal signatures, this is a clear disadvantage. In order to obtain a more accurate inference of an individual's genetic ancestry including the identification of possible recent genetic admixture, autosomal DNA markers are more suitable.
Different sets of autosomal ASMs with at least a continental resolution have already been identified and published [5, 10, 13–15], but the available marker panels are still far from perfect, mainly due to different ascertainment strategies and reference datasets. Different approaches have been suggested to ascertain ASMs, including (i) BLAST searches comparing sequence variation among individuals from different populations , (ii) selecting SNPs in databases by focussing on extreme allele frequency differences values between populations , and (iii) exploring existing SNP panels with tens to hundreds of genetic markers via commercially available SNP-genotyping arrays [3, 17–19]. Many of these previous studies first used non-genetic criteria such as geography or language in order to a priori stratify the study populations. Especially in the case of geographically distinct populations that could possibly share a similar genetic make-up, this is not always the most optimal strategy.
Here, we apply STRUCTURE, a software package that is widely used for detecting subgroups of genetically similar individuals among large populations [16–21], to identify suitable sets of ASMs, using the single nucleotide polymorphism (SNP) genotyping results obtained by means of the Affymetrix GeneChip® Human Mapping 10K Array Xba131 (Mapping 10K array) [22–24] in the Y Chromosome Consortium (YCC) cell line panel . With this panel we compared five different sets of ASMs that were identified on the basis of two statistical estimates summarizing overall genetic diversity among geographically predefined populations, FST [, see also the methods section] and the Informativeness of ancestry Index (I n) , and two ascertainment procedures based on pairwise FST comparisons. This resulted in the selection of a set of SNPs that could be used as forensically relevant ASMs. Subsequently, these ASMs were further tested in two independent sets of samples: (i) individuals from Somalia, Afghanistan, and Sudan, and (ii) the CEPH Human Genome Diversity Panel (HGDP-CEPH) [27, 28].
Further analyses of the STRUCTURE results of the last set (4gen pairwise FST SNP-set) via the coda package in the program R revealed full convergence of the Markov Chain of Monte Carlo (MCMC) in the STRUCTURE analyses in the Asian and Eurasian cluster, but not in the African and Native American clusters, suggesting an even more subtle subgrouping within these two groups (results not shown). A clusteredness analyses showed that the best fit for the total YCC dataset of 8,474 SNPs and of 47 ASMs (4gen pairwise FST) was four major clusters (See additional file 2: rs-numbers of ascertained ASMs and additional information). As a final test, we analysed the 4gen pairwise FST derived set of ASMs by means of Haploview in order to exclude strong linkage disequilibrium (LD) among the 47 SNPs. No significant LD could be detected (results not shown).
The Mapping 10K array applied to the YCC samples proved to be a valuable tool for the identification of a set of ASMs with continental resolution. We found that STRUCTURE is a very useful program at various stages of the identification of ASMs. We demonstrated that the geographical distribution of samples is not necessarily the most optimal a priori stratification criterion. Barbujani and Belle  emphasize that if different sets of genetic data are analysed or the same dataset analysed with different methods, the same clusters should be found if the differences between populations and continents are large enough and consistent across loci. In our study, both different sample sets as well as different ascertainment methods were used and we consistently found four genetically distinct clusters of individuals corresponding to their different geographical origins being the best "fit" (Sub-Saharan Africans, Eurasians, East Asians (including Oceanians), and Native Americans). We also learned that ASM ascertainment by means of a single FST (classical) estimate across all population samples or groups is not the most optimal strategy. Classical FST values indicate the distinctive power of a SNP, however cannot identify directly for which population(s) this particular SNP could serve as a suitable ASM. For datasets with markedly different populations, classical FST will most likely render SNPs that are specific for only the most distinct population(s), as we showed here for differences between Africans and non-Africans (additional file 1: FST analyses of the five different sets of ASMs among the 74 YCC samples). Although STRUCTURE revealed that ASMs identified using I n (especially 4gen I n) resulted in a set of ASMs with a good clustering of the individuals in the YCC-panel into four major genetic groups, retrospective FST analyses revealed that the I n based sets of 47 ASMs predominantly identified markers that distinguish African from non-African individuals. Only the use of 4gen pairwise FST values provided ASMs with an optimal continental resolution of the YCC samples, and evenly distributed pairwise FST values among all population pairs when compared with the other ascertainment criteria.
According to Gao and Starmer  ideal loci for distinguishing between populations, are those that have a fixed allele in one population and are absent in all other populations. In our method of screening (Mapping 10K) such SNPs could not be found due to the SNP selection procedures by Affymetrix, although other studies have been able to locate such SNPs using different screening methods. One example is the SNP analyses of skin pigmentation genes (e.g. MATP) by Norton et al. .
The apparent improvement of the genetic clustering of individuals using 47 SNPs (4gen I n and 4gen pairwise FST) instead of 8,474 appears peculiar at first, but can readily be explained by interference of the SNPs with evenly distributed frequencies among populations. As such, the complete set of 8,474 SNPs reflects the most unbiased genetic make-up of all populations sampled, whereas the final set of 47 (or 34) ASMs only reflect the subtle genetic variation among populations maximized by means of extremely selected SNPs. STRUCTURE results (K = 4) were not altered after the addition of individuals originating from countries not present in the YCC panel (such as Somalia, Sudan, and Afghanistan). This can be interpreted as a first indication that the ascertained SNPs may indeed be reliable ASMs. Further analyses on a second much larger number of independent samples, the HGDP-CEPH (H919), proved independently the value of the selected ASMs for identifying the geographic origin of a DNA sample.
Our set of 47 ASMs identified using pairwise FST was also able to detect mixed genetic ancestries. The individuals sampled in Algeria (Mozabite), appear to have a mixed ancestry between African and Eurasian descendants. The individuals from Hazara and Uygur exhibit both Eurasian and Asian genetic ancestries. This was also concluded by Rosenberg et al.[17, 32] on the basis of a screening of samples from the HGDP-CEPH by means of 783 microsatellites and 210 insertion/deletion markers, as well as by Jakobssen et al. and Li et al. who analysed partially the same samples by means of > 500,000 and > 650,000 SNPs, respectively. As such, our results with the 47 ASMs following the 4gen pairwise FST SNP ascertainment, are surprisingly similarly to the results based on the 40 most informative STRs identified by Rosenberg et al. . Increasing the number of SNPs considerably does not necessarily improve the genetic clustering of the same HGDP-CEPH samples. For instance, Jakobssen et al  were only able to identify the Native Americans (NAM) as a distinct cluster at STRUCTURE analysis for K = 6, not at lower values of K. An even further increase in the number of SNPs (Li et al. ) resulted in STRUCTURE results at K = 4 that were similar to those obtained by us with 47 ASMs. Only upon increasing the values of K, Li et al.  were able to produce much more refined regional subclusterings. It is in this respect important to note that our final aim is to use a set of ASMs for forensic DNA research purposes. This imposes an important constraint on the amount of DNA available for an ancestry test, since rarely more than a few nanograms of input DNA is available, an amount of DNA that is not sufficient for aforementioned mass SNP genotyping platforms as used by us in this study (but also see e.g. [33, 34]), but would be sufficient for a limited number of multiplex SNP assays enabling the genotyping of 40-50 SNPs. Reducing the 47 ASM-set ascertained based on pairwise FST to 34 ASMs did not alter the STRUCTURE analyses with K = 4 but did show differences at other values of K.
Since the ASMs ascertained in this study were developed for distinguishing between the four major continental groups, individuals from other populations (e.g. Oceania (n = 2)) that were not (or under) represented in the YCC panel for marker ascertainment, were initially clustered within their most likely source population such as (East-)Asia in the case of Oceania. However, in the independent HGDP samples Oceania could be separated from East Asians at least with 47 ASMs and at K = 6. This illustrates that although the 47 ASMs were not explicitly selected to separate sub-groups within major continental regions, they do prove to be useful for this purpose at least in some cases.
Our pairwise FST method for ascertaining ASMs ensures the absence of bias towards the African population, which usually shows the largest genetic differentiation compared to all non-African groups. This can easily be seen in the many studies analysing the HGDP-CEPH panel. In most analyses involving STRUCTURE or similar analytical tools the first split of the total group of individuals always could be related to a split between African versus non-African samples (e.g. ). At K = 2 our 34 ASMs, but not our set of 47 ASMs, showed a split between African and all non-African samples. A similar less prominent first split between African and non-African samples was also reported in other studies [17, 34–36]. Despite these differences in clustering patterns for lower values of K, most studies report an identical major clustering pattern at K = 4 (Africa, Eurasia, East Asia (including Oceania) and Native America).
The ASMs ascertained in our study did not allow the differentiation among Eurasian individuals, i.e. individuals from Europe, Middle East as well as West and South Asia, which is similar to earlier findings based on a large number of STRs . Finding ASMs with higher levels of specificity among Eurasian populations would improve the analyses considerably but obviously needs a different sampling design. Recently, two other largely overlapping studies were published [37, 38] and demonstrated that the identification of the geographic sub-region of origin of an individual within Europe is possible within certain limits.
Other groups have also identified sets of ASMs that could differentiate the major geographical regions [39, 40]. One example of such a different set of 34 SNPs is identified by Phillips et al. . These SNPs do not overlap with our set of 34 and 47 ASMs, and were specifically sought in close proximity of genes that have been subjected to strong regional positive selection in the recent past . Their set is able to distinguish between African vs. European, Middle Eastern - Central/South Asian vs. Oceanian, vs. East Asian vs. (Native) American groups also using the HGDP samples, similar to our findings, although at K = 4 it was not possible to distinguish Native American individuals from East Asian and Oceania. As such this is yet another indication that many other different sets of ASMs could be identified, depending on the a priori selection criteria.
By means of a pairwise FST ranking approach we identified a set of 47 SNPs that could serve as a panel of ASMs at a continental level. Multiplex genotyping (e.g. by means of SNaPshot® genotyping) of such a restricted number of SNPs is feasible on the basis of only a few nanograms of input DNA. This not only enables genotyping these ASMs in DNA samples from population studies, but also makes it possible to analyse the limited amounts of DNA from forensic crime scene samples. Obviously, using such a set of ASMs and reporting the results in terms of prediction of ancestry in forensic casework is still in its infancy. The use of this set of ASMs (and others) among a much large number of samples from known globally distributed locations in order to verify and refine the possibility to predict the ancestry of an unknown sample is one of our next research priorities, as well as developing a simple statistical framework one can use to report ancestry probabilities.
We used a set of male DNA samples, released by the Y Chromosome Consortium (YCC) . This set consists of 74 globally dispersed males and is subdivided in six geographical regions; (Central) Africa, South Africa, Asia (including Pakistan), Europe, Northern Asia (Russia and Siberia), and Native America. Each region consists of roughly similar numbers of individuals . The samples in the YCC were either donated by volunteers who gave informed consent, donated by other research groups or purchased from the Coriell Institute . For a first verification test 22 unrelated individuals from Somalia (n = 5), Afghanistan (n = 12), and Sudan (n = 5), selected from our immigration paternity casework samples, were genotyped for the set of ASMs finally ascertained. For the use of samples from these latter three populations, permission was granted by the responsible Dutch immigration authorities under the condition that the samples would remain fully anonymous whilst retaining the country of birth of each individual. Subsequently, proof of principle was obtained by testing the CEPH Human Genome Diversity Panel (HGDP-CEPH) . The complete HGDP-CEPH consists of 1,064 individuals from 51 globally distinct populations. The use of HGDP-CEPH samples is fully explained by the initiators of the sample repository . All selected ASMs were also genotyped in the full HGDP-CEPH. Before statistical analyses, all sample duplicates, samples with labelling errors, and all closely related individuals where removed, resulting in a set of 952 HGDP-CEPH individuals (H952) [7, 17, 41, 42]. During the course of our analyses we found out that H952 contained 33 YCC panel samples which were subsequently removed resulting in a final non-overlapping HGDP-CEPH panel of 919 unrelated individuals (H919).
Throughout this manuscript we use the word "ascertainment", instead of other often used words as "identified" or "selected" to indicate the full process of ASM marker discovery. This process, due to the nature of the markers used in this study, did not use objective and strict criteria or thresholds of, e.g. allele frequencies.
All 74 individuals from the YCC panel and the additional 22 individuals from Somalia, Afghanistan, and Sudan were genotyped for the 11,555 SNPs included in the Mapping 10K array according to the Affymetrix protocol for the GeneChip® Mapping Human 10K Array (GeneChip® Mapping Assay Manual). These SNPs have a median physical distance of 105 kb, an average distance 210 kb, and an average heterozygosity of 0.37 in the Affymetrix test panel. After hybridisation, the arrays were washed, stained, scanned and analysed with the software supplied by Affymetrix (Affymetrix Inc., Santa Clara, CA) .
The Mapping 10K arrays produced a dataset of 878,180 SNPs (11,555 per individual), however not every SNP tested resulted in a reliable genotype call. On average, the arrays genotyped 92 percent of the SNPs per individual (default settings in Affymetrix software), which effectively represents approximately 10,635 SNPs. From this dataset we selected only those SNPs for which 90% or more of the 74 YCC individuals had a valid genotype. This resulted in 8,650 SNPs per individual. Subsequently, we excluded all SNPs on the X-chromosome (n = 176), leaving a final dataset of 8,474 SNPs. The 8,474 SNP-set was used for further analyses.
We initially explored the dataset of 8,474 SNPs among the 74 YCC individuals by means of the program STRUCTURE version 2.1 and 2.2 [see reference  and the STRUCTURE manual for detailed information]. STRUCTURE can be used to identify genetic clusters among a set of individuals on the basis of multilocus genotypes. Assuming admixture among populations, correlated allele frequencies, and no prior population information we used STRUCTURE to explore different number of possible clusters (K) from 1 to 8. For each K, a maximum of 20 runs were performed with a burnin length 20.000 iterations followed by 10.000 MCMC iterations We tested for convergence using the coda Package in the program R (available on http://www.r-project.org/). Clusteredness was tested following equation 3 from Rosenberg et al. . STRUCTURE was used (i) for the ascertainment of the most optimal number of distinct clusters, (ii) to subsequently test the performance of different sets of selected ASMs, and (iii) for the proof of principle analyses. The output of STRUCTURE is in the form of a structural map showing the relative estimated proportion of contribution of K ancestral populations (or clusters or subgroups) for each individual. In Figure 1 (showing the result of the best-fit- STRUCTURE analyses of 8,474 SNPs among the 74 YCC individuals) each vertical bar represents a single individual and each colour a distinct ancestral population. Upon running STRUCTURE, the best "fit" (i.e. the most optimal number of K) was four, hence the four different colours in this figure. Individuals are analysed without any prior population information, but are sorted once STRUCTURE is completed by their sampling population. The STRUCTURE output graphs in Figures 1, 2, 3, 4, and 5, were produced using Microsoft Excel and Microsoft PowerPoint based on the raw STRUCTURE output tables.
Our dataset comprised of six geographically defined groups (6geo). Instead of relying on this a priori sample stratification, we performed STRUCTURE analyses to obtain a more unbiased estimate on the number of (genetically) distinct subgroups. Structure analyses on the full dataset (8,474 SNPs) revealed a best fit at four genetically distinct subgroups (4gen). These four genetic subgroups reflect the four different continental origins of all samples.
In order to identify possible ASMs we initially used two estimators: FST (among population F-statistics)  and I n (Informativeness of assignment or ancestry) . Classical FST is a measure of population differentiation based on polymorphic genetic data, such as single nucleotide polymorphisms. It compares the genetic variability between any number of populations. When used among more than two a priori defined populations it provides an overall estimate and does not indicate which specific population or populations is genetically more distinct. The I n statistic computes the potential of assigning an allele to one of the defined (genetically or a priori) clusters considering all the clusters in the study (e.g. six geographical sampling areas or four major continental groups) .
In a combined dataset one or more subgroups or clusters could be more genetically distinct from the others. The I n statistic is proportional to the total amount of differentiation between populations and is expected to be less influenced by more genetically distinct subgroups. However, in classical FST calculations this could result in the ascertainment of ASMs specific for a single most genetically distinct cluster. To avoid this, we used a third strategy in which FST values (pairwise FST) were estimated by only comparing two clusters (genetically defined or a priori) at a time instead of two or more in classical FST analyses. Pairwise FST values were computed for all possible pairwise comparisons among the dataset. High pairwise FST values were sought by comparing the groups for all possible pairwise comparisons among the dataset, guided by the classical FST values. Pairwise FST were estimated following a geographically defined distribution of the dataset (6geo) as well as a genetically distinct distribution (4gen). For the classical FST values the 97.5 percentile upper boundary was calculated and all pairwise FST values higher than or equal to this value were given a value of one. SNPs with a value of one for all pairwise comparisons or certain specific ones (for example Asia versus all other continental groups except for instance Africa) were considered suitable for ascertaining as ASMs.
For the ascertainment of ASMs, all SNPs were simply ranked according to the highest values of FST and I n. Subsequently, the top 50 SNPs with highest classical FST and I n values were selected from the six geographically a priori defined groups (6geo) distribution of our dataset as well as the four genetic posteriori defined clusters (4gen) distribution (the latter identified through STRUCTURE analyses). The pairwise FST ascertainment of ASMs was restricted to the dataset with the genetic distribution (4gen) of individuals, since the geographic distribution (6geo) yielded too few group-sensitive SNPs due to erroneous a priori geographical assignments of samples. Initially, similar amounts of group-specific SNPs (ASMs) were obtained per major continental group. However, some SNPs could not be reproduced via the TaqMan assay and were omitted from the ASM-set, leaving a set of 47 ASMs. As the omission of these SNPs did not alter the STRUCTURE results, no new SNPs were selected as ASM and from the other SNP-sets (following from classical FST and I n computations) the last three SNPs from the top 50 SNPs were removed, leaving the top 47 loci for further analyses. To summarize, by means of 5 different ascertainment procedures we designed 5 different sets of 47 SNPs that could potentially serve as ASMs.
By means of Haploview  we tested the selected SNPs for possible linkage disequilibrium (LD), and no significant LD could be detected.
The selected ASMs were genotyped in the HGDP-CEPH via the TaqMan® Technology. Of each sample three nanograms of DNA were dried in open air in 384-well clear optical reaction plates (Applied Biosystems, Foster City, CA, USA). Two μl of amplification mix was added containing 1 μl Absolute QPCR rox mix (ABgene, Epsom, UK), and either 0.1 μl 20× Custom TaqMan® SNP Genotyping Assays (Applied Biosystems) or 0.05 μl 40× TaqMan® SNP Genotyping Assays (Applied Biosystems). Amplification (15' 95°C, 40 cycles 15" 95°C and 1' 60°C) was performed in a GeneAmp 9700 PCR cycler (Applied Biosystems) followed by a subsequent end-point reading on an Applied Biosystems 7900 HT Fast Real-Time PCR System according to the manual of the manufacturer. Assay numbers or primer and probe sequences in case of assay on design are provided as supplement material (see addition file 3: TaqMan Assay).
We thank Oscar Lao and Titia Sijen for valuable comments on an earlier version of the manuscript. We also wish to acknowledge the constructive remarks made by two anonymous reviewers. Based on their remarks we could substantially improve our manuscript. This work was supported by funds from the Netherlands Forensic Institute to PdK and MK. Further support was obtained by a grant from the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research (NWO) within the framework of the Forensic Genomics Consortium Netherlands.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.