Reconstructing recent human phylogenies with forensic STR loci: A statistical approach
© Agrawal and Khan. 2005
Received: 29 May 2005
Accepted: 28 September 2005
Published: 28 September 2005
Forensic Short Tandem Repeat (STR) loci are effective for individual identification and other forensic applications. Most of these markers have high allelic variability and high mutation rates, because of which they have seen limited use in phylogenetic reconstruction. In the present study, we carried out a meta-analysis to explore the possibility of using only five STR loci (TPOX, FES, vWA, F13A and Tho1) for phylogenetic assessment, based on the allele frequency profiles of 20 world populations and of north Indian Hindus analyzed in the present study.
Phylogenetic analysis was carried out using two different approaches, genetic distance and maximum likelihood, together with a statistical bootstrapping procedure involving 1000 replicates. The resulting tree topologies and PC plots were then compared with those obtained in earlier phylogenetic investigations. The compiled database of 21 populations segregated and resolved finely into three basal clusters with very high bootstrap values, corresponding to three geo-ethnic groups: Africans, Orientals, and Caucasians.
Based on this study we conclude that, if appropriate and logical statistical approaches are followed, even a small number of forensic STR loci is powerful enough to reconstruct recent human phylogenies despite their relatively high mutation rates.
Short Tandem Repeats (STRs), with repeat units of 2–6 base pairs, are among the most polymorphic markers reported to date. They exhibit substantial allelic variability due to a high rate of germline mutation. STR loci have a uniform and dense distribution throughout the genome and exhibit a high level of relatively stable polymorphism. All these features make them ideal candidates for diverse applications, including forensic applications, individual identification, paternity/maternity testing, fine-scale genetic mapping, and inter- and intra-group phylogenetic reconstruction.
However, a specific set of STRs is employed for each application, and this specificity rests solely on the properties of the STR loci involved and their suitability for that application. STR loci used for forensic purposes are those that possess numerous observed alleles, a high level of heterozygosity, high polymorphism information content and high power of exclusion. In contrast, STR loci preferred for the phylogenetic analysis of human populations are those that have substantially lower allele counts and carry signature alleles for specific populations [6, 7]. Still, a few studies show some overlap between the sets of forensic STRs and those studied exclusively for phylogenetic investigation. However, this overlap is not extensive and lacks any definitive rationale or design.
There are two schools of thought regarding the use of forensic STRs in phylogenetic studies. According to one view, the extremely high level of intra-group variation required of forensic systems, along with their high mutation rates, indicates a rapid diffusion of genetic variation and thus confers a greater risk of failing to detect convergent evolution among some populations. The other view is that the random noise generated by allelic variability in forensic systems is not strong enough to veil the evolutionary signals carried by these STR loci. Furthermore, the fine-scale resolution of forensic STRs may prove helpful in delineating genetic differences and affinities between closely related ethnic groups.
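Two of the forensic criteria named above, heterozygosity and polymorphism information content (PIC), can be computed directly from the allele frequencies at a locus. The following is a minimal sketch using the commonly used definitions (expected heterozygosity as one minus the sum of squared allele frequencies, and PIC as that value minus a correction for uninformative matings). The function names and the example frequencies are hypothetical and are not data from this study.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity at one locus: H = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def polymorphism_information_content(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = 0.0
    for i in range(len(freqs)):
        for j in range(i + 1, len(freqs)):
            cross_term += 2.0 * freqs[i] ** 2 * freqs[j] ** 2
    return 1.0 - homozygosity - cross_term


# Hypothetical allele frequencies for a single locus (illustrative only).
example_freqs = [0.4, 0.3, 0.2, 0.1]
h_example = expected_heterozygosity(example_freqs)
pic_example = polymorphism_information_content(example_freqs)
```

A locus with many alleles at moderate frequencies scores high on both measures, which is the property that makes it attractive for forensic work.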
Table 1: Populations compiled for the database of five forensic STR loci, with the number of samples analyzed for each population (rows include Middle Eastern Arabs and North Indian Hindus).
Phylogenetic assessment was carried out through two different approaches, genetic distance and maximum likelihood, together with a statistical bootstrapping procedure involving 1000 replicates. The resulting tree topologies and PC plots were then compared with those obtained in earlier phylogenetic investigations. The main question addressed in this meta-analysis is whether a limited number of forensic STR loci can yield accurate human phylogenies when the data are evaluated using proper statistical approaches.
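The bootstrapping step can be illustrated with a short sketch. It assumes that, for allele frequency data, each bootstrap replicate is formed by resampling the loci with replacement, so that a replicate has the same number of loci as the original data set and some loci may appear more than once. The function name and the seed are hypothetical, and this is not the SEQBOOT implementation itself.

```python
import random

# The five loci used in this study.
LOCI = ["TPOX", "FES", "vWA", "F13A", "Tho1"]


def bootstrap_loci(loci, n_replicates=1000, seed=0):
    """Return n_replicates lists of loci, each drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [[rng.choice(loci) for _ in loci] for _ in range(n_replicates)]


replicates = bootstrap_loci(LOCI)
```

A tree would then be built from each replicate, and the support for a grouping is the number of replicate trees in which that grouping appears.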
Table: Statistical parameters calculated from the allele frequency data of the five STR loci that are important criteria for good forensic loci, including power of exclusion and total exclusionary power.
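As an illustration of how per-locus powers of exclusion are usually combined into a total exclusionary power, the sketch below assumes that the loci are independent, so that the combined value is one minus the product of the per-locus non-exclusion probabilities. The function name and the input values are hypothetical and are not the values reported for the five loci.

```python
def combined_power_of_exclusion(per_locus_pe):
    """Combine per-locus exclusion powers: 1 - product(1 - PE_i)."""
    not_excluded = 1.0
    for pe in per_locus_pe:
        not_excluded *= (1.0 - pe)
    return 1.0 - not_excluded


# Hypothetical per-locus values (illustrative only).
combined_example = combined_power_of_exclusion([0.5, 0.5])
```

Under this assumption every additional locus can only raise the combined value, which is why a panel of several moderately informative loci can reach a high total exclusionary power.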
The NJ phylogenetic tree depicts three basal groups corresponding to three ethnic groups: Africans, Caucasians and Orientals. The Caucasian cluster has three monophyletic units: Austrian/German, Basque/Portuguese and USA/Canadian Caucasians. In total, 8 of the 14 nodes have bootstrap values above 50%. The most important among them are (i) the division between Africans and the other two groups, Caucasians and Orientals (863 of 1000 replicates), (ii) the division between Caucasians and the other two groups (960), and (iii) the division between Orientals and the other two groups (995).
The ML phylogram displayed a topology comparable to that of the NJ tree. It also depicted three basal nodes with a clear demarcation of Africans, Caucasians and Orientals. As in the NJ tree, all three basal monophyletic groups in the ML tree have bootstrap values above 50%: 931, 790 and 995 for Africans, Caucasians and Orientals, respectively. In both phylogenies, NJ and ML, North Indian Hindus clustered with the Caucasians, albeit with a low bootstrap value of 456.
The first and second principal components, which together account for 70.4% of the total variability (PC1, 39.3%; PC2, 31.1%), are plotted in Figure 1c. The Orientals are markedly separated from the Caucasians along the Y-axis (PC2) and from the Africans along the X-axis (PC1). Caucasians and Africans show a relatively clear division along both the X-axis (PC1) and the Y-axis (PC2), with Middle Eastern Arabs and Moroccan Arabs positioned near each other. Furthermore, the African populations clustered into two sub-groups corresponding to Central and North Africans. North Indian Hindus clustered with the Caucasians.
The five forensic STR loci proved highly successful in providing fine resolution for the reconstruction of recent human evolutionary histories. All three approaches used for phylogenetic reconstruction (NJ and ML tree topologies and PC-plot analysis) depicted strong racial partitioning and recovered accurate phylogenetic information about North Indian Hindus, in accordance with that derived from other, more renowned phylogenetic markers as well as historical evidence [10–12].
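Because the bootstrap values are reported as counts out of 1000 replicates, it can help to convert them to percentages when comparing them with the 50% threshold. The sketch below does this for the values quoted in the text. The dictionary keys are descriptive labels introduced here for illustration.

```python
N_REPLICATES = 1000

# Bootstrap counts quoted in the text for the main nodes.
support_counts = {
    "NJ: Africans vs others": 863,
    "NJ: Caucasians vs others": 960,
    "NJ: Orientals vs others": 995,
    "ML: Africans": 931,
    "ML: Caucasians": 790,
    "ML: Orientals": 995,
    "North Indian Hindus with Caucasians": 456,
}


def support_percent(count, n_replicates=N_REPLICATES):
    """Convert a bootstrap count into a percentage of replicates."""
    return 100.0 * count / n_replicates


# Nodes whose support exceeds the 50% threshold used in the text.
well_supported = {
    node: support_percent(count)
    for node, count in support_counts.items()
    if support_percent(count) > 50.0
}
```

On this scale the three basal divisions have between 79.0% and 99.5% support, while the placement of North Indian Hindus with Caucasians has 45.6% support and so falls below the threshold.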
The phylograms (NJ and ML) generated from the present data set were calculated with the NJ and CONTML algorithms. CONTML works on the assumption that the random action of genetic drift is the sole basis of the differences between allele frequencies in different population groups. In contrast, the NJ algorithm constructs a branching pattern from a matrix of genetic distances calculated with Nei's formula, assuming that both genetic drift and mutation cause allele frequency differences.
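The distance matrix that feeds the NJ algorithm can be sketched as follows. The text does not say which of Nei's distances was used, so this sketch assumes Nei's standard genetic distance, D = -ln(Jxy / sqrt(Jx * Jy)), where Jx, Jy and Jxy are the averages over loci of the sums of squared and cross-multiplied allele frequencies. The function name, the data layout and the example frequencies are hypothetical.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard genetic distance between two populations.

    pop_x and pop_y map each locus name to a list of allele
    frequencies, with alleles listed in the same order in both.
    """
    loci = list(pop_x)
    n_loci = len(loci)
    j_x = sum(sum(p * p for p in pop_x[locus]) for locus in loci) / n_loci
    j_y = sum(sum(q * q for q in pop_y[locus]) for locus in loci) / n_loci
    j_xy = sum(
        sum(p * q for p, q in zip(pop_x[locus], pop_y[locus]))
        for locus in loci
    ) / n_loci
    return -math.log(j_xy / math.sqrt(j_x * j_y))


# Hypothetical two-allele frequencies at one locus (illustrative only).
pop_a = {"TPOX": [0.5, 0.5]}
pop_b = {"TPOX": [0.9, 0.1]}
d_same = nei_standard_distance(pop_a, pop_a)
d_diff = nei_standard_distance(pop_a, pop_b)
```

Identical frequency profiles give a distance of zero, and the distance grows as the profiles diverge. Computing this value for every pair of populations would give the matrix from which a neighbor-joining tree is built.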
Both phylograms, NJ and ML, show broadly similar basal cluster patterns among the three geo-ethnic groups, indicating that genetic drift rather than mutation is the major contributor to these distance estimates. Both trees have a longer African branch than that of any other group. Such a patristic separation is also visible in the PC-plot analysis [Figure 1c]. The African populations clustered into Central African (Cameroon and Lisongo) and North African (Moroccan Arabs and Saharawasi Africans) groups. Such clustering has also been reported by Cavalli-Sforza et al., 2003, based on Fst genetic distances from polymorphisms of 120 protein-coding genes, and from Y-chromosome binary haplogroups [16, 17]. This sub-clustering further supports the utility of the five STR loci in deciphering accurate phylogenies even within the same geographical region. Middle Eastern Arabs occupy a branch near the Caucasians and, to some extent, near the Moroccan Arabs, suggesting a strong Caucasian element along with African admixture. This is consistent with the demic expansion of Middle Eastern genes, agricultural innovations and languages into north-west Africa [16, 17], and is further supported by the near-medial position of Arabs in the PC plot. Recently, Y-chromosome SNP analysis by Al-Zahery et al. 2003 has also revealed a similar pattern in other Middle Eastern populations. The European branching pattern differs slightly between the two trees, but both resolve completely, with Basques, Spaniards and Portuguese forming a cluster separate from the German/Austrian branch. North Indian Hindus clustered with Caucasians in both phylograms, which agrees with the findings of earlier studies. Bamshad et al. 2001, based on mtDNA HVR-I and HVR-II sequencing and Y-chromosome haplotypes, showed that North Indian populations have a high frequency of west Eurasian haplogroups.
North Indian Hindus are essentially Indo-Aryan speakers, who invaded from the steppes of central Asia and then settled in the Indus valley in northwestern India.
The major finding of the present study is the effectiveness of a limited set of five forensic STR loci in resolving human phylogenies in a manner similar to that reported elsewhere with a much larger number of loci. The study also highlights the utility of combining varied statistical approaches to reach a definitive conclusion. Our study has an advantage over some successful reports, such as that of Bowcock et al., 1997, which showed substantial phylogenies but with a much larger set of 30 STR loci. Similarly, Perez-Lezaun et al. 1999 used 20 STR loci and computed Fst-based distances that depicted a similar separation of inter- and intra-ethnic groups. In the same study, however, the phylogenetic tree based on DSW distance gave a diffuse picture, with a trifurcation into two Caucasoid groups and one African group.
Various attempts at phylogenetic reconstruction using forensic STR loci have also been made in the recent past, such as that of Budowle and Chakraborty, 2001, who studied the 13 CODIS loci. Their phylogenetic assessment, however, was confined to simple NJ and UPGMA trees and distance measures, which yield a single output tree. To overcome this, we incorporated both phylogenetic approaches, distance and optimality criterion, along with statistical bootstrapping, which yields 1000 trees from which a consensus tree is built. In this regard, a successful attempt was made by Rowold et al. 2003, who compiled 10 geographically and racially different populations typed at five forensic STR loci. By incorporating a different set of STR loci, we were able to compile a larger database of 21 populations.
Overall, the analysis of five forensic STR loci depicted a strong racial phylogeny, indicating that high heterozygosity and/or numerous observed alleles do not necessarily interfere with the phylogenetic information content of a locus, provided that the allele frequency distributions of the populations differ significantly. Notably, a larger number of alleles increases the chance that signature alleles are present in segregating populations. Despite all the potential problems associated with forensic STR loci, including high mutation rates, the successful resolution of genetic differences between and within geo-ethnic groups suggests that, if well-defined statistical approaches are followed, even a small number of forensic STR loci is powerful enough to reconstruct human phylogenies.
A total of 1000 unrelated individuals were randomly selected. Regional addresses and detailed computerized lists were prepared before sample collection. Random numbers were generated by computer, and samples were collected from different collection sites in Uttar Pradesh: Lucknow, Kanpur, Faizabad, Basti, Gonda and Agra. Whole blood was obtained by venipuncture and collected in EDTA vacutainer tubes. Three-generation pedigree charts were prepared to ensure that all the sampled individuals were unrelated. The ethical committee of the institute approved the study, and blood samples were taken after obtaining informed consent from the subjects.
DNA was extracted by the phenol-chloroform method as described by Comey et al. 1993 and purified by ethanol precipitation. All five STR loci were detected by PCR. PCR amplification was performed using flanking primers described elsewhere. The amplified product was separated on 9% PAGE and detected by silver staining.
Twenty geographically targeted populations were selected from the forensic literature, while the data for north Indian Hindus were generated in our laboratory (Agrawal et al., unpublished data). The selection criterion was to cover the major geographical and geo-ethnic groups, i.e. Africans, Caucasoids and Orientals. All the selected populations have allele frequency data for the five STR loci. To achieve a large sample size, and to overcome the problem that some studies focus on only two or three STR loci, allele frequency profiles of different STR loci analyzed in different population samples of the same geographic or ethnic origin were also included. Wherever possible, however, care was taken to include allele frequency profiles from the same set of samples for the different markers. For example, the same 65 samples of the Cameroon population were used for the allele frequency data of all five STR loci, whereas large pooled samples were used for other groups such as Germany, Portugal, Italy, China and Japan. To avoid discrepancies, the number of samples genotyped for each STR locus in each population and the source of the allele frequency data are shown in Table 1. Most of the populations compiled in the database are pooled samples from different parts of the respective country. Notably, Sardinians are excluded from Italians, Azores from Portugal and the Canary Islands from Spain. Middle Eastern Arabs included Arabs mainly from Saudi Arabia, Qatar and Yemen.
Allele frequencies were calculated by a simple gene count method. Heterozygosity, HWE, PIC and power of exclusion were calculated using Cervus v1. Further statistical analysis was based on the allele frequency distributions of the five STR loci. A 1000-replicate bootstrap data set was generated with the SEQBOOT option in PHYLIP version 3.5c. Distance values were estimated using Nei's formula, and a phylogeny was inferred with the neighbor joining (NJ) option in PHYLIP version 3.5c. Phylogenetic reconstruction was also carried out by maximum likelihood (ML) from the STR frequency distributions (CONTML in PHYLIP version 3.5c). Finally, a principal component (PC) analysis was performed with POPSTR, and the first and second PCs were plotted as described elsewhere.
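The gene count method mentioned above amounts to counting each allele across all genotypes and dividing by the total number of chromosomes sampled (twice the number of individuals at an autosomal locus). A minimal sketch follows. The function name and the example genotypes, given as repeat numbers, are hypothetical and are not data from this study.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Gene-count allele frequencies at one autosomal locus.

    genotypes is a list of (allele_1, allele_2) pairs, one per individual.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total_alleles = 2 * len(genotypes)  # two alleles per individual
    return {allele: n / total_alleles for allele, n in sorted(counts.items())}


# Hypothetical genotypes for four individuals (illustrative only).
example_genotypes = [(8, 8), (8, 11), (9, 11), (8, 9)]
example_frequencies = allele_frequencies(example_genotypes)
```

Here allele 8 is counted four times among eight chromosomes, giving a frequency of 0.5, while alleles 9 and 11 each have a frequency of 0.25.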
This work was supported by the Indian Council of Medical Research (ICMR), New Delhi. The authors are thankful to the Sanjay Gandhi Post Graduate Institute of Medical Sciences, Lucknow, for providing various laboratory facilities and other assistance. We thank Ms. Sudha Talwar, Ms. Manorama Tripathi and Mr. Atul Pandey for providing technical support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.