Genetic structure of Indian populations based on fifteen autosomal microsatellite loci

Background: Indian populations, endowed with unparalleled genetic complexity, have received a great deal of attention from scientists the world over. However, the fundamental question of their ancestry, whether they are all genetically similar or exhibit differences attributable to ethnicity, language, geography or socio-cultural affiliation, is still unresolved. In order to decipher their underlying genetic structure, we undertook a study of 3522 individuals belonging to 54 endogamous Indian populations representing all major ethnic, linguistic and geographic groups and assessed the genetic variation using autosomal microsatellite markers.
Results: The distribution of the most frequent allele was uniform across populations, revealing an underlying genetic similarity. Patterns of allele distribution suggestive of ethnic or geographic propinquity were discernible in only a few of the populations and did not apply to the entire dataset, while a number of the populations exhibited distinct identities evident from the occurrence of unique alleles in them. Genetic substructuring was detected among populations originating from northeastern and southern India, reflective of their migrational histories and genetic isolation respectively.
Conclusion: Our analyses based on autosomal microsatellite markers detected no evidence of general clustering of population groups based on ethnic, linguistic, geographic or socio-cultural affiliations. The existence of substructuring in populations from northeastern and southern India has notable implications for population genetic studies and forensic databases, where broad grouping of populations based on such affiliations is frequently employed.

of the Indian population is attributed to incessant, historical waves of migration into India: the earliest by the Austric speakers around 70,000 years ago, followed by the Dravidian speakers from middle-east Asia and the Sino-Tibetan speakers from China and southeast Asia around 8000 to 10,000 years ago. The last major migration is believed to have occurred around 4000 years ago, in several waves of Indo-European speakers [2]. Earlier genetic studies of the diversity among extant Indian populations, which analyzed populations predefined on the basis of ethnicity, language, culture or geography, have inferred different levels of genetic relationship among population groups [3][4][5][6] that broadly support the theories of migration and assimilation of different populations. However, recent molecular analyses have also asserted genetic similarity across populations spread over diverse geographic regions of the country, revealing a gradation of genetic lineages that underscores the genetic correlation amongst populations [7,8].
The striking social attribute of the Indian populations is their strict practice of endogamy across all social ranks, which has resulted in the emergence of diverse population-specific social traditions and the formation of distinct linguistic dialects through the subsequent isolation of populations. Although uniparental, biallelic markers have deciphered the common major Paleolithic contributions [9], resolution of many sub-lineages is still awaited in order to decipher finer genetic signatures defining populations that have resisted admixture for centuries. Patterns of variation across recently diverged populations can be successfully characterized with fast-evolving microsatellite markers [10] [11] [12]. Genetic drift among small, isolated populations manifests as characteristic allele frequency patterns, which have recently been used to identify genetic clusters that corresponded well with predefined geographically or linguistically similar populations [13].
With these rationales, we have analyzed 15 highly polymorphic autosomal microsatellite markers, including 13 core forensic loci, which have been extensively used to reveal the ethnological and anthropological affinity of diverse populations. In order to determine whether geographic proximity and linguistic, ethnic and socio-cultural affiliations have played a role in the genetic differentiation of extant Indian populations, these markers were analyzed in 3522 individuals drawn from 54 endogamous populations representing major ethnic and linguistic groups spread across diverse geographic regions of the country (Table 1). The distribution of alleles across populations was evaluated to ascertain the presence of group-specific patterns, if any. The extent of molecular variance among pre-defined groups based on ethnicity, language, geography and socio-cultural hierarchy was evaluated to determine whether such classifications were supported genetically. In addition, a model-based clustering algorithm was applied to infer population groups differentiated by their characteristic allele frequencies and to detect the presence of cryptic population subdivisions.

Results
A number of alleles at the microsatellite loci analyzed were found to be unique to specific populations, with a discernible distribution along geographic and ethnic affiliations evident among only a few of the populations. Populations like the Gond (a tribal population) from Chattisgarh and the Irular, Chakkiliyar, Gounder and Pallar (Australoid populations) from the southern state of Tamil Nadu showed genetic isolation, evident from the presence of alleles confined within these populations (Figure 1). In contrast, allele 15.2 of the D3S1358 locus was found to be prevalent among the Gowda and Muslims in the state of Karnataka, and allele 18.2 of the FGA locus was present among the Thakur and Kurmi of Uttar Pradesh, exhibiting a regional distribution. Sharing of allele 24.2 of the FGA locus was also observed between the Lepcha and the Nepali of Sikkim, who share similar ethnic and geographic origins.
Significantly, most frequent alleles were shared among some ethnically and linguistically related populations. The populations of Sikkim, Lai and Lusei of Mizoram that shared Mongol ancestry had a high frequency of allele 12 of the D7S820 locus. Analogous results were obtained for allele 13 of the D5S818 locus, which was in high incidence amongst the Bhutia of Sikkim and Mara of Mizoram. The Indo-Caucasoids, Lingayat of Karnataka; Yadav and Baniya of Bihar, and the geographically proximate Australoid, Kurmi had allele 7 of the Penta E locus in high frequency. Allele 18 of the same locus was present in high frequencies among the Dravidian speaking Australoids, Gowda of Karnataka; Irular of Tamil Nadu as well as among the Indo-European speaking Indo-Caucasoids, Khandayat and Gope of Orissa.
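The allele distribution summaries described above (unique alleles and the most frequent allele per population) can be illustrated with a minimal sketch. It assumes genotypes for a single locus are available as (population, allele, allele) tuples; the function names, the population labels PopA and PopB, and the toy genotypes are hypothetical and are not taken from the study data.

```python
from collections import Counter, defaultdict

def allele_frequencies(genotypes):
    """Per-population allele frequencies for one locus.

    genotypes: iterable of (population, allele_a, allele_b) tuples,
    one tuple per diploid individual (assumed input layout).
    """
    counts = defaultdict(Counter)
    for pop, a, b in genotypes:
        counts[pop].update([a, b])
    freqs = {}
    for pop, c in counts.items():
        total = sum(c.values())
        freqs[pop] = {allele: n / total for allele, n in c.items()}
    return freqs

def private_alleles(freqs):
    """Alleles observed in exactly one population, keyed by that population."""
    carriers = defaultdict(set)
    for pop, f in freqs.items():
        for allele in f:
            carriers[allele].add(pop)
    result = defaultdict(set)
    for allele, pops in carriers.items():
        if len(pops) == 1:
            result[next(iter(pops))].add(allele)
    return dict(result)

def modal_allele(freqs):
    """Most frequent allele in each population."""
    return {pop: max(f, key=f.get) for pop, f in freqs.items()}

# Toy, made-up genotypes for two hypothetical populations.
toy = [
    ("PopA", "12", "12"),
    ("PopA", "12", "13"),
    ("PopB", "12", "15.2"),
    ("PopB", "12", "13"),
]
freqs = allele_frequencies(toy)
print(private_alleles(freqs))
print(modal_allele(freqs))
```

In this toy input both populations share the same modal allele while one of them carries an allele seen nowhere else, which is the kind of pattern the text describes; with real data, small sample sizes can make rare alleles appear private when they are not.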
Analysis of molecular variance (Table 2) failed to support the geographic, ethnic, linguistic or socio-cultural grouping of Indian populations, suggesting little variation between the different groups. We then employed a cluster-based algorithm to ascertain the extent to which the observed discrete patterns of allele distribution would delineate populations. In order to maintain uniformity of estimated probabilities across runs for a given value of K with large datasets [20], we initially used small K to analyze the 54 populations in this study and then subdivided the dataset into smaller groups to dissect the regional diversity.
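To make the idea of partitioning molecular variance concrete, the following is a simplified, single-level sketch of an AMOVA-style calculation. It is not the hierarchical analysis reported in Table 2: it assumes each sampled item is a numeric vector (for example, allele repeat counts), uses squared Euclidean distance, and splits variance only into among-group and within-group components. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def amova_phi_st(data, labels):
    """Single-level AMOVA sketch: returns (phi_st, var_among, var_within).

    data   : (n_items, n_features) array of numeric profiles (assumed encoding)
    labels : group label for each item
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    n_total = len(data)
    groups = np.unique(labels)
    n_groups = len(groups)

    def ssd(x):
        # Sum of squared deviations, computed as the sum of pairwise
        # squared distances over unordered pairs divided by n.
        diff = x[:, None, :] - x[None, :, :]
        d2 = (diff ** 2).sum(axis=2)
        return d2.sum() / (2 * len(x))

    sizes = np.array([(labels == g).sum() for g in groups])
    ssd_total = ssd(data)
    ssd_within = sum(ssd(data[labels == g]) for g in groups)
    ssd_among = ssd_total - ssd_within

    # Variance components from the mean squared deviations.
    var_within = ssd_within / (n_total - n_groups)
    n0 = (n_total - (sizes ** 2).sum() / n_total) / (n_groups - 1)
    var_among = (ssd_among / (n_groups - 1) - var_within) / n0

    phi_st = var_among / (var_among + var_within)
    return phi_st, var_among, var_within

# Toy example: two well-separated hypothetical groups.
phi, va, vw = amova_phi_st([[10], [11], [20], [21]], ["A", "A", "B", "B"])
print(round(phi, 3), va, vw)
```

A value of the among-group fraction near zero (or negative, which is usually read as zero) corresponds to the situation reported here, where the predefined groupings explain little of the variation. Significance is normally assessed by permuting group labels, which this sketch omits.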
In the countrywide dataset, at K = 5, associated with maximum posterior probability (Table 3), individuals displayed partial membership to multiple clusters with some populations exhibiting distinctive identities that did not correspond to geographic, linguistic or ethnic affiliation (Figure 2). Populations such as Thakur and Khatri from Uttar Pradesh and Baniya from Bihar showed similarity with southern populations such as Naikpod Gond and Chenchu from Andhra Pradesh and with a few individuals from Maharashtra and Lepcha of Sikkim. Populations from the northeastern state of Mizoram exhibited a distinct clustering, different from populations of similar ethnicity from Sikkim, while some individuals from Saora and Gope from the eastern state of Orissa shared a similar degree of membership as the Mizoram populations. Of the southern populations, those from Karnataka and Andhra Pradesh were differentiated into two groups with populations from Tamil Nadu exhibiting split membership to both groups.
Figure 1. Alleles with significant distribution among the different groups of India for the studied microsatellite markers. ❍ represents alleles occurring at a high frequency and □ denotes unique alleles present in a population.
At the regional level (Figure 3), amongst northern Indian populations, at K = 5, the value associated with the highest posterior probability, Thakur were identified to be distinct from Jat and Uttar Pradesh Kurmi. The Khatri were found substructured with a few individuals exhibiting membership similar to the Thakur.
In the East, Bihar Brahmin, Bhumihar, Kayasth, Rajput, Yadav, Bihar Kurmi, Orissa Brahmin, Khandayat, Karan, Juang and Paroja shared similar membership to multiple clusters, revealing a common genetic structure. Baniya of Bihar were found similar to two of the northern Indian populations, while Gope and a few individuals from Saora of Orissa shared identities similar to those of the populations from Mizoram of northeastern India.
The northeastern populations from Mizoram were identified to be distinct from those of Sikkim. Three clusters were evident with Hmar, Mara, Lai and Lusei of Mizoram all representing one group while Lepcha of Sikkim were distinct representing the second group and the third group comprised Nepali and Bhutia of Sikkim.
In the south, Lingayat, Gowda, Brahmin and Muslim of Karnataka along with Vanniyar, Gounder and Pallar of Tamil Nadu separated from the rest of the populations. Irular of Tamil Nadu and Yerukula of Andhra Pradesh presented distinct identities while Chenchu and Naikpod Gond of Andhra Pradesh exhibited similar affinities. The remaining populations from Tamil Nadu (Chakkiliyar, Paraiyar and Tanjore Kallar) and from Andhra Pradesh (Brahmin, Raju, Komati, Kamma Chaudhury, Kapu Naidu, Reddy and Lambadi) displayed mixed membership to multiple clusters.
Populations from western and central India showed absence of any distinct grouping with individuals having symmetrical membership across inferred clusters. The above results reveal genetic similarity across populations with a few presenting distinct identities that did not follow traditional groupings of geography, language or ethnicity. Populations from southern India and northeastern India largely exhibited structuring while most Indian populations shared similar membership in multiple clusters.
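One way to read such results is to average the individual membership coefficients within each population and check whether any single cluster dominates. The sketch below assumes a matrix of per-individual membership coefficients (rows summing to 1) such as a clustering program reports; the 0.7 cut-off, the function name and the toy values are hypothetical choices for illustration, not criteria used in this study.

```python
import numpy as np

def summarize_membership(q, pops, dominant=0.7):
    """Average membership per population and flag dominant clusters.

    q        : (n_individuals, K) membership coefficients, rows sum to 1
    pops     : population label for each individual
    dominant : hypothetical threshold above which the mean membership in
               one cluster is treated as a distinct identity
    """
    q = np.asarray(q, dtype=float)
    pops = np.asarray(pops)
    summary = {}
    for p in np.unique(pops):
        mean_q = q[pops == p].mean(axis=0)
        top = int(mean_q.argmax())
        summary[str(p)] = {
            "mean_q": mean_q,
            "top_cluster": top,
            "distinct": bool(mean_q[top] >= dominant),
        }
    return summary

# Toy membership values for two hypothetical populations, K = 2.
q = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.4, 0.6]]
pops = ["PopA", "PopA", "PopB", "PopB"]
for name, info in summarize_membership(q, pops).items():
    print(name, info["mean_q"], info["distinct"])
```

A population whose mean membership is spread roughly evenly across clusters corresponds to the "symmetrical membership" described for western and central India, whereas a population concentrated in one cluster corresponds to a distinct identity. Averaging can hide substructure within a population, so individual-level bar plots remain necessary.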

Discussion
Contemporary molecular studies on Indian populations have focused on uncovering the genetic relationships among geographically, linguistically or ethnically related populations [21][22][23][24][25]. Recently, a few studies involving a larger number of populations have correlated the genetic relatedness of the populations with linguistic [6] or socio-cultural affinities [3,5], though genetic uniformity across populations has also been widely observed [7,8]. The current study employs microsatellite markers to decipher allele frequency changes that would effectively detect recently isolated populations whose times of divergence are shorter than those detectable by uniparental markers. The distribution of alleles across the microsatellite loci studied predominantly demonstrates the occurrence of alleles unique to only a few populations (Figure 1). This pattern is probably the result of genetic isolation and drift experienced by populations that follow strict endogamous practices. The distribution of the most frequent allele was, in general, uniform across populations, suggesting their common origin. Earlier reports have also suggested that geographic contiguity favors gene flow among populations [26]. Although ethnic and geographic propinquity were discernible from the allele distribution patterns across a few populations in the current study, no consistent pattern across all populations of any particular group was observed. This was also evident from the analysis of molecular variance, which failed to support any grouping, whether ethnic, linguistic, geographic or socio-cultural, as contributing to the extant genetic structure of Indian populations.
The immense diversity within the ethnic and linguistic affiliations of the populations inhabiting India had always been a debatable issue, whether some of them had originated indigenously or were the results of earlier migrations [27][28][29]. The distinct grouping of the populations of Mizoram (Figure 3) does concord with earlier reports [4,30] that northeastern India was peopled by migration of Tibeto-Burman speakers from East Asia. However, Tibeto-Burman speaking populations of Sikkim grouped separately and exhibited considerable gene-flow with non-Tibeto-Burman speakers. It is probable that these two regions were peopled by different waves of migration from Southeast and East Asia. Interestingly, eastern Indian populations; Saora and Gope also exhibit similarities to the populations of Mizoram indicating shared genetic ancestry. Though the Lepcha were distinct at the highest-likelihood run for K = 4 (Figure 3), in other runs with lower K, they grouped with the rest of the populations from Nepal (data not shown).
Figure 3. Estimated population structure in different geographic regions. Bar plot estimation figures for North, East, Northeast, South, West, and Central were based on the highest probability run at that K.
The majority of the Indian populations exhibited extensive admixture, with each population displaying membership to multiple clusters. Populations such as Khatri, Baniya, Chenchu, Yerukula and Naikpod Gond, however, were substructured. Interestingly, populations from the southern Indian region exhibited substructuring, with a number of populations clustering into a separate group while the rest were found similar to the general Indian population structure. This group, comprising Iyenger Brahmin, Lingayat, Gowda and Muslim from Karnataka and Gounder, Vanniyar and Pallar from Tamil Nadu, probably represents populations that have resisted recent gene flow and accumulated characteristic allele frequencies because of genetic drift, leading to their differentiation from the rest of the populations. In addition, Irular of Tamil Nadu and Yerukula of Andhra Pradesh were found distinctive while Chenchu and Naikpod Gond of Andhra Pradesh grouped together. However, at lower K these populations grouped into clusters similar to those of Tanjore Kallar, Paraiyar and Chakkiliyar of Tamil Nadu and Brahmin, Raju, Komati, Kamma Chaudhury, Kapu Naidu, Kapu Reddy and Lambadi of Andhra Pradesh.

Conclusion
Our analyses failed to reveal any genetic groups that correlate with the language, geography, ethnicity or socio-cultural affiliation of populations. The absence of evidence of structuring of the Indian populations based on ethnic, linguistic, geographic or socio-cultural affiliations may, however, be related to ascertainment bias in the selection of these highly polymorphic forensic microsatellite markers. Future studies employing a large number of microsatellites/SNPs might yield higher resolution to decipher stronger associations between populations. The occurrence of a few populations distinct from the general populace suggests that genetic drift due to isolation of such populations has resulted in their characteristic allele frequencies. This cryptic population structure would have significant implications in forensic investigations, where computations of the statistical significance of a DNA match rely on ethnic identities often defined by the country of origin. The existence of substructuring in populations from northeastern and southern India also cautions against the broad grouping of populations based on geographic, ethnic or linguistic affiliation that is frequently employed in population genetic studies.

Methods
We used a model-based clustering method for inferring population groups using genotype data consisting of unlinked markers, as implemented in the Structure 2.1 program [33]. The model assumes there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned probabilistically to populations, or jointly to two or more populations if their genotypes indicate they are admixed. Each run used 100,000 estimation iterations for K = 2 to 8 after a burn-in of 20,000 iterations. Each run was carried out several times to ensure consistency of the results. Posterior probabilities for each K were computed for each set of runs.
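The step of turning per-K run results into posterior probabilities can be sketched as follows. This assumes the commonly used approximation in which, under a uniform prior on K, the posterior probability of each K is proportional to the exponential of the estimated log probability of the data for that K. The function name and the log-likelihood values are hypothetical and are not the values behind Table 3.

```python
import math

def posterior_over_k(log_likelihoods):
    """Approximate Pr(K | data) from estimated ln Pr(data | K).

    log_likelihoods: dict mapping K to its estimated ln Pr(data | K).
    Assumes a uniform prior over the K values supplied.
    """
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_likelihoods.values())
    weights = {k: math.exp(v - m) for k, v in log_likelihoods.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Made-up estimates for K = 2 to 6, for illustration only.
toy_lnp = {2: -1000.0, 3: -990.0, 4: -985.0, 5: -980.0, 6: -986.0}
post = posterior_over_k(toy_lnp)
best_k = max(post, key=post.get)
print(best_k, {k: round(p, 4) for k, p in post.items()})
```

Because the log probabilities differ by several units between values of K, nearly all of the posterior mass tends to fall on a single K. This is one reason repeated runs are compared for consistency rather than relying on one estimate.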