Genetic variation in South Indian castes: evidence from Y-chromosome, mitochondrial, and autosomal polymorphisms

Background Major population movements, social structure, and caste endogamy have influenced the genetic structure of Indian populations. An understanding of these influences is increasingly important as gene mapping and case-control studies are initiated in South Indian populations. Results We report new data on 155 individuals from four Tamil caste populations of South India and perform comparative analyses with caste populations from the neighboring state of Andhra Pradesh. Genetic differentiation among Tamil castes is low (RST = 0.96% for 45 autosomal short tandem repeat (STR) markers), reflecting a largely common origin. Nonetheless, caste- and continent-specific patterns are evident. For 32 lineage-defining Y-chromosome SNPs, Tamil castes show higher affinity to Europeans than to eastern Asians, and genetic distance estimates to the Europeans are ordered by caste rank. For 32 lineage-defining mitochondrial SNPs and hypervariable sequence (HVS) 1, Tamil castes have higher affinity to eastern Asians than to Europeans. For 45 autosomal STRs, upper and middle rank castes show higher affinity to Europeans than do lower rank castes from either Tamil Nadu or Andhra Pradesh. Local between-caste variation (Tamil Nadu RST = 0.96%, Andhra Pradesh RST = 0.77%) exceeds the estimate of variation between these geographically separated groups (RST = 0.12%). Low, but statistically significant, correlations between caste rank distance and genetic distance are demonstrated for Tamil castes using Y-chromosome, mtDNA, and autosomal data. Conclusion Genetic data from Y-chromosome, mtDNA, and autosomal STRs are in accord with historical accounts of northwest to southeast population movements in India. The influence of ancient and historical population movements and caste social structure can be detected and replicated in South Indian caste populations from two different geographic regions.


Background
The origins and genetic affinities of India's populations have been debated extensively [1][2][3][4][5][6]. Archaeological studies document human occupation of the subcontinent from the lower Paleolithic through the Neolithic, including a flourishing ancient civilization in the Indus Valley [7]. The historical record documents an influx of Vedic Indo-European-speaking immigrants into northwest India starting at least 3500 years ago [8][9][10][11]. These immigrants spread southward and eastward into an existing agrarian society dominated by Dravidian speakers [12]. With time, a more highly-structured patriarchal caste system developed [7,9,10]. India is now broadly characterized by Indo-European (e.g. Hindi, Urdu, and Punjabi) speaking populations found in the central and northern regions and by Dravidian (e.g. Tamil, Telugu, and Kannada) speaking populations in the southern and southeastern regions. The extent to which ancient and contemporary migrations, and the more recent inception of a hierarchical caste system, have influenced the genetic composition of modern Indian populations remains controversial.
A number of studies have addressed the genetic contribution of other Eurasian populations to Indian caste and tribal populations [1][2][3][6][13][14][15][16][17][18]. They have arrived at somewhat different conclusions regarding the origins of castes, their relationships to each other, and their relationship to populations outside India. These discordances can be attributed, in part, to differences in sampling strategies and the varied effects of gene flow between the typically endogamous castes and tribes [14][19][20][21].
Several trends regarding the origin and affinities of Indian populations have emerged. The predominantly south and east Asian mtDNA haplogroup M is found in more than half of individuals from a wide sampling of castes [5,6,13,22] and is nearly fixed in some Austro-Asiatic tribal populations [6]. This haplogroup is uncommon in western European populations [23,24]. In contrast, some paternally-inherited Y-chromosome lineages are more closely related to lineages originating in central Asians and Europeans [1,13,25,26]. Genetic distances estimated from autosomal polymorphisms have typically demonstrated that caste populations tend to occupy a position intermediate between European and East Asian populations [8][27][28][29].
The genetic affinities among the more than 2000 extant caste populations of India, however, are complex. Genetic distances between caste populations from the state of Andhra Pradesh, India, are correlated with differences in caste rank, suggesting that endogamy and differential inter-caste gene flow influence genetic structure [30]. Several studies have found a similar pattern [31][32][33], but others have not [6,34]. Higher rank castes may show closer affinity to European populations than do other caste populations [13]. Recent Y-chromosome data suggest a higher affinity between tribal populations and castes of lower rank [35].
These results support historical accounts of nomadic pastoralists from central and northwestern Eurasia integrating with existing local populations, and either introducing a system of social stratification or becoming members of the existing upper castes [8,9,35]. Yet, the occurrence of Y-chromosome haplogroups L, H, R2, and R1a in both caste and isolated tribal populations suggests much of the existing Indian population structure is very old [5]. Additionally, the high diversity of Y haplogroups R1a1 and R2 in both South Indian and Indus valley populations has led to the suggestion that there is little, if any, genetic influence from other Eurasians on the castes of South India [3].
A broad study of 24 castes from various locations throughout India concluded that genetic data were not congruent with "sociocultural" affinities due to high rates of gene flow [6]. Yet, this study and others [1,36] have suggested a clinal (north to south) contribution of central Asian Y-chromosomal lineages to caste populations. Due to well-established clines in gene frequencies across India, especially in the north-south direction [2,34,36], comparisons of castes from different geographic locations can conflate clinal variation with variation that may exist between local caste groups. Therefore, it is important to obtain large, carefully chosen samples from the same geographic locale to determine whether previous results [13] indicating caste-related genetic structure can be replicated in other regions of India [37]. Additionally, because single linkage groups such as the non-recombining region of the Y-chromosome or the mtDNA genome may be strongly influenced by genetic drift or selection, the use of a large number of independent autosomal polymorphisms can greatly improve the reliability of estimates of population relationships.
In this study we analyze four castes of different rank sampled from Tamil Nadu, the southern-most state of the Indian subcontinent. The genetic relationship among the Tamil castes, their relationship to castes from the neighboring state of Andhra Pradesh, and their affinity to other Eurasian populations are examined using Y-chromosome, mtDNA, and autosomal polymorphisms. We show that the genetic affinities between Indian castes from Tamil Nadu and other Eurasians are broadly congruent with patterns observed previously for castes from Andhra Pradesh. These results strengthen the conclusions drawn from our previous analyses regarding caste relationships in South India and suggest reproducible patterns regarding the genetic influence of ancient and historical events on the Indian caste system.
Some between-caste trends are suggested by the data. The F* lineage is found at higher frequencies in lower castes than in upper or middle castes. The R1a1 lineage occurs at a higher frequency in upper than in lower castes and differs significantly in frequency between the Andhra upper and Andhra lower castes (p < 0.05). These trends appear in castes from both Tamil Nadu and Andhra Pradesh. Lineage H also reaches substantial frequency in the Tamil lower caste but is less common in upper and middle castes. Lineage J2, previously shown to be distributed in a northwest to southeast gradient [3], was present in all castes but not correlated with caste rank.
Tamil castes are characterized by high frequencies of mitochondrial M and N super-family lineages, and all South Indian lineages could be assigned to either M or N clades (Tables 3 and 4). Both major haplogroup super-families are deep-rooting in South Indian populations, with diversity estimates for N (0.01589, n = 63) exceeding that for M (0.01044, n = 92), based on HVS1 data. In contrast to the South Indian mtDNA haplogroup pool, the eastern Asian and European groups have predominantly either M or N lineages, respectively. High diversity and deep-coalescence dates (> 40 K ybp) for both major mtDNA super-families are consistent with an ancient and continuous presence of populations in South India that greatly predates the documented history of the caste system.

mtDNA haplogroups
To further examine potential western and central Eurasian contributions to South Indian castes, mitochondrial U lineages, defined by coding variant 12308G, were analyzed in greater detail (Table 5). U haplogroup subtypes were assigned using key HVS1 variants as previously described [4,38]. South Asian lineages U2a and U2c are common in Tamil and Andhra castes. U7 is the most prevalent U lineage in Tamil and Andhra castes. U7 is also common in Iran, Pakistan, and northern India, [39] suggesting an affinity between Dravidian populations from South India and populations to the north and west. A comparison of HVS1 for U7 haplogroups (10) with Indian/Pakistani HVS1 sequences available in the mtDB database (4) revealed similar but non-identical motifs, suggesting ancient rather than very recent gene flow between northwestern and southern India. A notable This trend is also present in the Andhra sample, but it is not significant.

Genetic distances
We calculated genetic distances between Tamil castes, Europeans, and East Asians and compared these results to those from upper, middle, and lower caste groups from the neighboring state of Andhra Pradesh. The genetic distance estimates reveal several distinct patterns (Table 6).
For Y-chromosome polymorphisms, all castes have smaller distances to Europeans than to eastern Asians. For mtDNA polymorphisms, all castes have smaller distance estimates to eastern Asians than to Europeans. For Y-chromosome data, the genetic distance estimates to the Europeans are ordered by caste rank. These trends appear in castes from both geographic regions.
A neighbor-joining network depicts the between-population relationships based on Y-chromosome data (Figure 2). The NTS Upper caste is more closely related to the Andhra Upper caste than to the other Tamil castes, a finding consistent with a common language (Telugu) shared by the NTS Upper and Andhra upper castes. All castes are closer to Europeans than to eastern Asians, and basal haplogroup R is common, especially in the upper castes and Europeans. The inset, however, shows that haplogroups derived from R are not commonly shared between this sample of Europeans and southern Indians. Affinity between the groups is driven largely by basal characters (R, F* and H) that have contrasting frequency patterns.
A neighbor-joining network based on distance estimates from 45 STRs shows a greater affinity of all castes to Europeans than to eastern Asians (Figure 3). With the exception of the NTS Upper (Telugu and Kannada speaking) Brahmins, castes of similar rank from different geographic locations tend to branch at similar locations within the network. Within each geographic region, the distances to other Eurasians (both Europeans and East Asians) increase with decreasing caste rank.
The network based on mitochondrial distance estimates shows little between-caste rank organization, yet reveals the greater affinity of all castes to eastern Asians for maternal lineages (Figure 4). Basal U haplogroups are less frequent in lower rank castes from both southern India locations. The inset shows that only a few high-resolution U haplogroups (U5, K) are shared between Europeans and South Indians.

Genetic structure
The proportion of genetic variation distributed within and between South Indian castes was assessed by an analysis of molecular variance (AMOVA) (Table 7). The Tamil South Indian castes are only modestly differentiated from one another: 0.96% of STR variance occurs between Tamil castes. A similar value of 0.77% for between-population (caste) difference is observed in the Andhra castes. A smaller fraction, 0.12%, is attributable to geographic differences between Tamil and Andhra locations and was not significantly different from zero. Removal of the NTS Upper caste from the comparison yielded a non-significant but higher value of 0.28%. These findings, based on multiple unlinked loci, suggest that social structure has had a larger impact on caste population structure in these South Indian samples than geographic separation.

Map of South India
Y-chromosome and mtDNA estimates of molecular variance between caste samples from either Tamil Nadu or Andhra Pradesh also exceed the estimate for between-group variation for the two geographic regions. Between-caste variation for mtDNA in Tamil populations is greater than that for Andhra populations. This may be partly due to regionally high female mobility in Andhra castes, as previously reported [20,30]. As expected, for all genetic systems, the vast majority of all variation occurs within populations.
The degree of population subdivision among Indian castes was estimated using a model-based clustering method implemented in STRUCTURE (ver. 2.1). The best estimate of the number of clusters (K) was consistently one for the Tamil Indians. The best estimate of K was also one for the Tamil and Andhra castes together. This result indicates that individuals from castes spanning the Indian social hierarchy from two independent geographic regions are not sufficiently differentiated to allow clustering into groups based on genetic data from 45 STR polymorphisms alone. This finding is consistent with the low RST values for these populations but may also reflect the limited power of 45 STRs to distinguish such closely related populations. Estimates for heterozygosity and repeat variance in these populations also indicate no substantial between-caste differences or excess homozygosity in these caste groups (Table 8).
We evaluated the correlation between caste rank and genetic distance using a Mantel test (Table 9). For each test, we calculated the correlation between the pairwise genetic distance matrix and the pairwise caste rank distance matrix for the Tamil caste individuals. For Tamil-speaking populations, all genetic systems produced low, significant positive correlations. Y-chromosome haplogroup data yielded the highest positive correlation with caste rank (ρ = 0.26, p <

Discussion
Using a geographically well-defined sample of caste populations from Tamil Nadu, India, this study arrives at many conclusions similar to those from our previous studies of caste populations from Andhra Pradesh, India [13,20,30]. In both cases, there is extensive sharing of Y and mtDNA haplogroups among castes, and the overall level of inter-caste differentiation is low. This finding is consistent with many other studies of genetic structure and gene flow patterns among caste populations [6,32,33,40].
Paternally-inherited Y-chromosome SNPs show that caste populations have greater affinity to a sample of Europeans than to a sample of eastern Asians. Unlike the Y-chromosome data, maternally-inherited mtDNA polymorphisms demonstrate a contrasting pattern: castes, regardless of rank, have higher affinity to eastern Asians than to Europeans. These patterns were present in samples from both geographical locations, suggesting that South Indian paternal lineages have been more substantially influenced by western or central Eurasians than South Indian maternal lineages have. Unlike in our previous study of Andhra castes [13], direct haplogroup sharing between Tamil castes and our sample of Europeans is more limited, suggesting a potentially greater time depth for the development of these patterns. More extensive sampling will be required to resolve this difference.
Using Y-chromosome data, Tamil castes of different rank have differential affinities to our sample of Europeans, with upper castes demonstrating greater affinity than lower castes. Genetic distances are weakly correlated with caste rank distances and correlations from Y-chromosome data are stronger than correlations based on mtDNA or autosomal data. This pattern argues for a differential contribution of male lineages to castes of different rank and limited male mobility between castes in South India.
An interesting difference between the data sets from Andhra Pradesh and Tamil Nadu is also observed. For the former sample, inter-caste distance based on mtDNA polymorphisms (HVS1 sequence) demonstrated a strong relationship to caste rank, while distances based on Y-chromosome data did not. This was interpreted as evidence of historical upward female mobility in the caste system [30]. (We note, however, that the primary reason for a lack of correlation between Y-chromosome distances and caste rank was close affinity between the upper-caste Brahmin and lower-caste Relli samples [20].) In contrast, the Tamil Nadu samples show a higher correlation between Y-chromosome distances and caste rank than between mtDNA distances and caste rank. This difference likely reflects differential apportioning of individuals as
In a study of broadly distributed Indo-European and Dravidian castes, Sengupta et al. (2006) suggested that the majority of Indian Y-chromosome haplogroups are at least 10,000 to 15,000 years old as gauged by Y-chromosome microsatellite diversity, thus predating the origin of the caste system. The antiquity and complex geographic distribution of the R1a1 and R2 haplogroups led these authors to conclude that the majority of the subcontinent Y-chromosomes arrived in or before the early Holocene (10,000 years ago) rather than in a later Indo-European expansion. Likewise, and concordant with other studies of tribal Indian populations [5], we observe Y-chromosome R1a1 lineages in South Indian tribal Irula (unpublished data), a population substantially differentiated from South Indian castes [18]. Indian mtDNA lineages demonstrate high diversity, suggesting that a majority of Indian maternal lineages are also relatively old and likely predate historically documented expansion events [38,42].
Older, deep-rooting mitochondrial lineages belonging to the N macrolineage are prevalent in western Eurasia and are distributed in a west-east cline, with high frequencies in Anatolia and Iran and moderate frequencies in Pakistan and northwestern India [43]. In this study we observe higher frequencies of basal This may indicate differences in founding populations. More likely, though, it may suggest ancient migration and integration of various U haplogroups into different pre-caste populations with subsequent, non-uniform lineage sorting and differentiation over time. In contrast, and consistent with early human expansion across South Asia, the predominantly Asian M clade mitochondrial haplogroups account for more than half of all Indian mitochondrial lineages and reach their highest frequencies in lower caste and tribal groups [6,13].
While Y-chromosome and mtDNA polymorphisms yield valuable information, it must be borne in mind that they each represent a single linkage group. Estimates based on these systems are thus subject to a high level of stochastic variability [44,45]. In addition, the Y-chromosome and mtDNA may both have been affected by natural selection, [46,47] which can further complicate the interpretation of population history. Coalescence dates based on these systems must also be viewed with appropriate caution, in part because of their large confidence intervals. More importantly, a coalescence date is not necessarily a reliable indicator of the founding date of a population [45] because these dates are affected by the size of the founder population and by subsequent gene flow patterns. To gain a more complete and reliable portrait of population history, multiple, independent autosomal polymorphisms should also be examined.
Our analysis of 45 unlinked autosomal STRs reveals that in Tamil Nadu, genetic distances between castes are positively correlated with caste rank. A similar pattern was detected in upper, middle, and lower rank castes of Andhra Pradesh using these STRs [20] and Alu and L1 insertion polymorphisms [13]. An analysis of the Kallar, Vanniyar, and Pallar castes, which also reside in Tamil Nadu, showed that upper-lower caste distance estimates (0.0553) exceeded those for upper-middle castes (0.0329) and middle-lower castes (0.0515) [40]. Majumder et al. [37,48] presented Y-chromosome, mtDNA, and autosomal data from several caste populations in Uttar Pradesh. Subsequent analysis indicated that caste rank was correlated with genetic distance for all three types of systems [20]. Similar correlations have been observed in a number of other studies of Indian populations [31,33,49]. A relatively greater affinity between upper-caste populations and Europeans has been observed for autosomal polymorphisms in our Andhra Pradesh and Tamil Nadu samples and in a number of other analyses of autosomal data [6,50,51].
Although significant correlations between caste-rank and genetic distances are apparent, model-based clustering algorithms did not detect structure within the Tamil or Andhra populations. We suggest that this finding results from the low amount of differentiation between all caste groups but also from a lack of sufficient power in 45 unlinked STRs to detect high-resolution population structure. With ~250 K SNPs typed in a subset of the Andhra upper and Andhra lower castes, individuals can be clustered into these population groups using genotype information alone [52]. Likewise, using > 950 K SNPs, the Tamil upper and Tamil lower castes demonstrate group-specific clustering by principal component analysis (unpublished data).
Considering the complex history of Indian populations, it is not surprising that some studies demonstrate an association between caste rank and genetic distance, whereas others do not. A recent study of 15 geographically dispersed Indian populations residing in the United States using 1200 markers found little evidence for caste or geographic structure [53]. However, sampling strategy (relocated vs. in situ) or other factors, such as a very wide geographic dispersion of the study populations, may confound correlations if they exist. Admixture and gene flow can also vary substantially between caste populations in the various regions of India. Linguistic differences may influence the genetic structure of local caste populations [34]. The linguistically different NTS Upper caste Brahmins showed several differences in comparison to the other Tamil castes in this analysis. Yet, because Indian populations show only a small amount of genetic differentiation [17,53], a large number of autosomal loci will be necessary for adequate power to detect consistent patterns of variation if they are present [54,55]. Ancestry-informative autosomal polymorphisms, high-density genotyping, and extensive population sampling will provide better resolution of the relationships between Indian and other Eurasian populations.
The results presented here underscore the complexity of the Indian caste system. Although other interpretations may be possible, our data are consistent with a model in which nomadic populations from northwest and central Eurasia intercalated over millennia into an already complex, genetically diverse set of subcontinental populations. As these populations grew, mixed, and expanded, a system of social stratification likely developed in situ, spreading to the Indo-Gangetic plain, and then southward over the Deccan plateau. A strong patrilineal social structure, accompanied by a developing practice of caste endogamy, may have contributed to an asymmetric apportioning of Y-chromosome, autosomal, and to a lesser extent, mtDNA lineages. Remnants of these patterns can still be detected in some of the inhabitants of peninsular South India.

Conclusion
Genetic variation between South Indian castes from Tamil Nadu is low (RST = 0.0096). Tamil caste Y-chromosomes and STR alleles are more similar to Europeans than to eastern Asians, and genetic distance estimates to Europeans are ordered by caste rank. In contrast, Tamil caste mtDNA shows greater similarity to eastern Asians than to Europeans. Low, but statistically significant, correlations between genetic distance and caste rank can be demonstrated for the Tamil-speaking populations. These patterns likely reflect asymmetric influences of ancient and historical processes on the caste system as it developed. These findings provide a general replication of our analysis of ranked castes from the neighboring state of Andhra Pradesh, India [13]. For the caste populations analyzed here, between-caste genetic differentiation exceeds that due to geographic (between-state) differentiation, a finding that may be of considerable interest when initiating linkage mapping [56] and case-control association studies in South Indian populations.

Study Subjects
Study subjects were recruited from four caste groups in Tamil Nadu, India. Tamil [13].

Data collection
DNA was extracted from venous blood using standard procedures. Hypervariable sequence 1 (HVS1), corresponding to base pairs 16000-16410, was amplified by PCR and sequenced using BigDye 3.1 dye-terminator fluorescent sequencing chemistry and an Applied Biosystems (ABI) 3100 automated sequencer.
To allow a direct comparison of Y-chromosome haplogroups from Tamil Nadu castes to those from Andhra Pradesh castes, we typed individuals from Andhra Pradesh for 26 of the 32 lineage-defining SNPs. A Y-haplogroup was assigned to each sample by the presence of one or more derived-state alleles, and the remaining alleles were inferred. This SNP panel allowed further refinement of the haplogroups previously reported for the Andhra Pradesh samples [13,30].

Data analysis
Haplogroups for the Y-chromosome (32 SNPs) and mtDNA (32 SNPs and 411 bp HVS1 sequence) were assigned using SNP data. Mitochondrial haplogroups were assigned based on the most probable consensus of polymorphic changes or resolved using previously published mtDNA HVS1 motifs as a guide [62]. Thirty-one exceptions to the canonical mtDNA phylogeny occurred on 27 mtDNA haplogroups, and these haplogroups with recurrent mutations were assigned to the most likely haplogroup based on HVS1 sequence data [4,6]. The variant 7598A, defining mtDNA lineage M-E, was found in two Tamil individuals and one Andhra individual who share identical HVS1 motifs but lack the preE 4491A variant. Between-caste haplogroup differences were evaluated for significance using Fisher's exact test.
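As an illustration of the between-caste haplogroup comparison, the following is a minimal sketch of a two-sided Fisher's exact test on a 2 × 2 table (carriers vs. non-carriers of one haplogroup in two castes). It is not the software used in this study; the function name and the counts in the usage lines are hypothetical and chosen only to show the calculation.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are two castes; columns are carriers / non-carriers of a haplogroup.
    The p-value sums the hypergeometric probabilities of every table with the
    same margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts (NOT data from this study): haplogroup carriers and
# non-carriers in an upper-caste and a lower-caste sample.
p_value = fisher_exact_two_sided(12, 18, 4, 26)
print(f"two-sided p = {p_value:.4f}")
```

An exact test suits this setting because many haplogroups are rare, so expected cell counts are often too small for a chi-square approximation.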
Diversity estimates (FST, RST, and AMOVA) for Y-chromosome, mtDNA, and autosomal STRs were calculated using the ARLEQUIN 3.0 software package [65]. AMOVA statistics were evaluated for significance by comparison to an empirical distribution generated by random permutation of genotypes or haplogroups. A general age estimate for mtDNA coalescent dates was calculated by the method of Nei [66] using a substitution rate of 2 × 10^-7 substitutions/site/year [67].
Model-based analyses of population structure were performed using the STRUCTURE program [68]. An estimate of the optimal number of clusters (K) for the four Tamil castes was obtained from the posterior probabilities of K, P(X|K), averaged over 10 runs for each value of K. A uniform prior probability distribution was assumed on K = {1...n}, and burn-in and iterations were set to 10,000 each for estimating the best K. Estimates of proportionate membership to three clusters were averaged values from 10 independent STRUCTURE runs. Population admixture and correlated allele frequencies were used in all analyses.
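The selection of K can be sketched as follows. With a uniform prior on K, the posterior probability of each K is proportional to P(X|K), so the log-probabilities reported by the clustering runs can be normalized directly. This is only an illustrative sketch: averaging ln P(X|K) across runs before normalizing is an assumption about how the run averages were combined, and the log-probability values below are invented placeholders, not output from this study.

```python
from math import exp


def posterior_over_k(mean_ln_prob):
    """Convert mean ln P(X|K) values into Pr(K|X) under a uniform prior on K.

    mean_ln_prob maps each K to its ln P(X|K) averaged over replicate runs.
    The largest value is subtracted before exponentiating so that very
    negative log-probabilities do not underflow to zero.
    """
    top = max(mean_ln_prob.values())
    weights = {k: exp(v - top) for k, v in mean_ln_prob.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Placeholder ln P(X|K) values for replicate runs (hypothetical numbers).
runs = {
    1: [-20510.2, -20511.0, -20509.8],
    2: [-20540.7, -20562.3, -20551.1],
    3: [-20601.4, -20633.9, -20618.0],
}
mean_ln = {k: sum(v) / len(v) for k, v in runs.items()}
posterior = posterior_over_k(mean_ln)
best_k = max(posterior, key=posterior.get)
print(best_k, posterior)
```

Because the normalization works on exponentiated log-probabilities, even modest differences in ln P(X|K) concentrate nearly all posterior mass on a single K.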
The correlation between genetic distance and caste rank was assessed by Mantel matrix tests using Spearman's rank correlation. For all possible pairs of caste individuals, inter-individual genetic distance estimates were calculated using DNADIST (Y and mtDNA) [69] or the Dsw program (STRs) [70]. Next, each individual was assigned a ranking (1, 2, or 3) for upper, middle, and lower caste status. The difference in caste rank was calculated for all possible pairs of caste individuals, yielding a full pair-wise matrix (155 × 155, or 118 × 118 for Tamil-speakers only) of ordinal values (0, 1, 2). Spearman's rank correlation between the genetic distance (Y-chromosome, mtDNA, or autosomal STRs) matrix and the caste rank difference matrix was calculated. A significance level for the correlation was determined by comparing the actual correlation to a distribution of correlations generated by 10,000 random columnar permutations.
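The procedure can be sketched in a few lines. The sketch below is not the code used in this study: it assumes the standard Mantel permutation scheme, in which the individuals of one matrix are relabelled (rows and columns permuted together), and it computes a one-sided p-value. All function names and the six-individual toy data are hypothetical.

```python
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; applied to ranks it gives Spearman's rho."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def upper_triangle(matrix, order):
    """Flatten the upper triangle after relabelling individuals by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_spearman(genetic, rank_diff, permutations=10000, seed=0):
    """Return (rho, one-sided p) for two symmetric pairwise matrices."""
    identity = list(range(len(genetic)))
    y = average_ranks(upper_triangle(rank_diff, identity))
    observed = pearson(average_ranks(upper_triangle(genetic, identity)), y)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # relabel individuals in the genetic matrix only
        x = average_ranks(upper_triangle(genetic, perm))
        if pearson(x, y) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data (hypothetical): six individuals with caste ranks 1, 2, or 3, and a
# genetic distance that is simply proportional to the rank difference.
caste = [1, 1, 2, 2, 3, 3]
rank_diff = [[abs(a - b) for b in caste] for a in caste]
genetic = [[0.1 * v for v in row] for row in rank_diff]
rho, p = mantel_spearman(genetic, rank_diff, permutations=999)
print(rho, p)
```

Permuting individuals rather than individual matrix cells preserves the dependence among pairwise distances that share an individual, which is why significance is assessed by permutation rather than by a standard correlation test.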