- Research article
- Open Access
Rare intronic variants of TCF7L2 arising by selective sweeps in an indigenous population from Mexico
BMC Genetics volume 17, Article number: 68 (2016)
Genetic variations of the TCF7L2 gene are associated with the development of Type 2 diabetes (T2D). The associated mutations have demonstrated an adaptive role in some human populations, but no studies have determined the impact of evolutionary forces on genetic diversity in indigenous populations from Mexico. Here, we sequenced and analyzed the variation of the TCF7L2 gene in three Amerindian populations and compared the results with whole-exon-sequencing of Mestizo populations from Sigma and the 1000 Genomes Project to assess the roles of selection and recombination in diversity.
The diversity in the indigenous populations was biased to intronic regions. Most of the variation was low frequency. Only mutations rs77961654 and rs61724286 were located on exon 15. We did not observe variation in intronic region 4–6 in any of the three indigenous populations. In addition, we identified peaks of selective sweeps in the mestizo samples from the Sigma Project within this region. By replicating the analysis of association with T2D between case-controls from the Sigma Project, we determined that T2D was most highly associated with the rs7903146 risk allele and to a lesser extent with the other six variants. All associated markers were located in intronic region 4–6, and their r2 values of linkage disequilibrium were significantly higher in the Mexican population than in Africans from the 1000 Genomes Project. We observed reticulations in both the haplotypes network analysis from seven marker associates and the neighborNet tree based on 6061 markers in the TCF7L2 gene identified from all samples of the 1000 Genomes Project. Finally, we identified two recombination hotspots in the upstream region and 3’ end of the TCF7L2 gene.
The lack of diversity in intronic region 4–6 in Indigenous populations could be an effect of selective sweeps generated by the selection of neighboring rare variants at T2D-associated mutations. The survivors’ variants make the intronic region 4–6 the area of the greatest population differentiation within the TCF7L2 gene. The abundance of selective peak sweeps in the downstream region of the TCF7L2 gene suggests that the TCF7L2 gene is part of a region that is in constant recombination between populations.
Type 2 diabetes (T2D) is a leading global health problem. The diabetic population is estimated to increase from 366 million to 552 million by the year 2030. T2D-related genetic variants explain only a small portion of the heritability of T2D. Ethnic differences in the prevalence of T2D may be attributable to distinct genetic backgrounds of these populations. High-throughput sequencing and genotype data for ethnic groups other than those in which most of the known variants have been discovered will enable the discovery of new variants and an exploration of their role in T2D . Genetic variants of TCF7L2 have been associated with T2D in nearly all ethnic groups studied [2–7]. TCF7L2 is a crucial transcription factor of the Wnt signaling pathway; this pathway plays a role in development and comprises a complex network of protein interactions that regulate cellular intercommunication at multiple levels . TCF7L2 is most strongly associated with T2D in populations of Caucasian origin . Subsequent whole gene analyses of different populations, such as Chinese [9, 10] and Mexican American populations , have revealed other variants associated with T2D. In addition, polymorphisms in the TCF7L2 gene that are associated with T2D exhibit patterns of variation and divergence in concordance with natural selection in Caucasian populations , but the relationship of this pattern to metabolic characteristics is not completely understood . The effect of natural selection on this gene in a Latin population is not known. Therefore, in this study, we investigated new variants in the TCF7L2 gene in a Mexican population and analyzed the selective sweep patterns of the TCF7L2 gene polymorphisms using exome-sequencing data from the Sigma Project and 1000 Genomes Project.
Mutations in the Mexican population
TCF7L2 is a transcription factor in the Wnt signaling pathway and is expressed in many tissues, including fat, liver and pancreatic islets of Langerhans . Genetic variations of the TCF7L2 gene, either in introns or exons, have been associated with T2D, and these markers have been observed in several individuals from different populations, including the Mexican population [2, 15, 16]. Risk alleles associated with gestational diabetes have also been identified in the Mexican population, but the degree of Amerindian ancestry in the subjects of these studies was unknown . Therefore, to estimate the variability of the TCF7L2 gene in a Mexican population with high Amerindian ancestry, we sequenced three groups of indigenous people and one group of mixed race from Mexico (See Methods) comprising 60 Mazatecan, 34 Nahua, 30 Zapotecan and 30 Mestizos.
Although the primers were designed primarily for exons of the TCF7L2 gene (See Methods), most of the variation identified in the Mexican population occurred in introns, and only six variants were shared with the 1000 Genomes Project and the Sigma Project (Table 1). We scanned the diversity of the TCF7L2 gene in 4,060 exomes (see Methods) and identified only 80 exonic variants, corresponding to a nucleotide diversity of 0.00147 (data not shown). The low diversity is also reflected in the allelic frequency of the rs77961654 (C/A) missense mutation in exon 15, which was observed in the Indigenous (Mexican) population, the 1000 Genomes Project, and the Sigma Project (both cases and controls). The minor allelic frequency (MAF) of rs77961654 (C/A) was lower in the Sigma Project (0.0896552) than in the 1000 Genomes Project (0.129793). Similarly, the MAF of the rs61724286 (C/G) polymorphism in exon 15 of the TCF7L2 gene in the Indigenous population (Table 1) was lower in the Sigma Project (0.000985222). The low MAF of the variants in exon 15 could suggest a purge of diversity in this region; however, the intronic variants identified in the Indigenous populations, which were also observed in the 1000 Genomes Project, exhibited a heterogeneous MAF. In particular, variants rs176632 (T/C) and var_10_114919595 (C/T) in intron 9 exhibited MAFs of 0.992412 (high) and 0.002396 (low), respectively. This heterogeneity in MAF (0.1984 and 0.000399) was also observed in the variants rs56913138 (G/T) and rs188695269 (G/A) in intron 6 (Table 1) and could be a footprint of selective processes of ancestral variants.
The majority of mutations identified in the Indigenous population were new and were located in noncoding regions (Table 2), particularly in introns 4, 6, 7, 11, 13, and 15, it was not possible to calculate the MAF based on the 1000 Genomes Project or Sigma Project. Due to the low number of mutations observed in the Indigenous population, these variations can be referred to as rare genetic variants. It has been hypothesized that mutations with a MAF of 0.5 – 1 % (lower frequency) and rare variations explain a substantial fraction of the heritability of common, complex diseases . In the exome-sequencing, the excess of rare variations was consistent with explosive human population growth , and most of the mutations with functional impact were maintained by weak purification . We assessed the functional impact of two exonic variants (rs77961654 and rs61724286) in the Indigenous population (see Methods) and did not find a deleterious impact of the TCF7L2 gene in either case. The lack of deleterious mutations (modify the function of the protein) suggests some type of selection inside of the gene.
Selection and recombination
To observe selection effects, we analyzed the sweep patterns in a region that included 1 megabase around the TCF7L2 gene using data from the Sigma Project (see Methods). The selective sweeps predict the pattern of excessive linkage disequilibrium (LD) in each of the two regions that flank a recently fixed beneficial mutation by strong positive selection . The Cross Population Extend Haplotype Homozygosity Test (XPEHH) is highly powerful in detecting those with approximately fixed or fixed selected loci  and given the association of LD patterns and selective sweeps, we calculated statistics r2 and XP-EHH (see Methods) and we found that the values significant of r2 (higher > 0,6) in Latino population, flanked to regions with a strong positive selection (values significant of XP-EHH). It is worth mentioning that most of selected regions only in the Latino population (values positive of XP-EHH) were localized in regions adjacent to TCF7L2 gene. The upstream region included the genes MIR4295 and VTI1A, while the downstream region harbored the genes DCLRE1A and NHLRC2. This downstream region was located within of a block hard selection sweeps, which covers several genes involved in metabolism, for example, genes PLEKHS1 and MIR4483. However, the values significant of r2 were found in a region of soft selection sweeps in TCF7L2 gene (Additional files 1, 2 and 3) . In a context of disease association, the r2 statistic is often used in calculations of power to detect disease-susceptibility loci .
In making the case–control association study using the data from the Sigma Project (see Methods), we found several risk alleles reported before (Fig. 1). For example, the rs7903146 risk allele exhibited a high association with T2D and has been associated with Mexican and Asian populations [1, 24]. Other possible risk alleles (rs7896811, rs11196199, rs114863326, rs11196203, rs72826075 and rs35011184) were located within the cutoff limits from the Manhattan plot, and although no studies have linked these mutations with T2D, a duplication of this region (including the mutations described above) has been associated with intellectual and developmental disabilities . The values of r2 in the intronic regions (1 to 4) of the TCF7L2 gene were less than 0.4 when we used only the African population. By contrast, we observed an increase in the values of r2 (Fig. 1b) in the same region when we used only the American population, and the range of values of r2 was 0.1 to 1. This increase in r2 could be related to the association with T2D. Moreover, due to the characteristics of the Sigma Project samples (ancestry controlled), the correlation between the loci associated with T2D in the intronic regions could be under selective constraints in the Amerindian populations. The TCF7L2 gene is located between two recombination hotspots (solid blue line in Fig. 1a and b). The first hotspot is upstream of the TCF7L2 gene and begins in the VTI1A and LOC103344931 genes. The second hotspot is located in exon 14 of the TCF7L2 gene. Most of the intronic variants from the Indigenous population (Fig. 1c) were located in the second recombination hotspot; however, no selective sweep peak was observed in this hotspot. The sweep peaks in the TCF7L2 gene were located in exon 5 and intronic region 5–6 (Fig. 1c). The exon 5 peak, was flanked by the r2 significant values and it was found in a adjacent region with XP-EHH significant values (Additional files 1, 2 and 3). These sweep peaks were previously associated with T2D and could explain the decrease in variability in the Sigma Project samples and the lack of variability in the Indigenous population in this particular region.
Across the entire region analyzed we observed several recombination hotspot peaks (solid blue line in Fig. 1c); however, we identified overlap between the selective sweep peaks (identified for several methods, Additional files 1, 2 and 3) and the recombination hotspots (elliptic dashes in Fig. 1c). The first zone of overlap was located between the upstream and downstream regions of the ZDHHC6 gene and included two hotspots. The second zone of overlap was larger than the first and contained the HABP2, NRAP, CASP7, PLEKHS1, MIR4483, DCLRE1A and NHLRC2 genes. This zone also included two recombination hotspots. The segment delimited by both zones had hits with a GWAS database (Fig. 1c), including associations with breast cancer, T2D and serum dimethylarginine levels. The upstream region of the HABP2 gene is associated with migraines without aura, and the intergenic region between the PLEKHS1 and MIR4483 genes is associated with obesity − related traits. These associations with disease may be due to lower frequencies or the action of purifying selection to maintain the sweep patterns obtained from the recombination.
Genomic distribution of mutations in the Indigenous population
Purifying selection acts not only in the coding regions of the TCF7L2 gene but also in the introns. As a result of selection, we observed only 1,137 variants along two megabases of chromosome 10 (including TCF7L2) of the Sigma Project exomes (data not shown). We observed a decrease in the variability of the TCF7L2 gene (80 variants) in the Indigenous population. The localization of SNPs along TCF7L2 gene was not constant, i.e., regions with low and high density of mutations were observed. The last exons (6–14) of the gene harbored the higher density of SNPs (Fig. 2a). We observed the same distribution of SNPs in the Indigenous population as in the Sigma samples; most of the variants were located in intronic areas. Therefore, we searched the intronic SNPs in RNA databases (see Methods). We did not observe hits of the intronic variants against human mRNAs, despite the higher density poly(A)-sequencing in the last exon of the TCF7L2 gene (Fig. 2b), where the variation in the Indigenous population was concentrated. The same result was obtained when we searched intron variants in the regulatory sites database at http://mirdsnp.ccr.buffalo.edu/browse-genes.php. However, more than one causal variant associated with T2D was located in introns 4–6; for example, the rs7903146 risk allele (described above) had an effect on enhancer activity, resulting in higher TCF7L2 gene expression . Comparing the sequence of the TCF7L2 gene against the Neanderthal genome (http://neandertal.ensemblgenomes.org/Homo_sapiens/Location/Genome) (Fig. 2c) revealed a random distribution of variants. The intronic region (4 to 6) provided an agglomeration of selected mutations (selection scanning). This agglomeration may suggest selection before the Neanderthal-human speciation. At the human population level, the intronic region of 4 to 6 exhibited greater differentiation in SNP patterns (bottom of Fig. 2d). This differentiation could be the product of the action of selective sweeps in fixing low-frequency mutations associated with T2D.
Haplotype patterns and phylogenetic networks
Allelic frequencies are subject to selective and recombination processes. For example, the diversity pattern of the rs7903146 risk allele suggests a positive selective sweep that may have rapidly driven fixation in East Asians and decreased the frequencies in Europeans and Africans . We observed an extension of the positive selective sweep near the rs7903146 SNP comprising the intronic region 4 to 6, and a possible effect of this extension would be a decrease in the frequencies of the SNPs associated with T2D (rs7896811, rs11196199, rs114863326, rs11196203, rs72826075 and rs35011184). To determinate the relationships among the different haploid genotypes associated with T2D, we performed a network haplotype (see Methods). For this analysis, we included the genotypes of macaque, orangutan and chimpanzee, with the latter as the outgroup (Root). Although we included a root (Outgroup) on network (Fig. 3), all internal nodes exhibited reticulations, hindering the interpretation of the hypothetical ancestors. The internal nodes that surrounded the root were grouped (green dashes) by their haplotype profiles. The European haplotype pattern presented major frequencies, including at the rs7903146 SNP. Other groups (Fig. 3, blue dashes) were formed from haplotype 00100, producing nodes with lower frequencies in the African populations. The rs7903146 risk allele was located in all networks, suggesting that it is an ancestral haplotype and possibly fixed in all populations; however, this risk allele was not derived from the root (chimpanzee). By contrast, the possible rs114863326 risk allele was observed between the nodes from the chimpanzee and macaque and was located only in the African population group (blue dashes). The predominance of rs114863326 within the African group would suggest a possible selective sweep to each population mediated by intra-species recombination.
The reticulations on the internal nodes are produced by recombination effects. The rapid growth of the human population also facilitated the spread of rare variants among different races. The presence of recombination hotspots inside and outside the TCF7L2 gene suggests constant genetic flow, and the phylogenetic interpretation is consequently unclear. We therefore constructed a NeighborNet tree (see Methods) based on the variation obtained from the 1000 Genomes Project. As shown in Fig. 4, two principal groups were formed, each including several types of human populations. This phylogenetic incoherence was derived from the reticulations (dashes red) within and between both groups. Despite the reticulations, small groups maintained coherence (Fig. 4), for example, the Indian population with cluster A (ITU, Karimnagar) and B (GIH-ITU, Gandhinagar and Karimnagar). Cluster C included a Japanese population (JPT). These clusters with phylogenetic coherence suggest that the majority of the observed recombination events are ancient.
TCF7L2 harbors variants with the strongest effect on T2D. The phenotypic changes associated with the risk genotype suggest that T2D arises as a consequence of impairment of several vital functions in the pancreatic islet . In addition, the synthesis and processing of proinsulin and possibly the clearance of proinsulin and insulin are regulated by TCF7L2 . Due to the functional importance of TCF7L2, any mutation in either the coding or noncoding regions could change its function. Recently, the common noncoding variant rs7903146 (risk T-allele) was associated with increased TCF7L2 expression and decreased insulin content and secretion . This variant has been associated with T2D in several ethnic groups, such as Japanese, Indian, Tunisian, European and Mexican [1, 24, 29–31]. By replicating the case–control association analysis from the Sigma Project, the rs7903146 risk allele was also associated with T2D, and other possible risk alleles placed on introns 4–6 (rs7896811, rs11196199, rs114863326, rs11196203, rs72826075 and rs35011184) were located in the limits of the cutoff values of the Manhattan plot. The presence of the rs7903146 SNP in several populations may suggest an ancestral character; however, the association of this variant with T2D in populations with controlled ancestry might suggest an increased rate of recombination in a specific population. The dispersion of the possible risk allele rs35011184 in a given population may be mediated through recombination or by the occurrence of new mutations within the region [32, 33]. Although LD has been reported in the TCF7L2 gene, rs7903146 exhibited moderate LD in the European populations (r2= 0.73) . By contrast, we observed higher LD of the 4–6 intronic region (including rs7903146) associated with T2D (0.5 to 1) in the Latino population (r2= 0.5 to 1) than the African population (r2= 0.2 to 0.5), and the SNPs in LD were not associated with T2D, suggesting a vertical inheritance of genotypes.
Phylogenetic analysis of exon 4 suggested that the rs7903146 risk allele is an ancestral character and tends to recombine with a variety of haplotype backgrounds, particularly in the relatively diverse West African population ; however, the analyses of Network median joining of the intronic variants indicated an increase in the recombination of rs7903146 because it was mainly placed in two of the groups. Moreover, the possible risk allele rs114863326 also exhibited ancestral character and was located next to the root taxon (chimpanzee). Notably, this variant was only localized in a cluster of the African population, suggesting that this variant tends to recombine with this group. The increase in reticulations in the haplotype network in the intronic regions could indicate a decrease in selection forces compared to the coding regions; however, we did not observe differences in the reticulations of the neighborNet tree from 6,061 variants (in the TCF7L2 gene) in mostly exonic regions obtained from the 1000 Genomes Project. The only phylogenetic consistency was observed in small groups belonging to the same population. The excess of phylogenetic inconsistency across the TCF7L2 gene could be the result of recombination hotspots in the surrounding region (1 Mb) of the TCF7L2 gene. In particular, two recombination peaks flank the gene at the 5’ and 3’ ends. Some apparent recombination hotspots could be an artifact of variable population size over time (demographic effects) because these hotspots were identified assuming a particular model of human history, and one of the assumptions of that model is that the effective population size of humans has remained constant throughout history . The inferred recombinant hotspots in gene regions that also appear to have undergone positive selection  may not be due to non-uniform densities of meiotic recombination but may simply be a by-product of positive selection. Therefore, selective sweeps do make a significant contribution to the patterns of LD in the human genome .
Selective sweeps can increase the genetic differentiation among populations, cause allele frequency spectra to depart from the expectation under neutrality, and also alter the allele frequencies of single nucleotide polymorphisms (SNPs) in the vicinity of the selected allele . We observed two peaks of selective sweeps in exon 5 and intron region 5–6 of the TCF7L2 gene in samples from the Sigma Project with Amerindian ancestry greater than 70 %. This peak is within intron region 4–6 (associated to T2D), which covers approximately 81 % (178,063 bp) of the TCF7L2 gene. Because more than half of the gene is constantly under sweep selection, the reduction or elimination of variation among the nucleotides in neighboring regions could explain why most of the mutations (40 variants) observed in the Amerindian population occurred in intron 7 to 15 and why only six mutations were observed in introns 4 and 6. Forty of the mutations were not observed in the 1000 Genomes Project or the Sigma Project, and all had a low frequency (rare variants). Moreover, four of the variants were identified in the 1000 Genomes Project, and the rs176632 (T/C) and rs56913138 (G/T) SNPs exhibited a MAF of 0.99 and 0.19, respectively. In the Sigma Project, the two mutations (rs77961654 C/A and rs61724286 C/G) in exon 15 exhibited the lowest MAF of 0.08 and 0.0009, respectively. All mutations in the Amerindian population identified in both databases (1000 Genomes Project and Sigma Project) are of ancestral character and survived genetic drift, possibly by strong positive selection. Most of the peaks of selective sweeps were identified downstream of the TCF7L2 gene, including the following genes: HAPB2, NRAP, CASP7, PLEKHS1, MIR4483, DCLRE1A and NHRLC2. This region was associated with diseases from the GWAS catalog such as obesity-related traits and Gaucher disease severity. In addition, the peaks of selective sweeps overlapping with recombination hotspots could influence the selection of the TCF7L2 gene.
The impact of the TCF7L2 gene on metabolic diseases is fundamental because this gene has multiple targets in the pathways for insulin synthesis . The hepatic expression of TCF7L2 is decreased in diabetic patients as their body mass index (BMI) increases . Moreover, studies in mice have revealed that the regulation of TCF7L2 expression varies in different tissues. In mice fed a high fat diet (HFD), the expression levels of TCF7L2 were upregulated in pancreatic islets but down-regulated in the liver . Furthermore, the TCF7L2 gene displays a complex pattern of spliced variants, with several alternative exons and splice sites, resulting in splice variants that are distributed and expressed in a tissue-specific manner [41–44]. The presence of the rs61724286 (C/G) splice variant identified in Zapotecan and mixed race (group with less Amerindian ancestry) populations in conjunction with the rs77961654 (C/A) missense variant (present in three indigenous populations) may not be tolerable for function of the TCF7L2 gene. The function of this gene is also impacted by the high-fat, caloric-dense diets and sedentary lifestyles common in our society. Although most mutations surviving the processes of selective sweeps in the Indigenous populations were intronic and had no immediate functional impact, the accumulation of mutations, particularly in exon 15, could have a damaging effect on the regulation of metabolism by the TCF7L2 gene.
The TCF7L2 gene is the locus most strongly associated with T2D risk in different populations . In addition, evidence of LD and mutations in the TCF7L2 gene  that are under selective processes exist ; however, ancestry has not been controlled in most studies, and evolutionary processes in Amerindian populations have not been evaluated. In this work, we identified and analyzed the nucleotide variants in three indigenous populations: Mazatecan, Nahua and Zapotecan. Although the number of sampled individuals in each of the three groups was low (30–60) and varied, the high Amerindian ancestry of the indigenous population allowed us to detect changes in the allelic frequencies in the Mexican population (Sigma Project) and the rest of the human population (1000 Genomes Project). Thus, we observed an asymmetric distribution of nucleotide variants in the 5’ and 3’ ends of the TCF7L2 gene. Most of the mutations were located within intron 6–7 onwards. The only two coding mutations were located in exon 15. The lack of diversity in the Indigenous populations in intronic regions 1–6 could be due to a selective sweep because we identified two selection peaks in the samples from a Mexican Mestizo population (Sigma Project) with Amerindian ancestry higher than 70 %. The first peak was located in exon 5, and the second was located in intron 5–6. Additionally, the two selective peaks were placed within the area (intronic 4–6) containing mutations associated with T2D, including the rs7903146 risk allele, which was the most frequent allele not only in the samples from the Sigma Project but also in the various human populations. The reticulations identified in the network analysis, either in the variants associated with T2D or all of the variants of the TCF7L2 gene in the samples of the 1000 Genomes Project, suggest that the dispersion of rs7903146 in human populations is due to recombination. This hypothesis is supported by the recombination hotspots located in the upstream region and 3’ end of the TCF7L2 gene. Furthermore, the overlap of the sweep peaks with the recombination hotspots in the area downstream of the gene, specifically in an area associated with obesity-related traits and Gaucher disease severity, indicates that the TCF7L2 gene is part of a strongly selected region in which rare variants that survive selective sweep could be genetic sources that modify metabolism.
The Amerindian population, based on self-reporting, comprised 60 Mazatecan from San Lorenzo Cuaunecuiltitla, Oaxaca (Mexico), 34 Nahuas from Xoteapa, Veracruz (Mexico) and 30 Zapotecan from the Mexican Diversity Project. Henceforth, these populations are referred to as the Indigenous population. Another group of 30 mixed-race Mexicans (Mestizo population) from the Mexican Diversity Project  participated. All individuals gave informed consent before they were included in the study. Mestizo and Zapotecan participants gave written consent. Verbal consent from Nahua population was taken and written consent from Mazatecan population was taken. Mestizos and Zapotecan samples were taken with anonymazed consent; All the individuals were 18 years or older. The age range was: Nahua from 37 to 81 years old, Mazatecan from 22 to 78 years old.
DNA was extracted using the QIAamp DNA Blood Maxi Kit, QIAGEN. We designed primers for each exon that also covered part of the introns surrounding the exon (Additional file 4). Sequencing was performed using the Sanger method on an Applied Biosystems sequencer (3730xl DNA Analyzer) with Big dye terminator (standard method).
The variability (SNPs) of chromosome 10 was obtained from the 1000 Genomes Project ((v4.20130502) ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/). This project contains five super-populations: African (AFR), Ad Mixed American (AMR), East Asian (ASN), European (EUR), and South Asian (SAN). The numbers of sampled individuals were 661, 347, 504, 503 and 489, respectively. The family relationship is included in the *.bed files along with the gender and geographic origin of the individuals.
To assess variability inside the Latino population, particularly on chromosome 10, we used the Sigma Project database , which includes 4,060 whole-exome sequences of individuals. This database consists of 1954 controls, 1905 T2D cases and 201 individuals without ascertainable clinical histories. The Sigma Project comprises four databases: DMS (Diabetes in Mexico Study), MCDS (Mexico City Diabetes Study), MEC (Multiethnic Cohort) and UIDS (UNAM/INCMNSZ Diabetes Study). We used the UIDS database because of the protocols used to identify individuals with T2D. Cases were recruited at the outpatient diabetes clinic of the Department of Endocrinology and Metabolism of the Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán (INCMNSZ). An important feature of this database is that the Amerindian ancestry of the individuals was higher than 70 % (ancestry was calculated previously by exome-chip sequencing). A high Amerindian ancestry eliminates the influence of other populations in the genetic analysis. The analyzed sequence included only the coding regions (exome-sequencing) and also spanned the intronic regions adjacent to exons.
Determination of SNPs
The quality of Sanger sequencing (q-value) of the Indigenous populations was assessed by the fastqc program , with pretreatment of the reads using in-house Python scripts. The sequencing chromatograms were analyzed by the PolyPhred program . We used a minimum q-value of 20 and the following parameters: −f 16, −score 90, −window 15. The output PolyPhred was processed by perl scripts  for mutation screening. The dbSNPs database v. 138  was used to annotate the SNPs identified by the perl scripts.
Association of case–control
Replication of the association test of case–control for T2D in Latinos with Amerindian ancestry greater than 70 % (described above) was performed by the plink program  with the parameters provided in the Sigma Project .
Genomic region for evolutionary analysis
The area of study for selective sweep analysis and recombination focused only on chromosome 10 because this region harbors the TCF7L2 gene. The region was defined by referring to the human genome (hg19 version) and delimited with the coordinates: 113,925,369 – 115,925,369. This region encompasses the following genes: GPAM, TECTB, MIR6715A, MIR6715B, GUCY2GP, ACSL5, ZDHHC6, VTI1A, MIR4295, LOC103344931, TCF7L2, HABP2, NRAP, CASP7, PLEKHS1, MIR4483, DCLRE1A, NHLRC2, ADRB1 and CCDC186.
Identification of selective sweeps
Based on the variation identified in chromosome 10 from the whole-exome-sequencing of the Sigma Project (described above), we constructed a nucleotide sequence composed mainly of variation (SNPs) for each individual. If a mutation was not observed at a particular position for an individual in the alignment of all samples, we used the existing nucleotide in the hg19 reference genome (The Genome Sequencing Consortium). This process comprised two parts. First, we extracted the nucleotide variation for each individual using the pseq program  with flags v-view --geno –mask. Next, the nucleotides were concatenated through perl scripts without losing the order of chromosome 10. In the second part, we revised and formatted the alignment to analyze selective sweeps using perl scripts. Selective sweeps were identified by traditional and whole-genome tools.
The traditional tests used: iHS , XPEHH  and nSL . The iHS scores were calculated and normalized using selscan program  with parameters: −-max-extend 1000000 (maximum EHH extension in bp), −-max-gap 300000, −-gap-scale 30000, −-cutoff 0.05 (EHH decay cutoff) and --alt. We used the recombination map of Altshuler . The output results for each SNP were then frequency-normalized over chromosome 10 using the script norm, provided with Selscan. This normalization was also done using default parameters: −-bins 100 (number of frequency bins). The fractions of SNPs with |iHS| values above 2.0 were retained.
The nSL scans were performed and normalized using Selscan program  following the same procedures used for iHS. The iHS statistic compares the integrated EHH (Extended Haplotype Homozygosity) profiles between two alleles at a given SNP in the same population. It is noteworthy that the nSL test uses a slightly different approach to screen for EHH than the original iHS test and is robust to variation in recombination frequency, also identify selective peaks under a number of different selection scenarios, most notably in the cases of sweeps from standing variation and incomplete sweeps .
The XP-EHH (Cross Population Extended Haplotype Homozygosity) statistic compares the integrated EHH profiles between two populations at the same SNP  and is expected to be more reliable if a reference population with a similar demographic history is available, and if the allele under selection is close to fixation in one of the populations. XP-EHH score were calculated by Selscan program  with parameters: −-xpehh –max-gap 300000 –gap-scale 30000 and --alt. We compared Latino population vs African population reference of the 1000 Genomes Project. The scores for each SNP were then frequency-normalized over chromosome 10 using the script norm, provided with Selscan. We fit the XPEHH scores distribution using General model Gauss with one component by Matlab program . Since that the negative XPEHH suggest selection happened in reference population, we identified the two Gauss distribution tails with 95 % confidence. SNPs passing positive threshold are candidates for selection in Latino population and those passing negative threshold are in African population.
The whole-genome tools: The ω statistic proposed by Kim and Nielsen  was able to accurately localize the selective sweeps by a likelihood (ML) framework that uses linkage-disequilibrium (LD) information. OmegaPlus program  has a high-performance implementation of the ω statistic at genome level and was utilized for scan to chromosome level the selective sweeps, with parameters -grid 47370, −minwin 1, −maxwin 5, −threads 20, −impute N, −seed 20, and -all. The peaks of selective sweeps from OmegaPlus were those that exceeded the upper confidence interval calculated by the function dfittool from the Matlab program .
Calculation of linkage disequilibrium (LD) and integration of genetics analysis
LD was calculated for chromosome 10, particularly for the regions of the TCF7L2 gene. We extended this region 1000 kb downstream and upstream of the TCF7L2 gene (described above). The plink program  was used with default parameters; however, to rule out the influence of non-Latino populations, we calculated LD based on the African population of the 1000 Genomes Project (because it is the oldest population) and another analysis with the Latino population only.
The mutations from the Sigma Project localized to the extended region (described above) and were searched against a GWAS database to identify the diseases associated with this genomic region. Additionally, the peaks of selective sweeps (determined above) and mutations of the Indigenous population previously identified were integrated by the LocusZoom program .
Haplotype network and phylogenetic network analysis
To determine the relationships among the different genotypes associated with T2D (describe above), we performed Network median joining using the Network program . We used the following modules: Start contraction, Median joining and MP calculation. For the Outgroup, we included orangutan (GCA_000001545.1), chimpanzee (GCA_000001515.4) and macaque (MMUL 1.0) sequences. Mutations in the TCF7L2 gene (6,061 mutations) from the 1000 Genomes Project were concatenated, and a fasta file was constructed. Next, a neighborNet tree was constructed by the Splistree program  using default parameters.
Functional impact of mutations
The mutations were analyzed based on their genomic coordinates in the ANNOVAR  and SNPSIFT  databases. Both databases include several algorithms (nine in all) that predict the functional effect. We selected a mutation if it was positive in at least 4 algorithms. The filters were numbers or letters as follows: POLYPHEN2 (“P and D”), LRT (“D” and “ < 0.0001”), MutationTaster (“A and D”), MutationAssessor (“high, medium and low”), FATHM (“ < −1.5”), GERP++ (“ > = 5”), SIFT (“ < 0.05”), phylop (“ > = 3”) and PROVEAN (“ < −2.5 Deleterious probably” and “ < −4.1 Deleterious”).
Effects of mutations on gene expression
The SNPs could create, destroy, or modify the efficiency of miRNA binding to the 3’UTR of a gene, resulting in gene dysregulation. Consequently, we explored each mutation identified in the Indigenous population in the following databases: miRdSNP , mrSNP  and RNAsnp . We plotted our results using the UCSC browser (http://genome.ucsc.edu/cgi-bin/hgGateway).
Ethics approval and consent to participate
This study was approved by the Research Committee, Research Ethics Commitee and Institutional Biosafety Commitee from the National Institute of Genomic Medicine. All individuals gave informed consent before they were included in the study. Mestizo and Zapotecan participants gave written consent; Mestizos and Zapotecan samples were taken with anonymized consent; Verbal consent from Nahuatl population was taken and written consent from Mazatecan population was taken. The consent from Nahuatl population was verbal, because the samplings were made with the consent of the leaders of the ethnic group and each participant gave his approval to participate in the study. The oral approval was obtained with a translator and answers were given to all questions about the investigation.
genome-wide association study
minor allele frequency
reference allele/mutated allele
minimum frequency to the reference
single nucleotide polymorphism
type 2 diabetes mellitus
SIGMA Type 2 Diabetes Consortium, Williams AL, Jacobs SB, Moreno-Macías H, Huerta-Chagoya A, Churchhouse C, et al. Sequence variants in SLC16A11 are a common risk factor for type 2 diabetes in Mexico. Nature. 2014;506:97–101.
Grant SF, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A, Sainz J, et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat Genet. 2006;38:320–3.
Cauchi S, Meyre D, Choquet H, Dina C, Born C, Marre M, et al. TCF7L2 variation predicts hyperglycemia incidence in a French general population: The data from an epidemiological study on the insulin resistance syndrome (DESIR) study. Diabetes. 2006;55:3189–92.
Weedon MN. The importance of TCF7L2. Diabet Med. 2007;24:1062–6.
Cauchi S, El Achhab Y, Choquet H, Dina C, Krempler F, Weitgasser R, et al. TCF7L2 is reproducibly associated with type 2 diabetes in various ethnic groups: A global meta-analysis. J Mol Med (Berl). 2007;85:7770782.
Palizban A, Nikpour M, Salehi R, Maracy MR. Association of a common variant in TCF7L2 gene with type 2 diabetes mellitus in a Persian population. Clin Exp Med. 2012;12:115–9.
Martínez-Gómez LE, Cruz M, Martínez-Nava GA, Madrid-Marina V, Parra E, García-Mena J, et al. A replication study of the IRS1, CAPN10, TCF7L2, and PPARG gene polymorphisms associated with type 2 diabetes in two different populations of Mexico. Ann Hum Genet. 2011;75:612–20.
van Amerongen R, Nawijn MC, Lambooij JP, Proost N, Jonkers J, Berns A. Frat oncoproteins act at the crossroad of canonical and noncanonical Wnt-signaling pathways. Oncogene. 2010;29:93–104.
Chang YC, Chang TJ, Jiang YD, Kuo SS, Lee KC, Chiu KC, et al. Association study of the genetic polymorphisms of the transcription factor 7-like 2 (TCF7L2) gene and type 2 diabetes in the Chinese population. Diabetes. 2007;56:2631–7.
Ng MC, Tam CH, Lam VK, So WY, Ma RC, Chan JC. Replication and identification of novel variants at TCF7L2 associated with type 2 diabetes in Hong Kong Chinese. J Clin Endocrinol Metab. 2007;92:3733–7.
Lehman DM, Hunt KJ, Leach RJ, Hamlington J, Arya R, Abboud HE, et al. Haplotypes of transcription factor 7-like 2 (TCF7L2) gene and its upstream region are associated with type 2 diabetes and age of onset in Mexican Americans. Diabetes. 2007;56:389–93.
Helgason A, Pálsson S, Thorleifsson G, Grant SF, Emilsson V, Gunnarsdottir S, et al. Refining the impact of TCF7L2 gene variants on type 2 diabetes and adaptive evolution. Nat Genet. 2007;39:218–25.
Ahlzén M, Johansson LE, Cervin C, Tornqvist H, Groop L, Ridderstråle M. Expression of the transcription factor 7-like 2 gene (TCF7L2) in human adipocytes is down regulated by insulin. Biochem Biophys Res Commun. 2008;370:49–52.
Osmark P, Hansson O, Jonsson A, Rönn T, Groop L, Renström E. Unique splicing pattern of the TCF7L2 gene in human pancreatic islets. Diabetologia. 2009;52:850–4.
Humphries SE, Gable D, Cooper JA, Ireland H, Stephens JW, Hurel SJ, et al. Common variants in the TCF7L2 gene and predisposition to type 2 diabetes in UK European Whites, Indian Asians and Afro-Caribbean men and women. J Mol Med (Berl). 2006;84:1005–14.
Miyake K, Horikawa Y, Hara K, Yasuda K, Osawa H, Furuta H, et al. Association of TCF7L2 polymorphisms with susceptibility to type 2 diabetes in 4,087 Japanese subjects. J Hum Genet. 2008;53:174–80.
Reyes-Lopez R, Perez-Luque E, Malacara JM. Metabolic hormonal characteristics and genetic variants of TCF7L2 associated with development of gestational diabetes mellitus in Mexican women. Diabetes Metab Res Rev. 2014;8:701–6.
McClellan J, King MC. Genetic heterogeneity in human disease. Cell. 2010;141:210–7.
Coventry A, Bull-Otterson LM, Liu X, Clark AG, Maxwell TJ, Crosby J, et al. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat Comm. 2010;1:131.
Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–9.
Kim Y, Nielsen R. Linkage disequilibrium as a signature of selective sweeps. Genet. 2004;167:1513–24.
Sabeti PC, Reich DE, Higgins JM, Levine HZ, Richter DJ, Schaffner SF, Gabriel SB, Platko JV, Patterson NJ, McDonald GJ. Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002;419(6909):832–7.
VanLiere JM, Rosenberg NA. Mathematical properties of the r2 measure of linkage disequilibrium. Theor Popul Biol. 2008;74:130–7.
Diabetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium, Asian Genetic Epidemiology Network Type 2 Diabetes (AGEN-T2D) Consortium, South Asian Type 2 Diabetes (SAT2D) Consortium, Mexican American Type 2 Diabetes (MAT2D) Consortium, Type 2 Diabetes Genetic Exploration by Nex-generation sequencing in muylti-Ethnic Samples (T2D-GENES) Consortium, Mahajan A, et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat Genet. 2014;46:234–44.
Kaminsky EB, Kaul V, Paschall J, Church DM, Bunke B, Kunig D, et al. An evidence-based approach to establish the functional and clinical significance of copy number variants in intellectual and developmental disabilities. Genet Med. 2011;13:777–84.
Ruiz-Narváez EA. Redundant enhancers and causal variants in the TCF7L2 gene. Eur J Hum Genet. 2014;22:1243–6.
Hansson O, Zhou Y, Renström E, Osmark P. Molecular function of TCF7L2: Consequences of TCF7L2 splicing for molecular function and risk for type 2 diabetes. Current Diabetes Reports. 2010;10:444–51.
Zhou Y, Park SY, Su J, Bailey K, Ottosson-Laakso E, Shcherbina L, et al. TCF7L2 is a master regulator of insulin production and processing. Hum Mol Genet. 2014;23:6419–31.
Hara K, Fujita H, Johnson TA, Yamauchi T, Yasuda K, Horikoshi M, et al. Genome-wide association study identifies three novel loci for type 2 diabetes. Hum Mol Genet. 2014;23:239–46.
Saxena R, Saleheen D, Been LF, Garavito ML, Braun T, Bjonnes A, et al. Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. Diabetes. 2013;62:1746–55.
Turki A, Al-Zaben GS, Mtiraoui N, Marmmuoch H, Mahjoub T, Almawi WY. Transcription factor-7-like 2 gene variants are strongly associated with type 2 diabetes in Tunisian Arab subjects. Gene. 2013;513:244–8.
Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, et al. Linkage disequilibrium in the human genome. Nature. 2001;411:199–204.
Reich DE, Lander ES. On the allelic spectrum of human disease. Trends Genet. 2001;17:502–10.
Johnson AD, Handsaker RE, Pulit SL, Nizzari MM, O’Donnell CJ, de Bakker PI. Snap: A web-based tool for identification and annotation of proxy SNPs using HapMap. BioInformatics. 2008;24:2938–9.
Johnston HR, Cutler DJ. Population demographic history can cause the appearance of recombination hotspots. Am J Hum Genet. 2012;90:774–83.
Verrelli BC, Tishkoff SA. Signatures of selection and gene conversion associated with human color vision variation. Am J Hum Genet. 2004;75:363–75.
Reed FA, Tishkoff SA. Positive selection can create false hotspots of recombination. Genet. 2006;172:2011–4.
Chen H, Patterson N, Reich D. Population differentiation as a test for selective sweeps. Genome Res. 2010;20:393–402.
Hindle AK, Brody F, Teva R, Kluk B, Hill S, McCaffrey T, et al. TCF7L2 expression in diabetic patients undergoing bariatric surgery. Surg Endosc. 2009;23:700–4.
Yang H, Li Q, Lee JH, Shu Y. Reduction in Tcf7l2 expression decreases diabetic susceptibility in mice. Int J Biol Sci. 2012;8:791–801.
Prokunina-Olsson L, Welch C, Hansson O, Adhikari N, Scott LJ, Usher N, et al. Tissue-specific alternative splicing of TCF7L2. Hum Mol Genet. 2009;18:3795–804.
Weise A, Bruser K, Elfert S, Wallmen B, Wittel Y, Wöhrle S, et al. Alternative splicing of Tcf7l2 transcripts generates protein variants with differential promoter-binding and transcriptional activation properties at Wnt/beta-catenin targets. Nucleic Acids Res. 2010;38:1964–81.
Mondal AK, Das SK, Baldini G, Chu WS, Sharma NK, Hackney OG, et al. Genotype and tissue-specific effects on alternative splicing of the transcription factor 7-like 2 gene in humans. J Clin Endocrinol Metab. 2010;95:1450–7.
Le Bacquer O, Shu L, Marchand M, Neve B, Paroni F, Kerr Conte J, et al. TCF7L2 splice variants have distinct effects on beta-cell turnover and function. Hum Mol Genet. 2011;20:1906–15.
Silva-Zolezzi I, Hidalgo-Miranda A, Estrada-Gil J, Fernandez-Lopez JC, Uribe-Figueroa L, Contreras A, et al. Analysis of genomic diversity in Mexican mestizo populations to develop genomic medicine in Mexico. Proc Natl Acad Sci U S A. 2009;106:8611–6.
Andrews S. A quality control tool for high throughput sequence data. 2010. http://www.bioinformaticsbabrahamacuk/projects/fastqc/. Accessed 15 Jan 2015.
Stephens M, Sloan JS, Robertson PD, Scheet P, Nickerson DA. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat Genet. 2006;38:375–81.
Nilsson D. A simple pipeline for Sanger read heterozygosity aware mutation screening. 2011. http://github.com/dnil/acrscreen/. Accessed 15 Jan 2015.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. DbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 2001;29:308–11.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. Plink: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.
Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006;4(3), e72.
Ferrer-Admetlla A, Liang M, Korneliussen T, Nielsen R. On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol Biol Evol. 2014;31(5):1275–91.
Szpiech ZA, Hernandez RD. selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol Biol Evol. 2014;31(10):2824–7.
Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de Bakker PI, Deloukas P, Gabriel SB, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467(7311):52–8.
Fariello MI, Boitard S, Naya H, SanCristobal M, Servin B. Detecting signatures of selection through haplotype differentiation among hierarchically structured populations. Genetics. 2013;193(3):929–41.
MathWorks I. MATLAB and Statistics Toolbox Release 2012. Natick: The MathWorks, Inc; 2012.
Alachiotis N, Stamatakis A, Pavlidis P. OmegaPlus: A scalable tool for rapid detection of selective sweeps in whole-genome datasets. BioInformatics. 2012;28:2274–5.
Pruim RJ, Welch RP, Sanna S, Teslovich TM, Chines PS, Gliedt TP, et al. LocusZoom: Regional visualization of genome-wide association scan results. BioInformatics. 2010;26:2336–7.
Polzin T, Daneshmand SV. On Steiner trees and minimum spanning trees in hypergraphs. Oper Res Lett. 2003;31:12–20.
Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006;23:254–67.
Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38, e164.
Cingolani P, Platts A, le Wang L, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6:80–92.
Bruno AE, Li L, Kalabus JL, Pan Y, Yu A, Hu Z. miRdSNP: A database of disease-associated SNPs and microRNA target sites on 3’UTRs of human genes. BMC Genomics. 2012;13:44.
Maragkakis M, Reczko M, Simossis VA, Alexiou P, Papadopoulos GL, Dalamagas T, et al. DIANA-microT web server: Elucidating microRNA functions through target prediction. Nucleic Acids Res. 2009;37(Web Server Issue):W273–6.
Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. The RNAsnp web server: Predicting SNP effects on local RNA secondary structure. Nucleic Acids Res. 2013;41(Web Server Issue):W475–9.
JLA was supported by a DGAPA UNAM postdoctoral fellowship.
This research project was sponsored by Ministry of Health from Mexico with federal funds. The authors wish to thank the design department for their support.
Availability of data and materials
All individuals gave informed consent verbal or written to sharing in form anonymously the reads sequenced. The data to explain and repeat the experiments were uploaded in the SRA repository from NCIB with follow link http://www.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3281287.
TMC and LBP designed the sequencing protocols. ACHM, SNCP, LCAC, BCF and EYRV conducted the sequencing. TTL and CAAS provided the Sigma database. JLA performed the bioinformatic analysis. JLA and LBP conceived the studies and drafted the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
All individuals gave informed consent verbal or written to sharing anonymously the reads sequenced.
Details of selection analysis by several methods. (DOCX 14.8 kb)
Analysis of Selection Sweeps (iHS, nSL, w and XP-EHH). (JPG 421 kb)
GWAS Catalog SNPs in Region. (DOC 61.5 kb)
Primer used to sequencing of TCF7L2 gene exons. (TXT 3.24 kb)
About this article
Cite this article
Acosta, J.L., Hernández-Mondragón, A.C., Correa-Acosta, L.C. et al. Rare intronic variants of TCF7L2 arising by selective sweeps in an indigenous population from Mexico. BMC Genet 17, 68 (2016) doi:10.1186/s12863-016-0372-7
- TCF7L2 gene
- Type 2 diabetes
- Genetic association
- Sweeps selection
- Recombination hotspots