Mesoamerican Totonacs (24) were sampled from an isolated rural location near Filomeno Mata, Veracruz, in southern Mexico. South American Bolivians (28) were sampled from several locations in Bolivia. All subjects were recruited as unrelated individuals, and all subjects' grandparents originated from the same geographic region. All samples were collected with informed consent by the Sorenson Molecular Genealogical Foundation (SMGF) as part of a worldwide sample collection project. The study was approved by the Western Institutional Review Board.
Approximately 2 ml of saliva was obtained from each individual using a mouthwash kit. Sample DNA was extracted using a standard alkaline-SDS procedure. The sequences of mitochondrial hypervariable segments (HVS) I and II, from nucleotide position 16,024 through 576, were determined by Sanger sequencing. Along with basal mtDNA clade variation, pre-Columbian mtDNA lineages were inferred with the following key variants: Haplogroup A: A – 16290T, 16319A, 235G; A2 – 16111T, 146C, 153G; Haplogroup B: B – 16189C; B4 – 16217C; B4b – 499A; B2 – 16136T, [16183d]; Haplogroup C: C – 16298C, 16327T, 249d; C1 – 16325C, 290-290d; C1b – 493G; C1d – 16051G; Haplogroup D: D – 16362C; D1 – 16325C. Haplogroup X was not observed. To assign Y-chromosome lineages, samples were genotyped for 36 Y-chromosome STR loci: DYS385, DYS388, DYS389I, DYS389B, DYS390, DYS391, DYS392, DYS393, DYS394, DYS426, DYS437, DYS438, DYS439, DYS441, DYS444, DYS445, DYS446, DYS447, DYS448, DYS449, DYS452, DYS454, DYS455, DYS456, DYS458, DYS459, DYS460, DYS461, DYS462, DYS463, DYS464, GGAAT1B07, YCAII, YGATAA10, YGATAC4, and YGATAH4. The Bolivians were typed for 11 additional Y-SNPs: M172, M173, SRY10831.2, M124, M122, M3, M74, M9, M20, M216, and M89. Y-chromosome lineages were assigned probabilistically using 35 of the 36 STR loci. Haplogroups for the Bolivians were verified or further resolved with the 11 additional Y-chromosome SNPs. All Totonac lineages were verified with Y-chromosome SNPs M242 and M3.
Autosomal SNP data were generated using Affymetrix 6.0 microarrays. Three Bolivians with European Y-haplogroups (G and J) were removed prior to microarray genotyping. Two hundred thirteen SNPs showing strong deviation ($p < 5.5 \times 10^{-8}$) from Hardy-Weinberg expectations were removed as previously described. Pairwise genetic distances were estimated as the average fraction of alleles shared between two individuals over all loci. Two pairs of Bolivians had allele-sharing genetic distances of < 0.13, suggesting relatedness. One sample from each of these pairs was removed, yielding 23 Bolivian samples for analysis. The identity-by-descent haplotype-sharing analysis was performed using the ERSA software. Although many New World HGDP samples show substantial relatedness, the HGDP samples used here were not inferred to be close relatives in a previous study. Affymetrix 6.0 genotypes for the 210 unrelated HapMap samples were obtained from the HapMap project website, and the same SNP selection criteria were applied to the HapMap samples. The filtered HapMap dataset was combined with the dataset generated in this study to assemble a final dataset of 815,377 autosomal SNPs for Totonacs (24), Bolivians (23), unrelated HapMap Yoruba (YRI) (60), unrelated HapMap CEPH (CEU) (60), HapMap Han Chinese (CHB) (45), and HapMap Japanese (JPT) (45). Principal components analysis was performed on pairwise allele-sharing distances using the princomp function and plotted with the graphics tools provided in the Matlab software package (Mathworks, USA).
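The allele-sharing distance and the principal components step can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. It assumes genotypes coded as 0/1/2 allele counts, a per-locus sharing of 1 − |g_i − g_j|/2, and a distance defined as one minus the mean sharing. The function names are hypothetical, and the SVD-based projection only mimics what a princomp-style routine returns for centered data.

```python
import numpy as np

def allele_sharing_distance(genotypes):
    """Pairwise distance = 1 - mean fraction of alleles shared.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts,
    with np.nan marking missing calls (assumed coding).
    """
    g = np.asarray(genotypes, dtype=float)
    n = g.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Use only loci genotyped in both individuals.
            both = ~np.isnan(g[i]) & ~np.isnan(g[j])
            shared = 1.0 - np.abs(g[i, both] - g[j, both]) / 2.0
            dist[i, j] = dist[j, i] = 1.0 - shared.mean()
    return dist

def principal_components(matrix, n_components=2):
    """Project rows onto the leading principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

if __name__ == "__main__":
    toy = np.array([[0, 1, 2, 0],
                    [0, 1, 2, 0],
                    [2, 1, 0, 2]])
    d = allele_sharing_distance(toy)
    print(d)
    print(principal_components(d))
```

Under this coding, pairs with an unusually small distance (such as the < 0.13 cutoff mentioned above) would be flagged as potential relatives.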
Genome-wide admixture estimates and their standard errors were obtained with the Admixture algorithm (version 1.02) after pruning the data for SNPs with pairwise $r^2 \ge 0.2$. Runs at an $r^2$ pruning of 0.5, or with no pruning, produced similar results. We used the Admixture analysis to determine which Bolivian samples were admixed; it showed two major ancestry components in a subset of Bolivians. We then used the Hapmix program, which is limited to two-population comparisons (K = 2), to analyze admixture in the Bolivians. Genome-wide SNPs were assembled for a CEU reference population (60 individuals) and a New World reference population (24 Totonacs plus 13 non-admixed Bolivian individuals). SNP data for each reference population were phased, with imputation of missing data, using the Beagle software package. Unphased genotypes for all SNPs were assembled for the potentially admixed Bolivian samples. The admixed chromosomes were phased and reconstructed with probability estimates of European (CEU) ancestry using the Hapmix program. Most Hapmix run parameters were set following the guidelines suggested by its authors. Because New World populations have much smaller effective population sizes ($N_e$) than Europeans, the New World recombination parameter, $\rho_2$, was scaled (0.15) relative to the CEU parameter, $\rho_1$. Final runs were performed for each individual and each chromosome, varying the number of generations since admixture (n = 2, 3, …, 35). The time of admixture was estimated by computing the likelihood of the data from all chromosomes and all individuals over a range of generations since the admixture event and selecting the value that maximized the summed likelihoods. Individual genome-wide estimates of admixture were calculated as the average expected probability of the number of CEU copies over all SNPs.
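The grid search over generations and the genome-wide averaging can be sketched as follows. This does not reproduce Hapmix output handling. The array layouts and function names are hypothetical, the likelihoods are assumed to be available as log-likelihoods per generation, individual, and chromosome, and the ancestry fraction is assumed to be the mean expected number of CEU copies divided by two.

```python
import numpy as np

def best_generation(log_likelihoods, generations):
    """Pick the generation count that maximizes the summed log-likelihood.

    log_likelihoods: array of shape (n_generations, n_individuals,
    n_chromosomes); generations: matching 1-D array (e.g. 2..35).
    """
    total = np.asarray(log_likelihoods, dtype=float).sum(axis=(1, 2))
    return int(np.asarray(generations)[np.argmax(total)]), total

def genome_wide_ceu_fraction(copy_probs):
    """Average expected CEU ancestry over SNPs for one individual.

    copy_probs: (n_snps, 3) array holding P(0), P(1), P(2) CEU copies.
    """
    p = np.asarray(copy_probs, dtype=float)
    expected_copies = p[:, 1] + 2.0 * p[:, 2]
    return expected_copies.mean() / 2.0  # assumed scaling to a 0..1 fraction

if __name__ == "__main__":
    gens = np.arange(2, 36)
    # Toy surface peaking at 7 generations for 3 individuals x 22 chromosomes.
    ll = -((gens[:, None, None] - 7.0) ** 2) * np.ones((1, 3, 22))
    print(best_generation(ll, gens)[0])
    print(genome_wide_ceu_fraction([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
```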
To identify ancestry informative markers, each of the 815,377 markers was assessed for ancestry information content between the New World and HapMap groups using standardized allelic variance ($f$), calculated as $f = (p_a - p_b)^2 / [4\bar{p}(1 - \bar{p})]$, where $p_a$ and $p_b$ are the derived allele frequencies in population a and population b, respectively, and $\bar{p}$ is the average derived allele frequency in populations a and b. A threshold of $f \le 0.1$ was used to screen for markers with low population differentiation between the Totonacs and non-admixed Bolivians. A threshold of $f \ge 0.3$ was used to screen for markers with high variance between a combined Totonac + non-admixed Bolivian population and each Old World population (YRI, CEU, or CHB + JPT). SNPs common to all three New vs. Old World screens were retained (845 markers). This AIMs set was further reduced to 324 AIMs by removing 1) one of every pair of SNPs with pairwise $r^2$ exceeding 0.2 in a 100-SNP sliding window advanced by 10 SNPs and 2) all SNPs within 100 kb of one another. To obtain the highly divergent SNP set, we repeated this process but set the minimum value of $f$ as the 5% tail for each distribution (range 0.3085 to 0.5804, all markers retained). We then required the SNP to be in the upper 5% tail of the Kullback–Leibler divergence ($D$) for the derived allele $i$, where $D = \sum_i p_{1i} \log(p_{1i} / p_{2i})$ and $p_{1i}$ and $p_{2i}$ are the frequencies of allele $i$ in populations 1 and 2 [35, 36]. We note that the variance and divergence measures are correlated (r = 0.696) but have different distributions. AIMs passing the screening process were checked against HapMap and dbSNP for frequency and strand assignment. Seven highly-differentiated G/C and A/T AIMs were removed due to the possibility of strand assignment confounding.
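The two-threshold screen can be sketched in Python as follows. This is not the authors' code. It assumes the common two-population form of the standardized variance, f = (p_a − p_b)² / [4p̄(1 − p̄)], and the biallelic Kullback–Leibler divergence summed over the derived and ancestral alleles. The population keys ("TOT", "BOL", "NW", "ASN") and function names are hypothetical, and LD pruning and distance filtering are left out.

```python
import numpy as np

def standardized_variance(p_a, p_b):
    """f = (p_a - p_b)^2 / [4 * p_bar * (1 - p_bar)] (assumed form)."""
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    p_bar = (p_a + p_b) / 2.0
    denom = 4.0 * p_bar * (1.0 - p_bar)
    safe = np.where(denom > 0, denom, 1.0)  # avoid dividing by zero
    return np.where(denom > 0, (p_a - p_b) ** 2 / safe, 0.0)

def kl_divergence(p1, p2, eps=1e-6):
    """Biallelic KL divergence of population 1 from population 2."""
    p1 = np.clip(np.asarray(p1, dtype=float), eps, 1 - eps)
    p2 = np.clip(np.asarray(p2, dtype=float), eps, 1 - eps)
    return p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2))

def screen_aims(freqs, old_world=("YRI", "CEU", "ASN"), low=0.1, high=0.3):
    """Return indices of SNPs with low f within the New World groups
    and high f between the pooled New World group and every Old World group."""
    keep = standardized_variance(freqs["TOT"], freqs["BOL"]) <= low
    for pop in old_world:
        keep &= standardized_variance(freqs["NW"], freqs[pop]) >= high
    return np.flatnonzero(keep)

if __name__ == "__main__":
    toy = {
        "TOT": np.array([0.90, 0.5, 0.9]),
        "BOL": np.array([0.85, 0.5, 0.2]),
        "NW":  np.array([0.88, 0.5, 0.6]),
        "YRI": np.array([0.10, 0.5, 0.1]),
        "CEU": np.array([0.10, 0.1, 0.1]),
        "ASN": np.array([0.20, 0.1, 0.1]),
    }
    print(screen_aims(toy))
```

In the toy data only the first SNP passes. The second SNP is uninformative against YRI, and the third differs too much between the two New World groups.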
We empirically determined the ranking of the 324 AIMs by resampling. Subsets of 50 AIMs were randomly selected without replacement from the 324 AIMs. Using the average Native American ancestry estimate from 120,958 genome-wide SNPs as the true ancestry fraction, we iteratively screened for sets of AIMs producing average Native American ancestry component estimates within 10% of the genome-wide average estimate at K = 5 populations and retained 10,000 sets. The AIMs were ranked by the total number of times each AIM appeared across all retained sets. Totonacs and non-admixed Bolivians were analyzed independently. The sum of the ranks in the two populations was used to determine the final ranking for each AIM. To assess the minimum number of AIMs needed to estimate ancestry, we calculated admixture estimates for Totonacs, non-admixed Bolivians, and admixed Bolivians using sets of 2 to 324 AIMs ranked from most to least informative as described above, and calculated the root mean squared error for each set.
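The resampling scheme can be outlined in Python. The ancestry estimator is passed in as a callable because the actual estimates came from the Admixture program, which is not reproduced here. The names `estimate_ancestry`, `rank_aims_by_resampling`, and `rmse` are hypothetical, and "within 10%" is read as a relative tolerance, which is an assumption.

```python
import numpy as np

def rank_aims_by_resampling(n_aims, estimate_ancestry, target,
                            subset_size=50, tolerance=0.10,
                            n_retained=10_000, seed=0,
                            max_draws=1_000_000):
    """Count how often each AIM appears in subsets whose ancestry
    estimate falls within `tolerance` (relative) of `target`."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_aims, dtype=int)
    retained = 0
    for _ in range(max_draws):
        if retained >= n_retained:
            break
        # Draw a subset of AIM indices without replacement.
        subset = rng.choice(n_aims, size=subset_size, replace=False)
        if abs(estimate_ancestry(subset) - target) <= tolerance * target:
            counts[subset] += 1
            retained += 1
    order = np.argsort(-counts, kind="stable")  # most frequent first
    return counts, order

def rmse(estimates, truth):
    """Root mean squared error between estimated and reference ancestry."""
    e = np.asarray(estimates, dtype=float)
    t = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((e - t) ** 2)))

if __name__ == "__main__":
    # Toy estimator: AIM 0 biases any subset that contains it.
    bias = np.zeros(20)
    bias[0] = 10.0
    toy_estimator = lambda idx: 0.9 + bias[idx].mean()
    counts, order = rank_aims_by_resampling(
        20, toy_estimator, target=0.9, subset_size=5, n_retained=200)
    print(counts, order[-1])
```

Running the ranking separately per population and summing the two rank vectors would then give a combined ordering of the kind described above.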
Selection scans were performed using XP-CLR and XP-EHH [22, 37]. For XP-CLR, the New World populations (Totonac and non-admixed Bolivians) were analyzed against a reference population of Eurasians (CEU, CHB, and JPT). XP-CLR is less influenced by SNP ascertainment bias, a known issue with most SNP microarrays [38, 39], and may detect older selection events better than linkage disequilibrium-based methods. XP-CLR scans were performed on Beagle-phased haplotypes using a 0.5 cM sliding window and a 2 kb grid setting with a maximum of 100 SNPs per window. The XP-EHH analysis was performed using the combined Totonac and non-admixed Bolivians as the test population against the CHB/JPT, CEU, and YRI reference populations. Genomic regions were divided into 200 kb blocks, ordered by the highest-scoring SNP in each block, and ranked empirically from the resulting distribution.
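The block-ranking step can be sketched in Python for a single chromosome. This is an illustration only. The function name `rank_blocks` is hypothetical, blocks are assumed to be non-overlapping and aligned to multiples of 200 kb, and the empirical rank is taken as the fraction of blocks whose top score is at least as high as the block in question.

```python
import numpy as np

def rank_blocks(positions, scores, block_size=200_000):
    """Summarize per-SNP selection scores into fixed-size blocks.

    Returns a list of (block_start, top_score, empirical_fraction)
    tuples sorted from the highest to the lowest top score.
    """
    positions = np.asarray(positions)
    scores = np.asarray(scores, dtype=float)
    block_ids = positions // block_size
    blocks = np.unique(block_ids)
    # Highest-scoring SNP in each block.
    top = np.array([scores[block_ids == b].max() for b in blocks])
    # Fraction of blocks scoring at least as high (smaller = more extreme).
    frac = np.array([(top >= t).mean() for t in top])
    order = np.argsort(-top, kind="stable")
    return [(int(blocks[k]) * block_size, float(top[k]), float(frac[k]))
            for k in order]

if __name__ == "__main__":
    pos = [10, 150_000, 210_000, 390_000, 450_000]
    sc = [1.0, 5.0, 2.0, 3.0, 9.0]
    for row in rank_blocks(pos, sc):
        print(row)
```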