Worldwide distribution of NAT2 diversity: Implications for NAT2 evolutionary history

Background The N-acetyltransferase 2 (NAT2) gene plays a crucial role in the metabolism of many drugs and xenobiotics. As it represents a likely target of population-specific selection pressures, we fully sequenced the NAT2 coding region in 97 Mandenka individuals from Senegal, and compared these sequences to extant data on other African populations. The Mandenka data were further included in a worldwide dataset composed of 41 published population samples (6,727 individuals) from four continental regions that were adequately genotyped for all common NAT2 variants so as to provide further insights into the worldwide haplotype diversity and population structure at NAT2. Results The sequencing analysis of the NAT2 gene in the Mandenka sample revealed twelve polymorphic sites in the coding exon (two of which are newly identified mutations, C345T and C638T), defining 16 haplotypes. High diversity and no molecular signal of departure from neutrality were observed in this West African sample. On the basis of the worldwide genotyping survey dataset, we found a strong genetic structure differentiating East Asians from both Europeans and sub-Saharan Africans. This pattern could result from region- or population-specific selective pressures acting at this locus, as further suggested in the HapMap data by extremely high values of FST for a few SNP positions in the NAT2 coding exon (T341C, C481T and A803G) in comparison to the empirical distribution of FST values across the whole 400-kb region of the NAT gene family. Conclusion Patterns of sequence variation at NAT2 are consistent with selective neutrality in all sub-Saharan African populations investigated, whereas the high level of population differentiation between Europeans and East Asians inferred from SNPs could suggest population-specific selective pressures acting at this locus, probably caused by differences in diet or exposure to other environmental signals.

diet, cigarette smoke and the environment [1]. Extensive polymorphism in NAT2 gives rise to a wide interindividual variation in N-acetylation capacity. In particular, a clear bimodal distribution is observed that segregates the rapid acetylator phenotype, associated with a normal acetylation capacity, from the slow acetylator one, characterized by a reduced enzyme activity. These two main metabolic phenotypes occur with varying prevalence in populations of different ethnic origin [2].
The clinical consequences of the acetylation polymorphism can be severe if standard drug doses are applied, exposing patients to an increased risk of adverse drug reactions or a lack of therapeutic efficacy [3]. In addition, in the last decades, an increasing number of epidemiological studies have attempted to relate acetylation phenotype to a variety of complex human disorders, such as bladder cancer, atopic diseases, diabetes, Parkinson's disease and many others (see Butcher et al. [4] for a review). However, up to now, association studies in NAT2 have led to conflicting results among (and even within) human populations and most association findings have been difficult to replicate. One reason for these inconsistencies may relate to the fact that almost all studies focused on a limited number of candidate polymorphisms, which were not necessarily the same from one study to another [5,6]. A shift toward a gene-based approach in which all common variation within a gene is considered jointly is advocated for future association studies [7]. By capturing all of the potential risk-conferring variations within NAT2, this approach should resolve much of the controversial issues of candidate-polymorphism studies.
There is now a large body of information on the distribution of NAT2 genetic variants all over the world [8]. However, most published reports used simplified protocols for NAT2 allele detection, omitting analysis of several polymorphic positions (such as G191A, C282T, T341C, and A803G within the coding region). They focused on a limited number of "indicator" mutations, thought to be tightly linked with other mutations and predictive of acetylator status, and have based allele designation on this. Such incomplete genotyping methods wrongly type different alleles as the same and may lead to misclassification of genotypes and deduced phenotypes [9]. Therefore, results of such investigations may be substantially biased and fail to provide an accurate picture of NAT2 allele distribution in worldwide populations.
In an attempt to better characterize the worldwide haplotype diversity and LD structure of NAT2, we performed an extensive survey of the literature to identify those samples that were adequately genotyped for all common variants in NAT2. In total, 41 population samples (including 6,727 individuals) from four continental regions (Africa, Europe, Asia, America) were selected and jointly analyzed. In addition, we performed full sequencing analysis of the NAT2 coding region in a large and ethnically well defined Mandenka sample from Eastern Senegal so as to further characterize African diversity at this locus and to detect novel variants not yet reported. Beyond a simple description of NAT2 gene diversity, the goal of the present study was to provide further insights into the evolutionary forces that most likely shaped NAT2 genetic variation in humans. In particular, three levels of diversity (intrapopulation, interpopulation, and interspecies variability) were used to investigate to what extent present NAT2 variation patterns solely reflect stochastic events of human evolution, or are distorted by natural selection.

NAT2 sequence diversity in the Mandenka and variation in Sub-Saharan Africa
Results of the sequencing analysis of NAT2 in the Mandenka sample are reported in Table 1. A total of 12 polymorphic sites were identified, all located within the NAT2 coding exon. Two of them were singletons confirmed by resequencing. Apart from the seven SNPs that are commonly found in human populations, we found three additional variants that have been recently reported by Patin et al. [10] in several sub-Saharan African samples (C403G, G609T, G838A), and two novel nucleotide changes not yet described: C345T and C638T, which occurred at a frequency of 0.026 and 0.01, respectively. C638T leads to an amino acid change (P213L). Sixteen distinct haplotypes were inferred by the PHASE program, including two which were recently described in the Patin et al. data set [10] and four that are newly described here. Among these six haplotypes, four contain inactivating mutations and are thus predicted as 'slow alleles', whereas the other two (NAT2*12g and NAT2*12H) contain a nonsynonymous mutation (G609T or C403G) whose impact on phenotype is unknown. From diplotype configurations at NAT2 in the Mandenka, we inferred 48.5%, 39.2%, and 7.2% of slow, intermediate, and rapid acetylators, respectively. The remaining 5.1% individuals had an unknown acetylator status as they were carriers of either a NAT2*12g or a NAT2*12H haplotype.
Summary statistics of genetic variation at the NAT2 coding region (870 bp) in the Mandenka and the 12 African samples of Patin et al. [10] are reported in Table 2. Results for the entire surveyed region (1188 bp) in the Mandenka are also indicated. Patterns of diversity in this sample are entirely consistent with those displayed by the 12 other African samples. The mean values of the two nucleotide variability measures, π and θw, for the 13 samples are 0.268% and 0.221%, respectively. None of the tests of selective neutrality, performed on each sample both at the intrapopulation and at the interspecies levels, yielded significant results (not shown). This suggests that patterns of diversity at NAT2 are consistent with the hypothesis of selective neutrality and constant population size. The average sequence divergence between human and chimpanzee for the NAT2 coding exon was 1.6%, and the average substitution rate was 1.6 × 10⁻⁹ per nucleotide per year. All samples provided similar estimates both for Ne, the current effective population size, and TMRCA, the coalescence time back to the most recent common ancestor: average values were 14,218 individuals and 1.077 My, respectively, for the NAT2 coding exon. The age of mutations in the gene genealogy ranged from 96,200 (G857A) to 496,300 (A803G) years for the seven SNPs that commonly occur in human populations. The other polymorphisms, reported to date only in sub-Saharan Africans, all had estimated ages < 50,000 years.

Footnotes to Table 1:
a Polymorphic sites were numbered considering +1 as the A of the translation start codon in the cDNA sequence (GenBank CR407631)
b NAT2 haplotypes were named in accordance with the consensus gene nomenclature of human NAT2 alleles [65]
c These SNPs/haplotypes have been recently described in Patin et al. [10]
d Newly reported SNPs/haplotypes in the present study
e NAT2*4 is the reference allele that was shown to be the ancestral haplotype through outgroup comparisons with the chimpanzee and rhesus monkey sequences
f Total number of chromosomes in the Mandenka sample

NAT2 worldwide genotyping survey
A total of 6,727 individuals from 41 worldwide samples were analyzed for their genotype at the seven common SNPs of the NAT2 gene. All SNPs and populations were in Hardy-Weinberg equilibrium after Bonferroni correction for multiple testing. The seven SNPs defined 21 distinct haplotypes, whose composition in terms of SNP variants is given in Table 3.
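The Hardy-Weinberg check with Bonferroni correction mentioned above can be illustrated with a minimal sketch. It uses a one-degree-of-freedom chi-square approximation for a biallelic SNP, which is an assumption made for illustration: the text does not state which test was applied, and the function names and genotype counts below are hypothetical.

```python
import math
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Return (chi-square statistic, p-value) for one biallelic SNP under HWE."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def passes_bonferroni(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """True when HWE is not rejected at the Bonferroni-adjusted threshold."""
    return p_value >= alpha / n_tests


# Hypothetical genotype counts for one SNP in one sample.
chi2_stat, p_val = hwe_chi_square(25, 50, 25)
# n_tests = 7 SNPs x 41 samples is an illustrative upper bound, not a figure from the text.
in_equilibrium = passes_bonferroni(p_val, n_tests=7 * 41)
```

Dividing the significance level by the number of tests keeps the family-wise error rate at or below `alpha`, at the cost of reduced power for each individual test.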

Continental distribution of NAT2 SNP variants and haplotypes
Allele frequency variation of the seven common SNPs of NAT2 in the 41 sampled populations is shown in Figure 1, and the worldwide distribution of common NAT2 haplotypes is displayed in Figure 2A. Figure 1 [11]. The MDS plot suggests that NAT2 genetic differentiation patterns are related to geography, which is confirmed by a high and significant correlation coefficient (r = 0.47, P < 10⁻⁵) observed between genetic and geographic distances.
FST gives a measure of the proportion of the genetic variance explained by differences among populations. The global FST value estimated for the 41 worldwide samples was 0.123 (P < 10⁻⁵). When grouping these 41 populations into five major geographic areas (sub-Saharan Africa, Europe/North Africa, Central/South Asia, East Asia, Central America), the vast majority of genetic variation was shown to occur within populations (83.5%), a high proportion (15%, P < 10⁻⁵) among geographic groups, and a mere 1.5% (P < 10⁻⁵) among populations within groups. When only three geographic groups were consid-

Table 3 (column header: SNP position c (ancestral state d)). Global row, haplotype counts: 2234, 174, 2559, 221, 2, 1, 1, 2075, 16, 2, 8, 321, 0, 180, 14, 4, 107, 0, 32, 36, 1; total 7988.
a NAT2 haplotypes were named in accordance with the consensus gene nomenclature of human NAT2 alleles [65]
b Only the 28 samples with no missing data (i.e., with genotype data available for the seven common SNPs in NAT2) were considered here
c Polymorphic sites were numbered considering +1 as the A of the translation start codon in the cDNA sequence (GenBank CR407631)
d The ancestral state of each SNP was deduced from both the chimpanzee and rhesus monkey sequences, and is represented here as a dot
e Nucleotide substitutions shown in bold have a functional consequence on enzyme activity. Bold-faced haplotypes are 'slow alleles' associated with a decreased acetylation capacity; the others display an enzymatic activity comparable to the reference 'rapid' allele NAT2*4
f NAT2*6J is a newly described allele found in the Mandenka sample in the present study (see Table 1). It was predicted to be a 'slow allele' since it contains two inactivating mutations
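To make the FST definition given in this section concrete, here is a minimal sketch of Wright's FST for a single biallelic SNP, computed as (HT − HS)/HT from subpopulation allele frequencies with equal population weights. It is not the AMOVA procedure used in the study and will not reproduce the reported values; the function name and the example frequencies are hypothetical.

```python
from typing import Sequence


def fst_biallelic(freqs: Sequence[float]) -> float:
    """Wright's FST = (HT - HS) / HT for one biallelic SNP, equal population weights."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    # Expected heterozygosity if all populations were pooled.
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within populations.
    h_s = sum(2.0 * p * (1.0 - p) for p in freqs) / k
    if h_t == 0.0:
        return 0.0  # monomorphic everywhere: no variance to partition
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies of one SNP in two populations.
example_fst = fst_biallelic([0.2, 0.8])
```

An FST of 0 means the populations share the same allele frequencies, whereas a value of 1 means they are fixed for different alleles.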

LD analysis
We first tested whether the amount of LD (measured as the r² coefficient between SNP pairs) differed between human populations. All population samples displayed similar levels of LD within each geographic area, except the Somali, who showed higher LD at NAT2 (average r² value = 0.589) than other sub-Saharan African samples and were, in that respect, more similar to Europeans. The mean pairwise r² value between the seven SNPs in the European samples (0.567 ± 0.075, including Moroccans) was significantly higher (Wilcoxon's test, P = 0.0002) than in both East Asians (0.276 ± 0.023) and Africans (0.243 ± 0.050, without Somali). No statistical differences in the level of LD were found between East Asian and African populations. However, the proportion of SNP pairs with r² ≥ 0.5 was far smaller in sub-Saharan Africans (6.7%) than in both Europeans (40%) and East Asians (33.3%). Ashkenazi Jews exhibited the highest level of LD (average r² value = 0.763); such an excess of LD is often observed in founder populations that recently grew from relatively small sizes [12]. We then tested whether the structure of LD was similar among populations by computing the correlation between r² matrices of LD. Mantel's tests gave highly significant correlation values both between population pairs within geographic areas and for pairs of continental regions (P < 0.0001 with 10,000 permutations). Thus, although Europeans exhibited higher levels of LD at NAT2, the pattern of LD in this gene was found to be similar across human populations.
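As an illustration of the LD measure used here, the following sketch computes r² between two biallelic SNPs from phased haplotypes, and the mean r² over all SNP pairs. The 0/1 allele coding, the function names, and the toy haplotypes are assumptions made for the example; the study itself computed r² with DnaSP.

```python
from itertools import combinations
from typing import Sequence


def r_squared(haplotypes: Sequence[Sequence[int]], i: int, j: int) -> float:
    """r^2 between SNPs i and j; each haplotype is a sequence of 0/1 alleles."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n  # frequency of allele 1 at SNP i
    p_j = sum(h[j] for h in haplotypes) / n  # frequency of allele 1 at SNP j
    p_ij = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0.0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


def mean_pairwise_r2(haplotypes: Sequence[Sequence[int]]) -> float:
    """Average r^2 over all pairs of SNPs."""
    n_snps = len(haplotypes[0])
    pairs = list(combinations(range(n_snps), 2))
    return sum(r_squared(haplotypes, i, j) for i, j in pairs) / len(pairs)


# Toy phased haplotypes at two SNPs (hypothetical data).
complete_ld = [(0, 0), (1, 1), (0, 0), (1, 1)]
no_ld = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

With `complete_ld` the two SNPs always co-occur and r² equals 1; with `no_ld` all four haplotypes are equally frequent and r² equals 0.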

Discussion
This study provides a thorough description of NAT2 worldwide genetic diversity. By considering only the samples adequately characterized for the seven common SNPs of the NAT2 gene, we avoided biases arising from incomplete genotyping studies that may lead to both allele and phenotype misclassifications. These seven SNPs are the main polymorphisms occurring in human populations at NAT2 and their joint analysis has been shown to be highly predictive of the acetylation phenotype with a prediction rate close to 100% [13][14][15][16][17]. Such a high concordance between genotype and phenotype suggests that unknown NAT2 variants should be present at low frequencies and therefore may not substantially influence the phenotype prediction in population studies. Although these statements are tenable in populations that have been extensively studied at NAT2, such as Europeans or East Asians, it is not yet known if they hold in populations poorly or inadequately studied for NAT2 gene variation, such as sub-Saharan African populations. As shown by Patin et al. [10] and in this study, these populations usually display a greater allelic diversity and may contain novel variants not previously reported in populations of European or Asian origin. Although two new mutations were observed in our sequencing analysis of NAT2 in the Mandenka sample (C345T and C638T, see Table 1), we have not yet found any new major polymorphism apart from the seven acknowledged ones. Consequently, only a small proportion of subjects (5%) would have been classified differently regarding their acetylator status if they had been tested for only the seven common SNPs of NAT2 (these 5% of individuals with an unknown acetylator status would have been classified as either intermediate or rapid acetylators). A similar observation has been made for sev-

Figure 1: SNP frequencies in the 41 samples of the worldwide genotyping survey.
Figure 2: (A) Haplotype frequencies in the 28 samples genotyped for all seven common SNPs at NAT2 (i.e., without missing data), taken from the worldwide genotyping survey.

However, that same survey also detected novel variants occurring at non-negligible frequencies in several Pygmy populations: up to one fourth of the individuals presented an unknown acetylator status due to the high prevalence of novel mutations with an unknown functional effect. Therefore, further sequencing studies that provide information about the entire frequency spectrum rather than pre-selected variants are required to provide an unbiased description of NAT2 sequence variation in human populations not yet investigated.

Genetic structure of human populations at NAT2
The genetic diversity patterns observed at NAT2 are largely consistent with those reported in many other studies of different gene regions in the human genome. The inferred levels of sequence diversity (Table 2) were found to be consistent with values reported at other highly variable human autosomal loci, such as LPL [18], GYPA [19], and CCR5 [20]. The average sequence divergence between human and chimpanzee (1.6%) was close to previous estimates of putatively neutral genomic regions [21], and the estimated Ne and TMRCA were also found to be in agreement with those of several other nuclear loci, which estimate the human Ne and TMRCA to be close to 10,000 and ~1 My, respectively [22,23]. However, one should be aware that the stochastic nature of the coalescence process used to describe the genealogy, the assumptions that have to be made (for example, the absence of recombination and selection) and the removal of data (rare recombinant haplotypes) can all have important effects on inference and lead to imprecise estimations. Sub-Saharan Africans displayed the greatest haplotype diversity at NAT2 and also had the largest number of unique haplotypes [see Table 3]. Furthermore, haplotypes described outside Africa were essentially a subset of the collection of NAT2 haplotypes found within Africa. These features of molecular diversity are generally interpreted as strong evidence for the 'Out-of-Africa' model which hypothesizes that all modern populations emerged from a common ancestral population in Africa [22]. A linear diversity gradient away from Africa was indeed observed at NAT2 in this study, with African populations showing the highest heterozygosities, then successively decreasing in Europeans and East Asians [see Additional file 1]. This pattern is suggestive of a gradual loss of diversity in successive colonization bottlenecks as our species grew and spread all over the world [24]. Ngawbe and Embera Amerindians displayed comparable levels of haplotype diversity to East Asians (0.42 and 0.57, respectively). However, this last observation contrasts with the recent findings of Fuselli et al. [25] that demonstrated higher NAT2 intra-population genetic diversity in Native Americans than in East Asians, implying more complex processes in the evolution of populations at NAT2 than the simple linear model described above.

[Figure: MDS plot. Axes: Dimension 1, Dimension 2. Legend: Sub-Saharan Africa, North Africa, Europe, Central/South Asia, East Asia, America.]
Another line of evidence supporting a relatively recent African origin of modern humans came from our analysis of LD patterns at NAT2: the lowest levels of LD were found in African populations, a common finding in empirical studies of LD in human populations. This is consistent with a larger long-term effective size of African populations and/or a bottlenecked population history of non-African populations [22,26]. Even within a small gene like NAT2, the SNP markers appeared to be poorly correlated in sub-Saharan Africans.
We observed a particular pattern of genetic diversity at NAT2 for the Thai sample compared to the other East Asian populations examined. Notably, the frequency of NAT2*4 was found to be significantly lower in Thai (0.30 versus around 0.50 in other Asian populations), resulting in a larger fraction of slow acetylators in this sample (50% versus 5-20%). Interestingly, among the East Asian populations investigated, the Thai sample is the only representative of the variation of NAT2 in Southern East Asia. Because the full NAT2 gene diversity has not yet been investigated in other Southeast Asian populations, it is not possible to conclude whether this population harbors a specific profile with respect to this genetic system, or if it resembles other Southeast Asians. Further data from these latter populations are needed to speculate on a possible genetic differentiation pattern between Northern and Southern East Asian populations at the NAT2 locus.

Possible selective pressures acting on NAT2
Because of its role in the detoxification of exogenous substances, the NAT2 gene has long been considered as a likely target of population-specific selective pressures. But many questions remain about the roles that population history and natural selection have played in shaping the diversity of NAT2. An intriguing point about this gene concerns the high frequency of poor-metabolizers and slow acetylator alleles in most human populations. This might represent the evolution of balanced polymorphisms, maintained by natural selection through heterozygote advantage or spatial-temporal selection of alternative alleles, as it has been shown for G6PD deficiencies or phenylketonuria [27,28]. An alternative explanation is that NAT2 may evolve under no evolutionary constraint, this enzyme being not essential from an evolutionary perspective, maybe because it is dispensable or redundant with other enzymes. In terms of the detoxification of potentially harmful environmental aromatic amines, NAT1 seems indeed to be more active than NAT2, which preempts the latter's role as a key adaptation to increase the fitness of our species [16].
In this study, we investigated whether the patterns of sequence variation at NAT2 were consistent with a standard, neutral equilibrium model in 13 populations of sub-Saharan Africa. All neutrality tests found no evidence of a departure from selective neutrality. But making robust inferences on the action of natural selection at a particular locus requires a thorough characterization of population history, since the neutral null hypothesis is a composite hypothesis that also makes assumptions regarding the demography of the populations. It is typically assumed that the population is in equilibrium at constant size and with no population subdivision. But suppose that African populations did pass through a relatively narrow bottleneck in the late Pleistocene and then expand. In this case, the observed data might reflect the antagonistic effects of a bottleneck (which increases the frequency of rare alleles) and balancing selection (which decreases that frequency), and the apparent evidence in favour of a neutral model of evolution would be an artefact, produced by the confounded effects of these two opposing forces. However, the results of several studies have suggested that sub-Saharan African sequence diversity was compatible with an equilibrium model of long-term constant population size and random mating [29][30][31][32]. They showed that rapid growth from a small initial size was not compatible with the African sequence data and that, if a prehistoric growth occurred, it started from a relatively large Palaeolithic population. Therefore, under such an equilibrium model, NAT2 can be considered as a neutrally evolving gene, at least in the sub-Saharan African populations investigated. Further studies are needed to determine whether this finding can be generalized to all African populations. 
It would also be useful to increase both the number of individuals studied and the size of the genomic region surveyed (for instance, by investigating the entire NAT2 gene sequence, which spans around 10 kb) to increase the power of neutrality tests, which remains weak when sample sizes and numbers of segregating sites are small.
The global level of genetic structure (FST = 0.123, P < 10⁻⁵) was found to be remarkably consistent with the average FST value for the human genome [33,34], and the high correlation found between genetic and geographic distances (r = 0.47, P < 10⁻⁵) implies that patterns of human diversity at NAT2 can largely be accounted for by the simple interaction of drift and geographically-structured gene flow. However, while the overall degree of population structure at NAT2 was similar to previously reported values for neutral markers [35,36], unusual patterns of differentiation were observed between geographic groups. In particular, a striking differentiation of East Asia both from Europe and from Africa was found, a result that could suggest the action of region- or population-specific selective pressures. A large variance in FST values was observed for individual SNPs ( [38] and would increase interpopulation differentiation at linked neutral sites, as distinct haplotypes would be fixed in different populations. Besides, it is also interesting to note the high FST values in the region surrounding the NAT1 gene and, surprisingly, in a noncoding segment preceding the NATP pseudogene.
By contrasting the FST of individual SNPs to the empirical distribution of FST across a genomic region, it is possible to identify those loci that exhibit unusual patterns of population differentiation as potential candidates for the action of natural selection [39,40]. Several studies have used this strategy to detect the action of selection on specific genes or on a genome-wide scale [33,34,41]. However, genome-wide surveys of FST have demonstrated substantial variation of FST values across the genome, even among SNPs that are very close to each other [33,34], thus stressing the difficulty of distinguishing between nonrandom events such as local selection and random events such as extreme genetic drift as the agents responsible for the unusual patterns observed. Individual-marker FST estimates are probably too variable to be reliable indicators of past selective events and other more powerful tests are required to provide unambiguous evidence of natural selection. The availability of genotype data for additional markers surrounding the NAT2 gene would enable the implementation of the long-range haplotype test [42] which has better power for identifying signatures of recent positive selection.
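The outlier strategy described in this paragraph can be sketched as follows: each SNP's FST is ranked against the empirical distribution of FST values in the surrounding region, and SNPs falling in the upper tail are flagged as candidates. The function names, the 5% tail cut-off, and the SNP labels and values are all hypothetical. As the text stresses, a flagged SNP is only a candidate and not evidence of selection by itself.

```python
from typing import Dict, List, Sequence


def empirical_tail_fraction(candidate_fst: float, background_fst: Sequence[float]) -> float:
    """Fraction of background SNPs whose FST is at least as large as the candidate's."""
    if not background_fst:
        raise ValueError("background distribution is empty")
    n_ge = sum(1 for f in background_fst if f >= candidate_fst)
    return n_ge / len(background_fst)


def flag_outliers(fst_by_snp: Dict[str, float], tail: float = 0.05) -> List[str]:
    """Names of SNPs whose FST lies in the upper `tail` of the regional distribution."""
    values = list(fst_by_snp.values())
    return sorted(
        name
        for name, fst in fst_by_snp.items()
        if empirical_tail_fraction(fst, values) <= tail
    )


# Hypothetical regional FST values for 20 SNPs: 0.01, 0.02, ..., 0.20.
region = {f"snp{i}": i / 100 for i in range(1, 21)}
candidates = flag_outliers(region, tail=0.05)
```

Using the regional distribution as its own null avoids assuming a demographic model, but it cannot separate local selection from extreme drift, which is the limitation noted above.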
A possible explanation for the unusually large difference in allele frequencies observed between European and East Asian populations for the NAT2 variants could be the impact of population-specific selective pressures. The main molecular basis for the high discrepancy between Europeans and East Asians is that the most common allele at the NAT2 locus in Europeans (NAT2*5B) is very rare in East Asians and could represent a different selective advantage within the gene pools of these separate populations. Patin et al. [43] found evidence of a rapid increase in frequency of the NAT2*5B haplotype in Western and Central Eurasian populations in the last ~6,500 years in response to positive selection, suggesting that this slow allele probably conferred some selective advantage to its carriers in this part of the world. A thorough survey of NAT2 sequence variation in East Asians will be necessary to determine whether the predominance of the rapid-acetylator NAT2*4 allele over the slow ones is the result of local positive selection or whether it can be explained by stochastic processes such as genetic drift.

Conclusion
This study provides a thorough description of the worldwide haplotype diversity and LD structure of the NAT2 gene. We found that patterns of NAT2 sequence variation are consistent with selective neutrality in all sub-Saharan African populations investigated, whereas the high level of population differentiation between Europeans and East Asians inferred from SNPs may suggest population-specific selective pressures acting at this locus, probably caused by differences in diet or exposure to other environmental signals.

NAT2 sequencing of the Mandenka
Full sequence diversity of NAT2 exon 2, which contains the entire protein-coding region, was determined in 97 healthy unrelated individuals (62 men, 35 women) from the Niokholo Mandenka. This agriculturalist population from Eastern Senegal speaks a language belonging to Mande, a major primary branch of the Niger-Congo language family. We used DNA extracted from the lymphoblastoid cell lines (LCL) described in Excoffier et al. [44]. The sample size considered was sufficient to detect NAT2 variants present at a frequency ≥ 3%, with a probability of at least 99%.

(11 samples) or two SNPs (2 samples) out of the seven were missing. It involves either G191A (which has been shown to be extremely rare in Europeans and Asians) or the synonymous C282T polymorphism. These samples were excluded from LD analyses.
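The detection-power statement for the Mandenka sequencing sample (97 individuals, variants at frequency ≥ 3%) can be checked with a short calculation. The sketch assumes that the 2N sampled chromosomes are independent draws from the population; the function name is hypothetical.

```python
def detection_probability(n_individuals: int, allele_freq: float) -> float:
    """Probability of observing at least one copy of a variant among 2N chromosomes."""
    n_chromosomes = 2 * n_individuals
    # Complement of the probability that none of the chromosomes carries the variant.
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# 97 diploid individuals, variant frequency 3%.
prob = detection_probability(97, 0.03)
```

Under this assumption the probability is about 0.997, in line with the "at least 99%" stated in the text.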

Statistical analyses
All sequence analyses were performed on both the Mandenka sample and on each of the 12 African samples of Patin et al. [10]. Homologous sequences from one chimpanzee (Pan troglodytes; Ensembl Chimpanzee genome) and one rhesus monkey (Macaca mulatta; GenBank XM_001098734) were used to infer the ancestral state of each SNP. Each of these also served as an outgroup for evolutionary and population genetic tests.
DnaSP v.4.10 [47] was used to compute, in each sample, the nucleotide (π) and haplotype (H) diversity and Watterson's θw [48], as well as to perform several neutrality tests to detect signals of natural selection: Tajima's D [49], Fu and Li's F* and D* [50], Fu's Fs [51], and Fay and Wu's H [52]. The statistical significance of the tests was estimated from 10,000 coalescent simulations of an infinite-sites locus, conditional on sample size, both with and without recombination. The McDonald-Kreitman test [53] was applied to detect deviation from the neutral expectation that the ratio of nonsynonymous to synonymous polymorphisms within humans equals the ratio of nonsynonymous to synonymous fixed substitutions between humans and chimpanzee.
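For readers unfamiliar with the summary statistics listed above, the following minimal sketch computes the number of segregating sites S, the mean number of pairwise differences π, Watterson's θw, and Tajima's D from a set of aligned sequences. It follows the standard textbook formulas and is not the DnaSP implementation; the function name and the toy alignment are hypothetical, and significance testing by coalescent simulation is not covered.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple


def diversity_stats(seqs: Sequence[str]) -> Tuple[int, float, float, float]:
    """Return (S, pi, theta_w, Tajima's D) for aligned sequences of equal length.

    pi and theta_w are per-sequence values; divide by the alignment length
    to obtain per-site values.
    """
    n = len(seqs)
    length = len(seqs[0])
    # Number of segregating (polymorphic) sites.
    s = sum(1 for i in range(length) if len({q[i] for q in seqs}) > 1)
    # Mean number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    if s == 0:
        return 0, 0.0, 0.0, float("nan")  # D is undefined without variation
    # Constants of Tajima's (1989) variance estimator.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return s, pi, theta_w, d


# Toy alignment of four sequences (hypothetical data).
toy = ["AAAA", "AAAT", "AATT", "ATTT"]
seg_sites, pi_hat, theta_hat, tajima_d = diversity_stats(toy)
```

Tajima's D contrasts π with θw: under neutrality and constant population size the two estimate the same quantity, so D is expected to be close to zero.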
A coalescence model for the ancestral history of a sample of genes was used to estimate the time scale of polymorphic variation in the NAT2 gene. The time to the most recent common ancestor (TMRCA) and mutation ages were estimated from the NAT2 gene tree, conditional on a maximum-likelihood estimate of θ, the population mutation parameter. These estimates were computed with GeneTree v.9.0 [54] (running 10⁶ replications), under a standard coalescence model assuming neutrality, the infinite-sites mutation model (haplotypes presumably affected by recurrent mutation or recombination were removed from the analysis), random mating, and constant population size. All estimates were inferred on individual population samples, to avoid biases due to population structure. Time, scaled in 2Ne units, was converted into years by use of a 25-year generation time and the value of the effective population size (Ne), obtained as θ divided by 4μ. The neutral mutation rate per gene per generation (μ) was estimated based on human-chimpanzee sequence divergence, assuming a divergence time of 5 million years (My) [55].
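The unit conversions described in this paragraph can be sketched as follows. The relations Ne = θ/(4μ) and the 25-year generation time come from the text; the factor of 2 in the substitution rate (divergence accumulates along both lineages) is an assumption, although it does reproduce the reported 1.6 × 10⁻⁹ from the 1.6% divergence and 5 My divergence time. Function names and the θ and scaled-time values are hypothetical.

```python
def substitution_rate_per_site_per_year(divergence: float, divergence_time_years: float) -> float:
    """Per-site, per-year rate; divergence accumulates along both lineages."""
    return divergence / (2.0 * divergence_time_years)


def mutation_rate_per_gene_per_generation(
    rate_site_year: float, n_sites: int, generation_years: float = 25.0
) -> float:
    """Scale the per-site yearly rate to the whole locus and to one generation."""
    return rate_site_year * generation_years * n_sites


def effective_size(theta: float, mu_gene_generation: float) -> float:
    """Ne = theta / (4 * mu), with theta and mu both expressed per gene."""
    return theta / (4.0 * mu_gene_generation)


def coalescent_time_to_years(t_scaled: float, ne: float, generation_years: float = 25.0) -> float:
    """Convert a time expressed in units of 2Ne generations into years."""
    return t_scaled * 2.0 * ne * generation_years


rate = substitution_rate_per_site_per_year(0.016, 5e6)  # 1.6% divergence, 5 My
mu = mutation_rate_per_gene_per_generation(rate, n_sites=870)  # 870-bp coding exon
ne = effective_size(theta=2.0, mu_gene_generation=mu)  # theta is a hypothetical value
years = coalescent_time_to_years(t_scaled=1.5, ne=ne)  # t_scaled is a hypothetical value
```

Because Ne is derived from μ, any error in the assumed divergence time or generation time propagates directly into both Ne and the ages expressed in years.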
For the worldwide genotyping survey, we inferred NAT2 haplotypes from the unphased multi-locus genotypes using PHASE v.2.1 software [46]. The individual acetylation phenotypes were then predicted from the haplotype combination at NAT2, in accordance with the acknowledged classification of NAT2* alleles based on their functional impact [see Table 3]: individuals with two low activity alleles were classified as slow acetylators, those with two functional alleles as rapid acetylators, and those with both a slow and a functional allele as intermediate acetylators.
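The phenotype prediction rule described above can be written as a small lookup. The allele sets below are an illustrative subset taken from alleles mentioned in the text (NAT2*5B and NAT2*6J as slow alleles, NAT2*12g and NAT2*12H as alleles of unknown effect); the complete classification is the one given in Table 3, and the function and constant names are hypothetical.

```python
from typing import Collection

# Illustrative subset only; see Table 3 for the full allele classification.
SLOW_ALLELES = frozenset({"NAT2*5B", "NAT2*6J"})
UNKNOWN_ALLELES = frozenset({"NAT2*12g", "NAT2*12H"})


def predict_acetylator(
    hap1: str,
    hap2: str,
    slow: Collection[str] = SLOW_ALLELES,
    unknown: Collection[str] = UNKNOWN_ALLELES,
) -> str:
    """Predict the acetylation phenotype from the two haplotypes of an individual."""
    if hap1 in unknown or hap2 in unknown:
        return "unknown"
    # Count slow alleles: 0 -> rapid, 1 -> intermediate, 2 -> slow.
    n_slow = int(hap1 in slow) + int(hap2 in slow)
    return ("rapid", "intermediate", "slow")[n_slow]


phenotype = predict_acetylator("NAT2*4", "NAT2*5B")
```

Any haplotype that is in neither set is treated as functional, which is a simplification that only holds when the slow set is complete.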
Median-joining networks [56] describing the mutational relationships among the inferred NAT2 haplotypes were generated using Network 4.1.1 software [57].
Population structure in the worldwide genotyping survey set was investigated by an analysis of molecular variance (AMOVA) [58] that included the molecular distance matrix among NAT2 haplotypes. Population differentiation was tested by permutation tests (20,000 permutations) based on the FST statistic. Coancestry coefficients, or linearized FST values [59], were computed among populations and the resulting genetic distance matrix was used for multidimensional scaling analysis (MDS) [60] performed with the NTSYS v.2.1 software [61]. A Mantel test was applied to test the correlation of pairwise genetic distances with geographic distances, computed as great-circle distances between populations from their coordinates of latitude and longitude (US Caucasians and Ashkenazi Jews were excluded from the analysis since they could not be allocated precisely to a specific area). All calculations, including random-permutation procedures to assess statistical significance, were performed by use of the Arlequin v.3.0 package [62].
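The geographic part of this analysis can be illustrated with a minimal sketch of a great-circle distance (haversine formula) and a permutation-based Mantel test. This is not the Arlequin implementation: the Earth radius, the one-sided test, the function names, and the toy distance matrix are assumptions made for the example.

```python
import math
import random
from typing import List, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def _upper(m: Sequence[Sequence[float]], order: Sequence[int]) -> List[float]:
    """Upper triangle of m after reordering rows and columns by `order`."""
    n = len(order)
    return [m[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(
    gen: Sequence[Sequence[float]],
    geo: Sequence[Sequence[float]],
    permutations: int = 10000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed correlation, one-sided permutation p-value)."""
    ident = list(range(len(gen)))
    geo_vec = _upper(geo, ident)
    r_obs = _pearson(_upper(gen, ident), geo_vec)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = ident[:]
        rng.shuffle(perm)  # permute populations, i.e. rows and columns together
        if _pearson(_upper(gen, perm), geo_vec) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Toy example: four hypothetical populations placed on a line.
coords = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(a - b) for b in coords] for a in coords]
r_obs, p_val = mantel(dist, dist, permutations=200)
```

Permuting whole rows and columns, rather than individual matrix cells, respects the non-independence of pairwise distances that share a population.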
Pairwise LD between the seven genotyped SNPs was estimated by computing the r² statistic [63] with DnaSP [47], after the exclusion, in each population of the worldwide genotyping survey, of SNPs with minor allele frequency (MAF) < 0.05. Statistical significance of LD between SNP pairs was assessed using Fisher's exact tests followed by Bonferroni corrections. Mantel tests to compare r² matrices were performed using the program CADM [64]. Comparisons were made between populations within each continental group. Subsequently, r² values were recalculated for populations pooled into geographical groups and Mantel tests were applied.
draft of the manuscript. AL and PD participated in study design, supervised analyses, and were involved in drafting the manuscript. AL and EP provided the Mandenka sample. NG and RK participated in the NAT2 gene sequencing in the Mandenka sample. EP provided the conceptual framework for the study, supervised the statistical analyses and finalized the manuscript. All authors read and approved the final manuscript.