Inferring haplotypes at the NAT2 locus: the computational approach
 Audrey Sabbagh^{1} (corresponding author) and
 Pierre Darlu^{1}
DOI: 10.1186/1471-2156-6-30
© Sabbagh and Darlu; licensee BioMed Central Ltd. 2005
Received: 02 February 2005
Accepted: 02 June 2005
Published: 02 June 2005
Abstract
Background
Numerous studies have attempted to relate genetic polymorphisms within the N-acetyltransferase 2 gene (NAT2) to interindividual differences in response to drugs or in disease susceptibility. However, genotyping individual single-nucleotide polymorphisms (SNPs) alone may not always provide enough information to reach these goals. It is important to link SNPs into haplotypes, which carry more information about the genotype-phenotype relationship. Special analytical techniques have been designed to unequivocally determine the allocation of mutations to either DNA strand. However, molecular haplotyping methods are labour-intensive and expensive and do not appear to be good candidates for routine clinical applications. A cheap and relatively straightforward alternative is the use of computational algorithms. The objective of this study was to assess the performance of the computational approach in NAT2 haplotype reconstruction from phase-unknown genotype data, for population samples of various ethnic origins.
Results
We empirically evaluated the effectiveness of four haplotyping algorithms in predicting haplotype phases at NAT2, by comparing the results with those directly obtained through molecular haplotyping. All computational methods provided remarkably accurate and reliable estimates for NAT2 haplotype frequencies and individual haplotype phases. The Bayesian algorithm implemented in the PHASE program performed the best.
Conclusion
This investigation provides a solid basis for the confident and rational use of computational methods which appear to be a good alternative to infer haplotype phases in the particular case of the NAT2 gene, where there is near complete linkage disequilibrium between polymorphic markers.
Background
N-acetylation polymorphism is one of the earliest discovered and most intensively studied pharmacogenetic traits that underlie interindividual and interethnic differences in response to xenobiotics. In humans, acetylation is a major route of biotransformation for many arylamine and hydrazine drugs, as well as for a number of toxins and known carcinogens present in the diet, cigarette smoke and the environment [1–3]. Genetically determined differences in N-acetylation capacity have been shown to be important determinants of both the effectiveness of therapeutic response and the development of adverse drug reactions and toxicity during drug treatment [4]. In recent decades, numerous investigations have sought to elucidate the genetic basis of N-acetylation polymorphism in various ethnic groups in order to develop efficient genotyping tests and to adapt therapies to specific patients and populations in accordance with their genetic makeup. Some of the drugs excreted by acetylation are indeed crucial in the treatment of diseases representing a worldwide concern, such as tuberculosis and AIDS-related complex diseases [5, 6]. Moreover, a number of epidemiological studies have suggested possible associations between the N-acetylator phenotype and a variety of complex human diseases, the most consistent findings being those regarding urinary bladder cancer and familial Parkinson's disease [7–10].
The gene coding for the arylamine N-acetyltransferase 2 (NAT2) enzyme has been established as the site of the classic human acetylation polymorphism [11–13], and the molecular basis of individual and interethnic variation in acetylation capacity is now well documented [14, 15]. All mutations reported to date are found within the 870-bp coding region of the NAT2 gene. Among the seven single-nucleotide polymorphisms (SNPs) that are commonly found in human populations, four result in an amino acid substitution that leads to a significant decrease in acetylation capacity (single base-pair substitutions at positions 191, 341, 590 and 857). The other three are either silent mutations (C282T, C481T) or a nonsynonymous substitution that does not alter phenotype (A803G).
In the consensus gene nomenclature of human NAT2 that encompasses all currently recognized alleles [16, 17], sets of SNPs located throughout the coding region are linked in terms of haplotypes, that is they are organized as they segregate together on one individual's chromosome at the NAT2 locus. Each combination of SNPs identified so far constitutes a distinct haplotype that is treated as an allele of the haplotype system. The consideration of multilocus haplotypes seems more desirable since there is growing evidence that for genes containing multiple SNPs in high linkage disequilibrium (LD) such as NAT2 [18], haplotype structure rather than individual SNPs can be the principal determinant of phenotypic consequences [19–21]. A functional polypeptide is indeed the product of a haplotype, covering the entire coding region and coded by a single chromosome.
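The major human NAT2 alleles and their associated phenotype^{a}. Columns G191A to G857A give the nucleotide changes^{b}; x marks a change carried by the allele.
Allele     G191A  C282T  T341C  C481T  G590A  A803G  G857A  Phenotype
NAT2*4     -      -      -      -      -      -      -      rapid
NAT2*5A    -      -      x      x      -      -      -      slow
NAT2*5B    -      -      x      x      -      x      -      slow
NAT2*5C    -      -      x      -      -      x      -      slow
NAT2*6A    -      x      -      -      x      -      -      slow
NAT2*6B    -      -      -      -      x      -      -      slow
NAT2*7A    -      -      -      -      -      -      x      slow
NAT2*7B    -      x      -      -      -      -      x      slow
NAT2*12A   -      -      -      -      -      x      -      rapid
NAT2*12B   -      x      -      -      -      x      -      rapid
NAT2*13    -      x      -      -      -      -      -      rapid
NAT2*14A   x      -      -      -      -      -      -      slow
NAT2*14B   x      x      -      -      -      -      -      slow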
However, in spite of its high relevance, this issue has not been handled properly by most past studies investigating NAT2 polymorphisms. Early genotyping studies only screened for the presence of three polymorphisms (C481T, G590A, G857A), and a subject was defined as a slow acetylator if homozygous for one, or heterozygous for two (each located on a different DNA strand), of these three nucleotide changes. Such a definition assumed that there could be no single allele with two or more of the tested mutations. In other studies that screened a larger number of SNPs within NAT2, patterns of LD between point mutations were often assumed, in reference to the previously described haplotypes that are commonly found in populations of European origin. For instance, the designation of NAT2 alleles is usually based on the assumption that 481T and 803G are strongly linked to 341C, and 590A and 857A are linked to 282T [22]. However, in rare cases the typical allelic linkage pattern of mutations may be disrupted by genetic recombination, and this may result in misclassification of alleles. Indeed, although such assumed linkage patterns are very strong, other allelic variants carrying either unusual combinations of mutations or mutations in isolation have been described in a few cases [23]. Furthermore, designating NAT2 alleles so that they necessarily conform to the existing consensus nomenclature of acknowledged haplotypes precludes the disclosure of unexpected combinations of mutations, different from the established allelic variants, and hence the discovery of new alleles. Inferring haplotypes from unphased multilocus genotypes in this manner may introduce biases in NAT2 allele designation and individual phenotype prediction, and these potential biases are of higher magnitude when non-European populations are concerned.
Most studies assumed particular patterns of linkage previously described in populations of European origin, but which may not hold in other ethnic groups. Indeed, recent work has shown that patterns of LD can differ markedly among populations with different ethnic and demographic backgrounds. As an example, Loktionov and colleagues [24] pointed out the high occurrence of isolated mutations 803G and 282T (defining alleles NAT2*12A and NAT2*13, respectively) in Black South Africans, while these nucleotide changes are almost always tightly linked to other mutations in European populations [25]. Likewise, Dandara et al. [26] recently identified a novel mutation linkage pattern (NAT2*6E) that appeared to be common in three African populations and that had not yet been reported in Europeans. Similarly, Anitha et al. [27] revealed a new combination of acknowledged mutations (NAT2*5G) in the Malapandaram tribe of South India that has not been described so far in any other world population. The genotype-phenotype discordance observed in many ethnic groups where mutation linkages have not been extensively verified experimentally might result from such unexpected compound alleles. It is thus necessary to systematically verify the postulated allelic combinations.
To avoid such potential problems, many authors designed special analytical techniques to unequivocally determine the allocation of mutations to either DNA strand. Molecular methods using combinations of mutation-specific polymerase chain reaction (PCR) reamplification coupled to restriction mapping of the PCR products have been developed; these allow the separate analysis of each allele in order to obtain a complete map of both gene copies in every individual. Some studies applied these procedures to all multiply heterozygous subjects [24, 28–31], while others limited their application to particular cases, such as those where an alternative linkage pattern of mutations would lead to a change in phenotype [5, 27, 32–36]. However, these experimental methods of molecular haplotyping are not entirely satisfactory because they entail an additional cost and are currently labour-intensive, time-consuming, prone to experimental errors and difficult to automate. Therefore, they do not appear to be good candidates for routine clinical applications or for large-scale use.
A cheap and relatively straightforward alternative for haplotype reconstruction based on genotype data from unrelated individuals is the use of computational algorithms. The most widely used algorithms developed so far are based on a parsimony, a maximum-likelihood, or a Bayesian approach (see [37] for a review). In the last decade, numerous investigations based on empirical data and extensive simulation studies have demonstrated that such in silico haplotype-inference methods can give effective and accurate prediction of haplotype phases, especially in regions with high LD values between polymorphic sites and small probabilities of recombination events [18, 38, 39]. Therefore, they could be fairly efficient alternatives to molecular haplotyping methods when applied to NAT2 gene data. Surprisingly, to our knowledge, only three studies have used computational methods to reconstruct NAT2 haplotypes and estimate allele frequencies in population samples [40–42]. One explanation for such limited use may be the lack of evidence documenting the performance of in silico approaches when applied to actual NAT2 data. Indeed, the accuracy of these strategies, compared with molecular methods, needs to be assessed before their application can be advocated on a large scale. A recent study provided preliminary results on this issue: Xu and colleagues [18] empirically evaluated and compared the accuracy of Clark's algorithm [43], the expectation-maximisation (EM) algorithm and a Bayesian method implemented in the PHASE program [44] in phase inference at NAT2, taken as an example of a locus with pronounced LD over an 850-bp region. In this study, NAT2 haplotypes (consisting of five genotyped SNPs at positions 282, 341, 481, 590 and 803 nt) were experimentally determined through cloning and sequencing in 81 individuals of European ancestry. They found that all three computational methods provided remarkably accurate and reliable estimates for NAT2 haplotype frequencies and individual haplotype phases.
The objective of the present study was to extend this investigation to assess the performance of the computational approach more precisely. We conducted an extensive study based on experimental data from a larger number of samples, drawn from populations of various ethnic origins, and tested for haplotypes involving the seven major polymorphic loci of NAT2. Furthermore, the larger population samples investigated are of greater significance: as the sample size grows, there is more opportunity to observe rare haplotypes, which are the most difficult to infer statistically. This comparative study is designed to evaluate the performance of different haplotyping algorithms and to assess the consistency of their estimates. In addition, it provides information on the impact of various data set characteristics (sample size, haplotype frequency distribution, haplotype frequencies, deviation from Hardy-Weinberg (HW) equilibrium, ...) on estimation accuracy; we then explore the utility of data-based diagnostics for assessing probable accuracy.
Results
Among the 1608 individuals investigated over the five data sets, 45.5% (732/1608) were either homozygous for all SNP sites or heterozygous at only one SNP site; thus, their haplotype pairs could be assigned directly. In addition, 35.7% (574), 10.0% (160), 0.9% (15) and 7.9% (127) of individuals were heterozygous at two, three, four and five SNP sites, respectively. We inferred their haplotype phases with four computational haplotyping methods, and compared the results with those obtained through molecular haplotyping.
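To illustrate why only multiply heterozygous individuals need phase inference, the following minimal sketch enumerates the haplotype pairs compatible with an unphased genotype. The 0/1/2 coding (copies of the variant allele per SNP) and the function name `compatible_pairs` are assumptions made for this illustration; they are not taken from any of the programs evaluated here.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of 0, 1 or 2, the number of copies of the variant
    allele at each SNP (assumed coding for this sketch).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both haplotypes: 0 -> 0, 2 -> 1.
    base = [g // 2 for g in genotype]
    if not het_sites:
        hap = tuple(base)
        return [(hap, hap)]
    # Pin the variant allele of the first heterozygous site on haplotype 1
    # so that (h1, h2) and (h2, h1) are not counted twice.
    first, rest = het_sites[0], het_sites[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = base[:], base[:]
        h1[first], h2[first] = 1, 0
        for site, bit in zip(rest, bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# A genotype heterozygous at k sites has 2**(k - 1) candidate pairs.
for genotype in [(0, 0, 2, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0), (0, 1, 0, 0, 1, 0, 0)]:
    print(genotype, len(compatible_pairs(genotype)))
```

With zero or one heterozygous site the list contains a single pair, which is why those genotypes can be assigned directly; an individual heterozygous at five of the seven sites has sixteen candidate pairs.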
Since the Hapar program often provided several equally parsimonious solutions for a given multilocus genotype, it could not resolve a relatively large fraction of heterozygous individuals in each sample and hence, we could not deduce frequency estimates for the haplotypes observed. Therefore, we evaluated Hapar only on its ability to identify the set of haplotypes present in a sample.
Haplotype identification
Performance of the four computational methods in haplotype identification, as measured by the I_{H} index.
Population sample             Hapar      PLEM       Haplotyper  PHASE
258 Spanish [45]              0.933 (1)  0.933 (1)  0.933 (1)   0.933 (1)
137 Nicaraguans [30]          0.952 (1)  0.952 (1)  0.909 (2)   0.952 (1)
112 UK Caucasians [24]        1          1          1           1
101 Black South Africans [24] 0.917 (2)  0.917 (2)  0.917 (2)   1
1000 Koreans [31]             -*         1          1           1
Prediction of individual haplotype phases
Individual error rate in haplotype reconstruction.
Population sample             PLEM   Haplotyper  PHASE
258 Spanish [45]              0.39%  0.39%       0.39%
137 Nicaraguans [30]          2.19%  3.65%       2.19%
112 UK Caucasians [24]        0.89%  0%          0%
101 Black South Africans [24] 3.96%  3.96%       2.97%
1000 Koreans [31]             0.30%  0.30%       0.30%
Estimation of haplotype frequencies
Index of similarity (I_{F}) between haplotype frequencies estimated with and without molecular haplotyping information.
Population sample             PLEM   PHASE
258 Spanish [45]              0.996  0.996
137 Nicaraguans [30]          0.986  0.986
112 UK Caucasians [24]        0.994  0.998
101 Black South Africans [24] 0.981  0.988
1000 Koreans [31]             0.997  0.998
Average change coefficients of the PLEM and PHASE programs computed for three classes of haplotype frequency.
Program  Haplotype frequency
         < 1%   1–5%  > 5%
PLEM     30.6%  8.3%  1.2%
PHASE    17.4%  6.8%  1.2%
Partially resolved data sets
We also performed similar analyses on six other previously published data sets, in which linkage phase patterns were only partially resolved by molecular haplotyping. These data concerned 844 German [32], 248 Polish [33], 303 Turkish [34], 50 non-caste Dogons from Mali, 52 Gabonese and 60 Caucasians [5]. Haplotype phase information was available for 41%–74% of individuals in these six population samples (including phase-resolved genotypes as well as non-ambiguous homozygous or singly heterozygous genotypes). The PHASE algorithm was applied to the unphased multilocus genotype data of each of these samples, and 100% concordance was observed between individual haplotype phase reconstruction through the computational method and the empirically determined linkage patterns, for all investigated data sets (data not shown). This means that, despite the work, time and money invested to resolve mutation linkage phase in part of each sample, molecular haplotyping added no information beyond what could be extracted by computational algorithms applied to these data.
Discussion
This empirical study demonstrates how closely the frequencies computationally estimated from phase-unknown data approximate those from gene-counting estimates based on phase-known data. In the particular case of the NAT2 gene, where there is near-complete LD between SNPs within the coding region, all in silico approaches provided highly effective and accurate estimates for haplotype frequencies and individual haplotype phases. Estimated frequencies of common haplotypes were nearly identical to those empirically determined, whereas rare haplotypes were occasionally miscalled when their presence/absence had to be inferred. As already pointed out by Stephens et al. [44] and Lin et al. [39] and confirmed in this study, lower-frequency variants are less easily estimated statistically; indeed, there is less contextual information about phase for singletons versus non-singletons. Thus, for those research questions for which the NAT2 common haplotypes are most important, frequency estimates based on the unphased SNP-typing results from unrelated individuals will be sufficient. However, accurate identification of rare haplotypes may be critical for many researchers, such as population geneticists interested in detecting features of recent demographic history that are population-specific or signatures of selective effects in NAT2 sequences, as well as for epidemiologists and clinicians concerned with the possibility that rare haplotypes may be important for disease risk or for predicting drug response. In such cases, molecular haplotyping will be necessary to determine linkage phase unambiguously [57].
For a locus such as NAT2 where a strong haplotypic structure is observed, all algorithms provided highly effective and accurate results for haplotype reconstruction. Such "ideal" data for statistical inference therefore did not allow the different methods investigated to be clearly discriminated. Nevertheless, despite roughly similar performances, slightly better results were observed with the PHASE program. In particular, PHASE outperformed the other programs when frequencies of rare haplotypes had to be inferred. This is consistent with the results of some previous studies which evaluated and compared the performance of several algorithms on both empirical and simulated data [44, 54, 61]. PHASE provided the most accurate reconstructions, probably because the true haplotypes conformed more closely to the assumptions of the approximate coalescent prior than to those of the Dirichlet prior.
Many factors may influence the estimation accuracy of computational approaches. They can be assessed empirically within a dataset, to be further used as "diagnostics" for predicting potential inaccuracies in estimation caused by features in the relevant data set [38].
Sample size did not appear to have a large effect on the agreement between phase-known and phase-unknown haplotype frequency estimates for the five data sets included in this study. The low error rate observed in Koreans may be partly due to the large size of this sample (1000 individuals): an improvement in accuracy of the estimation procedure with increased sample size is indeed expected, since information redundancy in the form of multiple copies of the same haplotype in the data set is required for the statistical algorithms to work properly [38, 48]. On the other hand, computational methods may also perform well in small samples, in which there is little chance of observing the rare haplotypes that are the most difficult to infer statistically. Nevertheless, since the number of new haplotypes is not expected to increase linearly with sample size, the analysis of sufficiently large samples should guarantee good reliability of the resulting estimates.
The five phase-resolved NAT2 molecular data sets investigated.
Population sample             Proportion of phase-unknown multiple heterozygotes^{a}  NAT2 gene diversity^{b}  Deviation from the Hardy-Weinberg equilibrium (exact p-value)^{c}
258 Spanish [45]              66.7%                                                   0.65                     0.012
137 Nicaraguans [30]          59.1%                                                   0.70                     0.072
112 UK Caucasians [24]        52.7%                                                   0.69                     0.222
101 Black South Africans [24] 63.4%                                                   0.86                     0.122
1000 Koreans [31]             50.0%                                                   0.52                     0.016
Among the five data sets investigated in this study, Black South Africans displayed the highest error rate in computational haplotype inference. One possible explanation may be the presence in this sample of a large number of different multiple heterozygotes with ambiguous multilocus genotypes, occurring at roughly similar frequencies (this is reflected in the high NAT2 gene diversity displayed by this population sample (Table 2)). Indeed, both the number of different ambiguous multiple heterozygous genotypes and their relative frequencies have been shown to be of high importance in the assessment of haplotype estimation accuracy [57]: both would be good indicators of the difficulty level of a given data set for haplotyping algorithms. The existence of many different multilocus genotypes uniformly distributed implies that many different haplotypes occur at low frequency and that, consequently, the greatest error and uncertainty occur in the estimation of haplotype frequencies (since no single haplotype is overwhelmingly frequent). In contrast, the presence of a small number of multiple heterozygous genotypes at proportionately high frequencies implies that some individual haplotypes exist at high frequencies, and the estimation of those haplotype frequencies will be easier and accomplished with greater accuracy [38, 57]. In such cases, molecular haplotyping may add little information for the resolution of haplotype phases.
The amount of LD between SNP markers may be another determining factor for the prediction of estimation reliability since when multiple polymorphic sites display little disequilibrium, as was observed in the African sample compared to the others, a large proportion of the chromosomes may occur as uncommon or rare haplotypes, implying a greater difficulty level in the statistical inference of haplotypes.
Therefore, we advocate examining the unphased NAT2 genotype data beforehand for both the frequency distribution of multiply heterozygous genotypes and the level of LD between polymorphic markers; this makes it possible to assess the difficulty level of the data set for statistical inference and hence to predict how accurately computational algorithms would infer haplotype phases from such data.
Of course, statistical methods can be used in conjunction with experimental methods to provide more accurate estimates of individual haplotypes. It has been claimed that the ability of certain computational methods to accurately assess the uncertainty associated with each phase call gives them the substantial practical advantage of allowing experimental effort to be directed at sites and/or individuals whose phases are most difficult to reconstruct statistically or that are critical to the conclusions of the study [20, 44, 61]. However, in our study, we observed that most erroneous phase calls inferred at the individual level were strongly supported, with a probability close to the maximal value of 1. Thus, these errors could not have been avoided since they would not have been selected for the molecular targeting. This stresses why, in the case of the discovery of a novel NAT2 allele through computational haplotyping methods, the unusual linkage pattern should always be confirmed by cloning and sequencing the allele under question, as advocated by Cascorbi and Roots [25] for novel allelic combinations detected by molecular techniques.
Throughout this study, we assumed that the molecularly determined NAT2 linkage patterns were the "true" ones, and hence that there was no error in the haplotype assignments based on experimental methods. However, molecular techniques may have experimental error rates as high as the rate of statistical error associated with the computational haplotype determination algorithms [19]. Indeed, molecular haplotyping bears the risk of false positive or false negative allele-specific amplification (because of the nucleotide-dependent specificity of that technique) as well as incomplete or non-specific digestions with the enzymes used in restriction analyses [25]. In the present study, we have estimated the computational error rates to be no more than 3–4% for all investigated algorithms. This is not substantially higher than the corresponding error rate from molecular haplotyping techniques, on the order of 2–3% [20]. Therefore, it is difficult to determine whether the discrepancies observed between experimental and computational estimates are actually due to statistical errors from the algorithms; they may instead be due to technical errors during laboratory manipulations, in which case the molecular data used as a reference for comparisons might be wrong.
The disadvantage of in silico approaches is that algorithmic techniques are statistical and require the analysis of a population rather than a single individual. This is not a limitation in clinical trials and epidemiological surveys, which are always performed on a cohort basis. In clinical pharmacy, however, if a specific individual's haplotypes are of interest to predict his response to drug treatment, his unphased multilocus genotype must be combined with a standard reference set of haplotypes to infer the phasing [20]. This implies a thorough knowledge of the NAT2 genotypic distribution in the ethnic population from which this individual was drawn.
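As a rough illustration of how a single individual's genotype might be phased against a reference set of haplotypes, the sketch below ranks the compatible haplotype pairs by their probability under Hardy-Weinberg proportions. This is only one simple way to use a reference set, not the procedure of reference [20]; the function name, the 0/1/2 genotype coding and the reference frequencies in the example are all hypothetical.

```python
from itertools import product


def phase_against_reference(genotype, reference_freqs):
    """Rank the haplotype pairs compatible with one unphased genotype.

    genotype: per-SNP count of the variant allele (0, 1 or 2).
    reference_freqs: dict mapping haplotypes (tuples of 0/1) to population
        frequencies; a hypothetical input standing in for a phased
        reference sample from the relevant population.
    Returns [((hap1, hap2), posterior), ...] sorted by decreasing posterior,
    or an empty list if no compatible pair has support in the reference.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    weights = {}
    for bits in product((0, 1), repeat=len(het)):
        h1 = [g // 2 for g in genotype]
        h2 = h1[:]
        for site, bit in zip(het, bits):
            h1[site], h2[site] = bit, 1 - bit
        pair = tuple(sorted((tuple(h1), tuple(h2))))
        f1 = reference_freqs.get(pair[0], 0.0)
        f2 = reference_freqs.get(pair[1], 0.0)
        # Hardy-Weinberg probability of the unordered pair.
        weights[pair] = (1 if pair[0] == pair[1] else 2) * f1 * f2
    total = sum(weights.values())
    if total == 0:
        return []
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(pair, w / total) for pair, w in ranked]


# Hypothetical two-SNP reference frequencies, for illustration only.
reference = {(0, 0): 0.60, (1, 1): 0.35, (0, 1): 0.03, (1, 0): 0.02}
for pair, posterior in phase_against_reference((1, 1), reference):
    print(pair, round(posterior, 4))
```

The quality of such a call depends entirely on how well the reference frequencies describe the population the individual comes from, which is the point made in the paragraph above.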
Conclusion
This study demonstrates that computational methods can provide an effective and accurate prediction of haplotype phases in the particular case of the NAT2 gene, which displays high values of LD between polymorphic sites. The objective of this study is not to advocate the systematic use of computational approaches for NAT2 haplotype inference at the expense of molecular haplotyping methods. We are convinced that the latter remain the most reliable and effective way to resolve linkage phase patterns and that they can produce, for a fixed sample size, much more precise estimates of haplotype frequencies than other approaches [63]. However, the considerable effort required to obtain and analyse individual chromosomes makes alternative designs preferable, and the in silico approach appears to be the most practical one. Thus, for researchers not willing to invest time and money in the preliminary step of NAT2 haplotype reconstruction, the use of computational algorithms constitutes a safe and effective way to obtain reliable haplotypic data on which further analyses can be carried out. Once haplotypes are constructed, various statistical methods can be applied to NAT2 haplotype data to detect allele-disease associations or to classify patients according to their acetylation status.
Methods
NAT2 molecular data sets
To evaluate the performance of in silico approaches in NAT2 haplotype reconstruction, we based our study on data collected from the literature, for which linkage phase was resolved directly through molecular haplotyping. Molecular data from five previously published data sets were analysed: they concerned 258 Spanish from Central Spain [45], 137 Nicaraguans with a Central American Indian-European mixed origin [30], 112 British from the Cambridge area [24], 101 Black South Africans (mostly Tswana-speaking people) [24], and 1000 Koreans [31]. All subjects included in these studies were randomly selected, unrelated healthy volunteers whose ethnic origin had been clearly defined. In each population sample, seven SNPs were typed at NAT2 for all individuals (no missing data), and mutation linkage phase of all multiply heterozygous individuals was resolved molecularly through allele-specific PCR and restriction mapping. A summary description of the data sets is given in Table 2. These data provide an opportunity to compare haplotype frequencies estimated by direct gene counting on experimentally haplotyped data with haplotype frequencies estimated by haplotyping algorithms when phase information is ignored.
Throughout this report, we use the term "phase-known" to refer to an individual's genetic constitution for the NAT2 haplotype system, including the linkage phase of the component SNP alleles, whereas we use the term "phase-unknown" to refer to an individual's multilocus genotype in the absence of phase information.
Computational haplotyping methods
We evaluated the ability of four population-based haplotype inference methods to reconstruct NAT2 haplotypes from the phase-unknown genotype data.
Hapar
The first method is based on maximum parsimony: it searches for a minimum set of haplotypes that explains the observed genotype samples. Clark's method, the first algorithm developed for haplotype reconstruction [43], which can be viewed as a sort of parsimony approach, requires homozygotes or single-site heterozygotes in the sample to start its inferential cascade. Wang and Xu [46] overcame this limitation by designing an algorithm with a global optimization goal. This method, recently implemented in the Hapar program [46], was tested at its default settings on the phase-unknown NAT2 data.
PLEM
We also applied the EM algorithm [47] to obtain the maximum-likelihood estimates of haplotype frequencies in the samples, given the observed data [48–50]. This algorithm starts with arbitrary initial values of haplotype frequencies and iteratively updates the frequency estimates to maximize the log-likelihood function, until convergence is reached. Several EM-based algorithms have been developed. We used three different implementations, Arlequin [51], HAPLO [49] and PLEM [52], which all gave identical results on comparable analyses (data not shown). We therefore present only the results obtained with the PLEM program: this software implements an algorithm derived from the standard EM but incorporating the computational strategy of partition-ligation [53] to handle a larger number of loci. We performed 50 independent runs with different initial conditions to minimize the chance of convergence to a local maximum and so find the global maximum-likelihood estimates. For a given unphased genotype pattern, the probability of each possible haplotype configuration was calculated by using the estimated population haplotype frequencies, and all compatible haplotype phases with non-trivial probabilities were generated. The haplotype pair with the greatest probability was considered to be the haplotype phase for each individual, and population haplotype frequencies were estimated as a function of each inferred haplotype pair, weighted by their estimated probability.
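The iterative scheme described above can be sketched as a plain EM loop. This is a minimal illustration assuming Hardy-Weinberg proportions, a 0/1/2 genotype coding and a single uniform starting point; it does not include partition-ligation or multiple restarts and is not the PLEM implementation. All function names are hypothetical.

```python
from itertools import product


def phasings(genotype):
    """All unordered haplotype pairs compatible with a genotype coded 0/1/2 per SNP."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1 = [g // 2 for g in genotype]
        h2 = h1[:]
        for site, bit in zip(het, bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, max_iter=500, tol=1e-10):
    """Maximum-likelihood haplotype frequencies from unphased genotypes (plain EM)."""
    expansions = [phasings(g) for g in genotypes]
    haps = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(max_iter):
        counts = dict.fromkeys(haps, 0.0)
        for pairs in expansions:
            # E-step: posterior weight of each phasing under Hardy-Weinberg.
            weights = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the 2n chromosomes.
        new_freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq


# Toy two-SNP sample: 5 wild-type homozygotes, 3 variant homozygotes,
# 2 double heterozygotes whose phase is ambiguous.
sample = [(0, 0)] * 5 + [(2, 2)] * 3 + [(1, 1)] * 2
for hap, f in em_haplotype_frequencies(sample).items():
    print(hap, round(f, 4))
```

In this toy sample the abundance of the two homozygous classes pulls the double heterozygotes towards the phasing that reuses the haplotypes already observed, which is the kind of information redundancy the method relies on. In practice one would restart from several random initial frequency vectors, as done in the 50 runs reported above.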
Haplotyper and PHASE
Finally, two Bayesian statistical methods based on Gibbs sampling were applied to the phase-unknown NAT2 data. Such methods treat the unknown haplotypes as random quantities and combine prior information (beliefs about what sorts of haplotype patterns are expected in population samples) with the likelihood (the information in the observed data) [54]. The conceptual difference between the two Bayesian algorithms investigated lies in the prior information incorporated into the statistical model. The algorithm implemented in Haplotyper [53] uses a Dirichlet prior distribution, which assumes that the genetic sequence of a mutant offspring does not depend on the progenitor sequence [54]. By contrast, the algorithm implemented in PHASE [44] uses a prior approximating the coalescent, one of the evolutionary models most commonly used in population genetics (see [55] for a review): it assumes that unresolved haplotypes will tend to be the same as, or similar to, known haplotypes.
We used the latest version of PHASE (v2.1 [54]) with the default parameter values for the Markov chain Monte Carlo simulations. For each data set, we applied the algorithm ten times with different seeds for the random number generator and checked the consistency of the results across the independent runs, to verify that the algorithm had not converged to a local, rather than global, mode of the posterior distribution. We retained the results from the run displaying the best average goodness-of-fit of the estimated haplotypes to the underlying coalescent model. To evaluate the Bayesian algorithm implemented in Haplotyper, we performed 20 independent runs of the program on each sample. This software could not be run directly on the 1000-Korean sample, as it can handle at most 500 individuals per data set. To circumvent this limitation, we randomly generated ten pairs of complementary data sets, each composed of 500 individuals, and ran Haplotyper on each of them. Results were averaged over the ten complete data sets.
Both Haplotyper and PHASE provide a list of the most likely pairs of haplotypes for each subject. They also quantify the uncertainty associated with each phase call by outputting an estimate of the probability that the call is correct, which guards against overconfidence in statistically reconstructed haplotypes.
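The splitting of the 1000-individual sample into complementary halves could be done along the following lines. This is only a sketch: the text does not say how the random partitions were drawn, so the shuffling scheme, the function name and the seed are hypothetical.

```python
import random


def complementary_splits(individuals, n_pairs=10, seed=0):
    """Return n_pairs of (half_a, half_b); each pair covers every individual once."""
    rng = random.Random(seed)  # fixed seed so that the partitions are reproducible
    splits = []
    for _ in range(n_pairs):
        shuffled = list(individuals)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        splits.append((shuffled[:mid], shuffled[mid:]))
    return splits
```

Each half would then be analysed separately and the two halves of a pair recombined into one complete data set before averaging over the ten pairs.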
Measures of estimation accuracy
Computational algorithms for haplotype reconstruction may be used for many different purposes. We focus here on three particular tasks: finding the list of all haplotypes present in a sample, inferring the most likely pair of haplotypes for each sampled individual, and estimating haplotype frequencies in the population. Accordingly, three different measures of accuracy were used to evaluate the performance of the tested algorithms.
haplotype identification
To assess accuracy in terms of haplotype identification, we used the I_{H} index introduced by Excoffier and Slatkin [48]. It compares the number of different haplotypes detected experimentally with the number of different haplotypes inferred by the computer programs. We considered a given haplotype to be identified as present in the sample if its estimated frequency was above the threshold value of 1/(2n) in a population sample of n individuals. I_{H} is given by:
I_{H} = 2(k_{true} - k_{missed}) / (k_{true} + k_{est})
where k_{true} is the number of haplotypes in the true sample, k_{est} is the number of estimated haplotypes with frequency above the threshold, and k_{missed} is the number of true haplotypes not identified in the sample.
Values of I_{H} range from 1 (when the computationally identified haplotypes are exactly the same as those determined experimentally) to 0 (when none of the true haplotypes is identified computationally).
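As an illustration, the index can be computed as below, assuming the Excoffier and Slatkin form 2(k_true - k_missed)/(k_true + k_est). The function name and the dictionary input are hypothetical, and treating a frequency exactly equal to 1/(2n) as passing the threshold is an assumption of this sketch.

```python
def haplotype_identification_index(true_haps, est_freqs, n_individuals):
    """I_H from the set of true haplotypes and estimated haplotype frequencies."""
    threshold = 1.0 / (2 * n_individuals)
    # Assumption: the boundary value 1/(2n) itself counts as identified.
    called = {h for h, f in est_freqs.items() if f >= threshold}
    true_set = set(true_haps)
    k_true = len(true_set)
    k_est = len(called)
    k_missed = len(true_set - called)
    return 2 * (k_true - k_missed) / (k_true + k_est)
```

A spurious haplotype lowers the index through k_est, and a missed true haplotype lowers it through k_missed, so both kinds of error are penalised.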
reconstruction of the haplotypes of each sampled individual
We specified the haplotype pair for an individual by choosing the most probable haplotype pair consistent with the individual's multilocus genotype. We measured performance by the individual error rate, which is the proportion of individuals whose haplotype pairs were incorrectly inferred by the program [53].
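The individual error rate is simple to compute once the experimentally determined and inferred haplotype pairs are listed in the same order of individuals. The sketch below assumes that each pair is an unordered 2-tuple of haplotype labels; the function name is hypothetical.

```python
def individual_error_rate(true_pairs, inferred_pairs):
    """Proportion of individuals whose inferred haplotype pair is wrong."""
    if len(true_pairs) != len(inferred_pairs):
        raise ValueError("both lists must describe the same individuals")
    # Sort within each pair so that (A, B) and (B, A) compare as equal.
    wrong = sum(
        1
        for t, i in zip(true_pairs, inferred_pairs)
        if tuple(sorted(t)) != tuple(sorted(i))
    )
    return wrong / len(true_pairs)
```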
estimation of sample haplotype frequencies
To examine how close the computationally estimated haplotype frequencies are to the frequencies observed in the phase-known data, we used the similarity index I_{F} of Renkonen [56], defined as the proportion of haplotype frequencies in common between the estimated and observed frequency distributions [18, 48].
Since this index gives more weight to high-frequency haplotypes, we used a second criterion to assess the accuracy of the computational algorithms in haplotype frequency estimation: the change coefficient C defined by Tishkoff et al. [57].
This coefficient measures the percentage change in haplotype frequencies across the two information conditions (phase-known versus phase-unknown data). C coefficients were computed for each possible haplotype in each population. The value of C ranges from 0 to 1, with 0 indicating that the estimated and observed frequencies are identical. The maximal value of 1 indicates that molecular haplotyping revealed a haplotype to which computational haplotyping assigned a frequency of zero, or vice versa.
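For the similarity index I_{F}, a minimal sketch is given below. It assumes the usual form of Renkonen's index, the sum over haplotypes of the smaller of the two frequencies, which equals one minus half the sum of absolute differences when both distributions sum to 1. The function name and the dictionary inputs are hypothetical. The change coefficient C is not sketched here because its formula is not reproduced in the text.

```python
def similarity_index(est_freqs, obs_freqs):
    """Renkonen-type similarity: shared proportion of two frequency distributions."""
    haps = set(est_freqs) | set(obs_freqs)
    # A haplotype absent from one distribution has frequency 0 there.
    return sum(min(est_freqs.get(h, 0.0), obs_freqs.get(h, 0.0)) for h in haps)
```

Because each term is bounded by the smaller frequency, a rare haplotype can contribute at most its own frequency to the index, which is why discrepancies among common haplotypes dominate I_{F}.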
Measure of pairwise LD between SNP markers
We used the phase-known data to quantify the amount of LD between all pairs of polymorphic sites by computing the correlation coefficient r^{2} [58] for each population sample separately. This statistic is expected to equal 1 (perfect LD) when the variation at the two sites segregates in a population as only two distinct haplotypes. Statistical significance of LD between pairs of sites was assessed by Fisher's exact tests. Computations were performed with the software PowerMarker v3.21 [59], and a graphical summary of the disequilibrium matrices was displayed by the GOLD program [60].
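For two biallelic sites, r^{2} can be obtained directly from phase-known haplotypes as D^2 divided by the product of the four allele frequencies, where D = p_AB - p_A p_B. The sketch below illustrates this calculation; it is not the PowerMarker implementation, and the 0/1 allele coding and function name are assumptions.

```python
def r_squared(haplotypes, i, j):
    """r^2 between sites i and j from phased haplotypes coded as 0/1 alleles."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom
```

When only the two haplotypes 00 and 11 are present the function returns 1, in line with the statement above, and when all four haplotypes are equally frequent it returns 0.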
Abbreviations
SNP: Single nucleotide polymorphism
EM: Expectation-Maximisation algorithm
LD: Linkage disequilibrium
NAT2: N-acetyltransferase 2
References
 Minchin RF, Reeves PT, Teitel CH, McManus ME, Mojarrabi B, Ilett KF, Kadlubar FF: N- and O-acetylation of aromatic and heterocyclic amine carcinogens by human monomorphic and polymorphic acetyltransferases expressed in COS-1 cells. Biochem Biophys Res. 1992, 185: 839-844. 10.1016/0006291X(92)91703S.
 Hein DW, Doll MA, Rustan TD, Gray K, Feng Y, Ferguson RJ, Grant DM: Metabolic activation and deactivation of arylamine carcinogens by recombinant human NAT1 and polymorphic NAT2 acetyltransferases. Carcinogenesis. 1993, 14: 1633-1638.
 Hein DW: Molecular genetics and function of NAT1 and NAT2: role in aromatic amine metabolism and carcinogenesis. Mutat Res. 2002, 506–507: 65-77.
 Meisel P: Arylamine N-acetyltransferases and drug response. Pharmacogenomics. 2002, 3: 349-366. 10.1517/14622416.3.3.349.
 Delomenie C, Sica L, Grant DM, Krishnamoorthy R, Dupret JM: Genotyping of the polymorphic N-acetyltransferase (NAT2*) gene locus in two native African populations. Pharmacogenetics. 1996, 6: 177-185.
 Asprodini EK, Zifa E, Papageorgiou I, Benakis A: Determination of N-acetylation phenotyping in a Greek population using caffeine as a metabolic probe. Eur J Drug Metab Pharmacokinet. 1998, 23: 501-506.
 Cartwright RA, Glashan RW, Rogers HJ, Ahmad RA, Barham-Hall D, Higgins E, Kahn MA: Role of N-acetyltransferase phenotypes in bladder carcinogenesis: a pharmacogenetic epidemiological approach to bladder cancer. Lancet. 1982, 2: 842-845. 10.1016/S01406736(82)908108.
 Risch A, Wallace DM, Bathers S, Sim E: Slow N-acetylation genotype is a susceptibility factor in occupational and smoking related bladder cancer. Hum Mol Genet. 1995, 4: 231-236.
 Bialecka M, Gawronska-Szklarz B, Drozdzik M, Honczarenko K, Stankiewicz J: N-acetyltransferase 2 polymorphism in sporadic Parkinson's disease in a Polish population. Eur J Clin Pharmacol. 2002, 57: 857-862. 10.1007/s0022800104154.
 Chan DK, Lam MK, Wong R, Hung WT, Wilcken DE: Strong association between N-acetyltransferase 2 genotype and PD in Hong Kong Chinese. Neurology. 2003, 60: 1002-1005.
 Blum M, Grant DM, McBride W, Heim M, Meyer UA: Human arylamine N-acetyltransferase genes: isolation, chromosomal localization, and functional expression. DNA Cell Biol. 1990, 9: 193-203.
 Ohsako S, Deguchi T: Cloning and expression of cDNAs for polymorphic and monomorphic arylamine N-acetyltransferases from human liver. J Biol Chem. 1990, 265: 4630-4634.
 Blum M, Demierre A, Grant DM, Heim M, Meyer UA: Molecular mechanism of slow acetylation of drugs and carcinogens in humans. Proc Natl Acad Sci. 1991, 88: 5237-5241.
 Grant DM, Hughes NC, Janezic SA, Goodfellow GH, Chen HJ, Gaedigk A, Yu VL, Grewal R: Human acetyltransferase polymorphisms. Mutat Res. 1997, 376: 61-70.
 Upton A, Johnson N, Sandy J, Sim E: Arylamine N-acetyltransferases – of mice, men and microorganisms. Trends Pharmacol Sci. 2001, 22: 140-146. 10.1016/S01656147(00)016394.
 Vatsis KP, Weber WW, Bell DA, Dupret JM, Evans DA, Grant DM, Hein DW, Lin HJ, Meyer UA, Relling MV, et al: Nomenclature for N-acetyltransferases. Pharmacogenetics. 1995, 5: 1-17.
 Hein DW, Grant DM, Sim E: Update on consensus arylamine N-acetyltransferase gene nomenclature. Pharmacogenetics. 2000, 10: 291-292. 10.1097/0000857120000600000002.
 Xu CF, Lewis K, Cantone KL, Khan P, Donnelly C, White N, Crocker N, Boyd PR, Zaykin DV, Purvis IJ: Effectiveness of computational methods in haplotype prediction. Hum Genet. 2002, 110: 148-156. 10.1007/s0043900106564.
 Judson R, Stephens JC, Windemuth A: The predictive power of haplotypes in clinical response. Pharmacogenomics. 2000, 1: 15-26. 10.1517/14622416.1.1.15.
 Judson R, Stephens JC: Notes from the SNP vs. haplotype front. Pharmacogenomics. 2001, 2: 7-10. 10.1517/14622416.2.1.7.
 McDonald OG, Krynetski EY, Evans WE: Molecular haplotyping of genomic DNA for multiple single-nucleotide polymorphisms located kilobases apart using long-range polymerase chain reaction and intramolecular ligation. Pharmacogenetics. 2002, 12: 93-99. 10.1097/0000857120020300000003.
 Zschieschang P, Hiepe F, Gromnica-Ihle E, Roots I, Cascorbi I: Lack of association between arylamine N-acetyltransferase 2 (NAT2) polymorphism and systemic lupus erythematosus. Pharmacogenetics. 2002, 12: 559-563. 10.1097/0000857120021000000008.
 Agundez JA, Ladero JM, Olivera M, Lozano L, Fernandez-Arquero M, de la Concha EG, Diaz-Rubio M, Benitez J: N-acetyltransferase 2 polymorphism is not related to the risk of advanced alcoholic liver disease. Scand J Gastroenterol. 2002, 37: 99-103. 10.1080/003655202753387437.
 Loktionov A, Moore W, Spencer SP, Vorster H, Nell T, O'Neill IK, Bingham SA, Cummings JH: Differences in N-acetylation genotypes between Caucasians and Black South Africans: implications for cancer prevention. Cancer Detect Prev. 2002, 26: 15-22. 10.1016/S0361090X(02)000107.
 Cascorbi I, Roots I: Pitfalls in N-acetyltransferase 2 genotyping. Pharmacogenetics. 1999, 9: 123-127.
 Dandara C, Masimirembwa CM, Magimba A, Kaaya S, Sayi J, Sommers de K, Snyman JR, Hasler JA: Arylamine N-acetyltransferase (NAT2) genotypes in Africans: the identification of a new allele with nucleotide changes 481C>T and 590G>A. Pharmacogenetics. 2003, 13: 55-58. 10.1097/0000857120030100000008.
 Anitha A, Banerjee M: Arylamine N-acetyltransferase 2 polymorphism in the ethnic populations of South India. Int J Mol Med. 2003, 11: 125-131.
 Martinez C, Agundez JA, Olivera M, Martin R, Ladero JM, Benitez J: Lung cancer and mutations at the polymorphic NAT2 gene locus. Pharmacogenetics. 1995, 5: 207-214.
 Agundez JA, Olivera M, Martinez C, Ladero JM, Benitez J: Identification and prevalence study of 17 allelic variants of the human NAT2 gene in a white population. Pharmacogenetics. 1996, 6: 423-428.
 Martinez C, Agundez JA, Olivera M, Llerena A, Ramirez R, Hernandez M, Benitez J: Influence of genetic admixture on polymorphisms of drug-metabolizing enzymes: analyses of mutations on NAT2 and CYP2E1 genes in a mixed Hispanic population. Clin Pharmacol Ther. 1998, 63: 623-628. 10.1016/S00099236(98)900856.
 Lee SY, Lee KA, Ki CS, Kwon OJ, Kim HJ, Chung MP, Suh GY, Kim JW: Complete sequencing of a genetic polymorphism in NAT2 in the Korean population. Clin Chem. 2002, 48: 775-777.
 Cascorbi I, Drakoulis N, Brockmoller J, Maurer A, Sperling K, Roots I: Arylamine N-acetyltransferase (NAT2) mutations and their allelic linkage in unrelated Caucasian individuals: correlation with phenotypic activity. Am J Hum Genet. 1995, 57: 581-592.
 Mrozikiewicz PM, Cascorbi I, Brockmoller J, Roots I: Determination and allelic allocation of seven nucleotide transitions within the arylamine N-acetyltransferase gene in the Polish population. Clin Pharmacol Ther. 1996, 59: 376-382. 10.1016/S00099236(96)901046.
 Aynacioglu AS, Cascorbi I, Mrozikiewicz PM, Roots I: Arylamine N-acetyltransferase (NAT2) genotypes in a Turkish population. Pharmacogenetics. 1997, 7: 327-331.
 Meisel P, Schroeder C, Wulff K, Siegmund W: Relationship between human genotype and phenotype of N-acetyltransferase (NAT2) as estimated by discriminant analysis and multiple linear regression: 1. Genotype and N-acetylation in vivo. Pharmacogenetics. 1997, 7: 241-246.
 Kukongviriyapan V, Prawan A, Tassaneyakul W, Aiemsa-Ard J, Warasiha B: Arylamine N-acetyltransferase-2 genotypes in the Thai population. Br J Clin Pharmacol. 2003, 55: 278-281. 10.1046/j.13652125.2003.01766.x.
 Niu T: Algorithms for inferring haplotypes. Genet Epidemiol. 2004, 27: 334-347. 10.1002/gepi.20024.
 Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet. 2000, 67: 947-959. 10.1086/303069.
 Lin S, Cutler DJ, Zwick ME, Chakravarti A: Haplotype inference in random population samples. Am J Hum Genet. 2002, 71: 1129-1137. 10.1086/344347.
 Tanaka E, Taniguchi A, Urano W, Nakajima H, Matsuda Y, Kitamura Y, Saito M, Yamanaka H, Saito T, Kamatani N: Adverse effects of sulfasalazine in patients with rheumatoid arthritis are associated with diplotype configuration at the N-acetyltransferase 2 gene. J Rheumatol. 2002, 29: 2492-2499.
 Jorge-Nebert LF, Eichelbaum M, Griese EU, Inaba T, Arias TD: Analysis of six SNPs of NAT2 in Ngawbe and Embera Amerindians of Panama and determination of the Embera acetylation phenotype using caffeine. Pharmacogenetics. 2002, 12: 39-48. 10.1097/0000857120020100000006.
 Barrett JH, Smith G, Waxman R, Gooderham N, Lightfoot T, Garner RC, Augustsson K, Wolf CR, Bishop DT, Forman D: Investigation of interaction between N-acetyltransferase 2 and heterocyclic amines as potential risk factors for colorectal cancer. Carcinogenesis. 2003, 24: 275-282. 10.1093/carcin/24.2.275.
 Clark AG: Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol. 1990, 7: 111-122.
 Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001, 68: 978-989. 10.1086/319501.
 Agundez JA, Olivera M, Ladero JM, Rodriguez-Lescure A, Ledesma MC, Diaz-Rubio M, Meyer UA, Benitez J: Increased risk for hepatocellular carcinoma in NAT2-slow acetylators and CYP2D6-rapid metabolizers. Pharmacogenetics. 1996, 6: 501-512.
 Wang L, Xu Y: Haplotype inference by maximum parsimony. Bioinformatics. 2003, 19: 1773-1780. 10.1093/bioinformatics/btg239.
 Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc. 1977, 39: 1-38.
 Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol. 1995, 12: 921-927.
 Hawley ME, Kidd KK: HAPLO: a program using the EM algorithm to estimate the frequencies of multisite haplotypes. J Hered. 1995, 86: 409-411.
 Long JC, Williams RC, Urbanek M: An EM algorithm and testing strategy for multiple-locus haplotypes. Am J Hum Genet. 1995, 56: 799-810.
 Schneider S, Roessli D, Excoffier L: Arlequin ver. 2.000: a software for population genetics data analysis. 2000, Genetics and Biometry Laboratory, University of Geneva, Switzerland.
 Qin ZS, Niu T, Liu JS: Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet. 2002, 71: 1242-1247. 10.1086/344207.
 Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet. 2002, 70: 157-169. 10.1086/338446.
 Stephens M, Donnelly P: A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet. 2003, 73: 1162-1169. 10.1086/379378.
 Hudson RR: Gene genealogies and the coalescent process. Oxford surveys in evolutionary biology. Edited by: Futuyma D, Antonovics J. 1991, Oxford University Press, Oxford, 7: 1-44.
 Renkonen O: Statistisch-ökologische Untersuchungen über die terrestrische Käferwelt der finnischen Bruchmoore. Ann Zool Soc Bot Fenn Vanamo. 1938, 6: 1-231.
 Tishkoff SA, Pakstis AJ, Ruano G, Kidd KK: The accuracy of statistical methods for estimation of haplotype frequencies: an example from the CD4 locus. Am J Hum Genet. 2000, 67: 518-522. 10.1086/303000.
 Devlin B, Risch N: A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics. 1995, 29: 311-322. 10.1006/geno.1995.9003.
 Liu K, Muse S: PowerMarker: new genetic data analysis software. Version 3.0. Free program distributed by the author over the internet. [http://www.powermarker.net]
 Abecasis GR, Cookson WO: GOLD – graphical overview of linkage disequilibrium. Bioinformatics. 2000, 16: 182-183. 10.1093/bioinformatics/16.2.182.
 Stephens M, Smith NJ, Donnelly P: Reply to Zhang et al. Am J Hum Genet. 2001, 69: 912-914. 10.1086/323623.
 Osier M, Pakstis AJ, Kidd JR, Lee JF, Yin SJ, Ko HC, Edenberg HJ, Lu RB, Kidd KK: Linkage disequilibrium at the ADH2 and ADH3 loci and risk of alcoholism. Am J Hum Genet. 1999, 64: 1147-1157. 10.1086/302317.
 Zhao H, Pfeiffer R, Gail MH: Haplotype analysis in population genetics and association studies. Pharmacogenomics. 2003, 4: 171-178. 10.1517/phgs.4.2.171.22636.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.