Empirical vs Bayesian approach for estimating haplotypes from genotypes of unrelated individuals

Background The completion of the HapMap project has stimulated further development of haplotype-based methodologies for disease associations. A key aspect of such development is the statistical inference of individual diplotypes from unphased genotypes. Several methodologies for inferring haplotypes have been developed, but they have not been evaluated extensively to determine which method not only performs well but also can be easily incorporated into downstream haplotype-based association analyses. In this paper, we attempt to do so. Our evaluation compared the two leading Bayesian methods, implemented in PHASE and HAPLOTYPER, and the two leading empirical methods, implemented in PL-EM and HPlus. We used these methods to analyze real data, namely the dense X-chromosome genotypes of 30 European and 30 African trios provided by the International HapMap Project, as well as simulated genotype data. Results All programs performed very well on the X-chromosome data, with an average similarity index of 0.99 and an average prediction rate of 0.99 for both European and African trios. On data simulated under a coalescent model, PHASE, which implements the Bayesian method based on a coalescent approximation, outperformed the other programs at small sample sizes; as the sample size increased, the other programs performed as well as PHASE. PL-EM and HPlus, which implement empirical methods, required much less running time than the programs implementing Bayesian methods: only one-hundredth to one-thousandth of the running time required by PHASE, particularly when analyzing large sample sizes and large numbers of SNPs.
Conclusion For large sample sizes (hundreds or more), which most association studies require, the two empirical methods might be used, since they infer haplotypes as accurately as the Bayesian methods and can be incorporated easily into downstream haplotype-based analyses such as haplotype-association analyses.


Background
The completion of the HapMap project has stimulated further interest in haplotype inference from un-phased single nucleotide polymorphism (SNP) genotypes [1].
Recent evidence indicates the human genome has hot spots and cold spots for recombination, and it is divided into multiple haplotype blocks, each of which has only a limited number of haplotypes [2][3][4][5]. Such haplotype block structure in the human genome suggests that haplotype-based methods may play an important role in genetic association studies [6,7]. Haplotypes can be generated experimentally by dissecting out single chromosomes and inserting the entire chromosome into a yeast artificial chromosome [8] or by using rodent-human hybrid techniques to physically separate two chromosomes [5]. However, both technologies are experimentally challenging and cost prohibitive for use in population research at this time. The most commonly used technologies generate unphased SNP genotypes. One way to resolve individual haplotypes is via family data, which is expensive to collect [9]. Another option is to resolve individual haplotypes statistically. Clark's heuristic algorithm is probably among the first statistical methods for inferring haplotypes from genotypes of unrelated individuals [10].
Many maximum likelihood methods have been developed, and almost all share the same scientific objectives and likelihood framework. The fundamental difference among them is the prior assumption about the distribution of haplotypes: methods with a prior assumption on the haplotype distribution are referred to as Bayesian methods, and methods without any prior assumption are called empirical methods.
The expectation-maximization (EM) algorithm, a maximum likelihood method, was first introduced to infer haplotypes from unrelated individuals [11][12][13], but those earlier implementations were computationally demanding when processing large numbers of SNPs. More recently, Qin et al. [14] described a new strategy for estimating haplotype frequencies with the EM algorithm that largely improved performance, especially when analyzing data with large numbers of SNPs. Li et al. [15] applied the estimating-equation (EE) technique and further improved the statistical and computational efficiency of estimating haplotype frequencies and their standard errors. Both the EM and EE methods are empirical methods.
Stephens et al. were probably among the first groups to propose a model-based Bayesian method [16] under the assumption of coalescence of haplotypes. Later it was modified to improve statistical and computational efficiency [17,18]. Niu et al. [19] took a Bayesian approach but chose a Dirichlet distribution for haplotypes as their prior, and published a computational algorithm to handle a large number of SNPs, which was referred to as partition-ligation (PL).
Which method performs the best? Several papers have attempted to address this question and their conclusions are not without controversy [14,[19][20][21][22]. These papers compared the performances of haplotyping methods based on a limited number of available haplotype data sets and some simulated data. Recently, Marroni et al. [23] used genotype data provided by Illumina and Affymetrix for Genetic Analysis Workshop 14. The data include genotypes of 104 mother-son pairs with Caucasian ancestry on 313 SNPs of the X-chromosome. Because males have only one copy of the X-chromosome, mother haplotypes can be resolved from their sons' genotypes. Instead of evaluating the performances of the methods for analyzing SNPs within haplotype blocks, Marroni et al. investigated the 14 series of unphased genotypes of 5 or 10 SNPs with different values of linkage disequilibrium (LD). In this paper, we used the dense genotypes on the X-chromosome of 30 European and 30 African trios provided by the International HapMap project. The X-chromosome was chosen because mother haplotypes can be unambiguously resolved from the genotypes of trios. Resolved haplotypes were divided into blocks using the Haploview program [24], providing abundant haplotype data sets. Among the identified haplotype blocks, we randomly selected 500 blocks to evaluate haplotyping method performances. To evaluate the performances of haplotype methods on the data with larger sample sizes, we conducted some simulation studies under different scenarios. In our first set of simulations, we generated haplotypes based on real data on the X-chromosome. In our second set of simulations, we generated haplotypes using Hudson's coalescent program [25] to investigate how much efficiency is gained by assuming coalescence prior in PHASE compared to empirical methods without assuming a prior. 
Programs used for comparisons are PHASE (version 2.1) [26] for the model-based Bayesian method [20], HAPLOTYPER [27] for the empirical Bayesian method [19], PL-EM [28] for the EM method [14], and HPlus [29] for the EE method [15]. Because the accurate estimation of haplotype frequencies and inference of individual haplotypes are both critical in assessing haplotype association with disease phenotypes [30][31][32][33][34], our comparisons focus on evaluating method performance from these two angles.

HapMap Trio Data
We used X-chromosome genotype data of 30 European and 30 African trios from the HapMap project [1]. With trio data, a mother's haplotypes can be resolved unambiguously from the genotypes of her offspring and the offspring's father. The mother's two chromosomes are separated at each locus as transmitted or not transmitted to her child. If the child is male, his X-chromosome is transmitted from his mother, so the mother's allele that matches the child's allele is the transmitted allele. If the child is female, one chromosome is from her mother and the other is from her father; using the father's allele on the X-chromosome, we can deduce which allele is transmitted from the mother. Hence, the 30 mother haplotype pairs are determined by sorting transmitted alleles from untransmitted alleles, and these 60 (= 30 × 2) haplotypes represent the true haplotypes, which are not readily obtainable for any autosomal chromosome.
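The transmitted/untransmitted bookkeeping described above can be sketched in Python. This is an illustrative sketch, not the HapMap pipeline itself: the genotype encoding (allele pairs per locus, with a hemizygous male child coded as a homozygous pair) and the function name are our own assumptions.

```python
def phase_mother_x(mother, child, father, child_is_male):
    """Resolve a mother's two X-chromosome haplotypes from trio genotypes.

    Each genotype is a list of unordered allele pairs, e.g. ('A', 'G').
    Returns (transmitted, untransmitted) allele lists, with None at loci
    where transmission cannot be deduced (e.g. genotyping inconsistency).
    """
    transmitted, untransmitted = [], []
    for m, c, f in zip(mother, child, father):
        if child_is_male:
            # A male child has a single maternal X allele (hemizygous,
            # coded here as a homozygous pair).
            maternal = c[0]
        else:
            # A female child carries one paternal allele (the father is
            # hemizygous on X); the remaining allele must be maternal.
            paternal = f[0]
            alleles = list(c)
            if paternal in alleles:
                alleles.remove(paternal)
                maternal = alleles[0]
            else:
                maternal = None  # inconsistent trio genotypes
        if maternal is not None and maternal in m:
            rest = list(m)
            rest.remove(maternal)
            transmitted.append(maternal)
            untransmitted.append(rest[0])
        else:
            transmitted.append(None)
            untransmitted.append(None)
    return transmitted, untransmitted
```

For a heterozygous mother ('A', 'G'), a male child carrying 'A' immediately identifies 'A' as transmitted; for a female child, the father's hemizygous allele is subtracted first.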
Applying this procedure to the HapMap Phase II data (July 2005 release), we obtained phase-resolved X-chromosome SNP data. We further divided these SNPs into haplotype blocks using the Haploview software [24] and the method described in [35], with the following parameters: confidence-interval minima of 0.8 and 0.5 for strong LD, an upper confidence-interval maximum of 0.6 for strong recombination, and a fraction of strong LD in informative comparisons of at least 0.95; SNPs with MAF below 0.05 were excluded. We chose these parameters to be less stringent than the default values in order to obtain larger blocks that are still in high LD. Block identification was done separately for the European and African mother haplotypes. Within each population, we randomly chose 500 haplotype blocks to compare the haplotyping methods. Among the 500 European mother haplotype blocks, the number of SNPs ranges from 2 to 195 with a mean of 13 and a median of 7; among the 500 African mother haplotype blocks, it ranges from 2 to 33 with a mean of 5 and a median of 3. It has been shown that African populations have shorter haplotype blocks than European populations [1].

Notations
Consider a sample of n unrelated individuals from a study population. From each individual, we observe q SNP genotypes in a specific region of the genome. Let G_i = (g_i1, …, g_iq) denote the q SNP genotypes for the ith individual.
When genotype g_ij is heterozygous, the phase (parental origin of the two alleles) is ambiguous and has two solutions, denoted by p_ij. Let P_i = (p_i1, …, p_iq) denote the phase of G_i. Given phase P_i, genotype G_i uniquely determines a diplotype (a pair of compatible haplotypes), H_i = (H_i1, H_i2). Therefore, for a genotype with m heterozygous loci, there are 2^(m-1) possible resolutions of phase and diplotype. Let Θ = (θ_1, …, θ_T) denote the population haplotype frequencies, where T is the total number of haplotypes.
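The 2^(m-1) count of phase resolutions can be made concrete with a small enumeration sketch; the 0/1 allele coding and the function name are illustrative assumptions, not notation from the methods compared here.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """Enumerate all unordered diplotypes compatible with an unphased genotype.

    Each locus is an unordered allele pair, e.g. (0, 1) for a heterozygote.
    With m heterozygous loci there are 2**(m-1) distinct unordered
    diplotypes, matching the count in the text.
    """
    het = [j for j, g in enumerate(genotype) if g[0] != g[1]]
    if not het:
        h = tuple(g[0] for g in genotype)
        return [(h, h)]
    diplotypes = []
    # Fix the phase at the first heterozygous locus so the unordered
    # pairs (h1, h2) and (h2, h1) are not counted twice.
    for phases in product([0, 1], repeat=len(het) - 1):
        h1 = list(g[0] for g in genotype)
        h2 = list(g[1] for g in genotype)
        for locus, p in zip(het[1:], phases):
            if p:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        diplotypes.append((tuple(h1), tuple(h2)))
    return diplotypes
```

A triple heterozygote (m = 3) yields 2^(3-1) = 4 compatible diplotypes, while a fully homozygous genotype yields exactly one.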
All methods compared in this paper use the maximum likelihood approach or a variation of it. They all share the same likelihood function of the haplotype frequencies Θ = (θ_1, θ_2, …, θ_T), which can be written as

L(Θ) = ∏_{i=1..n} Σ_{P_i} f(H_i1 | Θ) f(H_i2 | Θ) f(P_i),

where f(H_ij | Θ) is the probability of haplotype H_ij given the population haplotype frequencies and f(P_i) is the prior probability of phase P_i.

Empirical Methods
The expectation-maximization (EM) algorithm was first used to obtain maximum likelihood estimates of the haplotype frequencies [11]. To avoid being trapped in a local maximum, programs implementing the EM algorithm require multiple initial values to help reach the global maximum. Excoffier and Slatkin [11] used bootstrapping to estimate standard errors of the haplotype frequency estimates and implemented their method in ARLEQUIN. Qin et al. [14] instead used Louis' method [36] for standard errors and implemented their approach in PL-EM. Applying the estimating-equation technique, Li et al. [15] efficiently estimated haplotype frequencies and their standard errors and implemented their method in HPlus.
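As a minimal textbook sketch of the EM approach (not the PL-EM partition-ligation algorithm or HPlus), the following assumes genotypes coded as 0/1/2 minor-allele counts per SNP and Hardy-Weinberg equilibrium; all names are illustrative.

```python
from itertools import product
from collections import defaultdict

def em_haplotype_freqs(genotypes, n_iter=50):
    """EM estimation of haplotype frequencies from unphased genotypes.

    genotypes: tuples of 0/1/2 minor-allele counts per SNP.
    Returns a dict mapping haplotypes (0/1 tuples) to frequencies.
    """
    def expansions(g):
        # All unordered haplotype pairs consistent with genotype g.
        per_locus = []
        for gj in g:
            if gj == 0:
                per_locus.append([(0, 0)])
            elif gj == 2:
                per_locus.append([(1, 1)])
            else:
                per_locus.append([(0, 1), (1, 0)])
        pairs = set()
        for combo in product(*per_locus):
            h1 = tuple(a for a, _ in combo)
            h2 = tuple(b for _, b in combo)
            pairs.add((min(h1, h2), max(h1, h2)))
        return list(pairs)

    expanded = [expansions(g) for g in genotypes]
    # Initialize: uniform over every haplotype seen in any expansion.
    freqs = defaultdict(float)
    for pairs in expanded:
        for h1, h2 in pairs:
            freqs[h1] += 1.0
            freqs[h2] += 1.0
    total = sum(freqs.values())
    freqs = defaultdict(float, {h: c / total for h, c in freqs.items()})

    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in expanded:
            # E-step: posterior weight of each compatible diplotype
            # under HWE (factor 2 for heterozygous unordered pairs).
            weights = [freqs[h1] * freqs[h2] * (1.0 if h1 == h2 else 2.0)
                       for h1, h2 in pairs]
            z = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / z
                counts[h2] += w / z
        # M-step: normalize expected haplotype counts by 2n chromosomes.
        n_chrom = 2.0 * len(genotypes)
        freqs = defaultdict(float, {h: c / n_chrom for h, c in counts.items()})
    return dict(freqs)
```

In practice multiple random restarts are used, as noted above, since the likelihood surface can have local maxima.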

Bayesian methods
Unlike the empirical approaches described above, the model-based Bayesian method [16] further assumes that haplotypes are coalescent, so each newly sampled individual's haplotypes H_i are assumed to be similar to the previously sampled haplotypes H_-i [37]. This Bayesian method is implemented in the PHASE software. Another Bayesian method [19] assumes that the prior distribution of the haplotype frequencies follows a Dirichlet distribution with hyperparameters β = (β_1, …, β_T). Using a Gibbs sampling algorithm, Niu et al. sampled a pair of compatible haplotypes for each individual and estimated the haplotype frequencies; this method is implemented in HAPLOTYPER.
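A minimal sketch of such a Gibbs sampler under a symmetric Dirichlet prior is given below. It is illustrative only: HAPLOTYPER additionally uses partition-ligation, and the small correction to the predictive weight for homozygous diplotypes is omitted here for brevity. Genotype coding and names are our own assumptions.

```python
import random
from itertools import product
from collections import Counter

def gibbs_haplotypes(genotypes, beta=0.1, n_sweeps=200, seed=0):
    """Gibbs sampling of diplotypes under a symmetric Dirichlet(beta) prior.

    genotypes: tuples of 0/1/2 minor-allele counts per SNP.
    Returns the final sampled diplotype (pair of 0/1 tuples) per individual.
    """
    rng = random.Random(seed)

    def expansions(g):
        # All unordered haplotype pairs consistent with genotype g.
        per_locus = [[(0, 0)] if gj == 0 else [(1, 1)] if gj == 2
                     else [(0, 1), (1, 0)] for gj in g]
        pairs = set()
        for combo in product(*per_locus):
            h1 = tuple(a for a, _ in combo)
            h2 = tuple(b for _, b in combo)
            pairs.add((min(h1, h2), max(h1, h2)))
        return list(pairs)

    expanded = [expansions(g) for g in genotypes]
    state = [rng.choice(p) for p in expanded]   # current diplotype per person
    counts = Counter(h for pair in state for h in pair)

    for _ in range(n_sweeps):
        for i, pairs in enumerate(expanded):
            # Remove individual i, then resample its diplotype with
            # probability proportional to the Dirichlet predictive weights.
            for h in state[i]:
                counts[h] -= 1
            weights = [(counts[h1] + beta) * (counts[h2] + beta)
                       for h1, h2 in pairs]
            z = sum(weights)
            r, acc = rng.random() * z, 0.0
            for pair, w in zip(pairs, weights):
                acc += w
                if r <= acc:
                    state[i] = pair
                    break
            for h in state[i]:
                counts[h] += 1
    return state
```

A full implementation would average over post-burn-in sweeps to estimate haplotype frequencies and posterior diplotype probabilities rather than returning a single sampled state.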

Comparison Measurements
Accurately estimating haplotype frequencies and inferring individual haplotypes are both critical in assessing haplotype association with disease phenotypes [30][31][32][33][34]. Here we consider two measures of accuracy. The first is the similarity index [11], defined as I_F = 1 - (1/2) Σ_j |θ_j - θ̂_j|, where θ_j and θ̂_j are the true and estimated frequencies of the jth haplotype; it measures the overall similarity between the estimated and the sample haplotype frequencies and ranges from zero to one. The second is the prediction rate, the percentage of haplotypes correctly predicted from genotypes compared with the sampled haplotypes. Since HAPLOTYPER gives only one pair of compatible haplotypes for each individual, we calculated the prediction rate based on the best prediction for each individual. For the other programs, we also computed a prediction rate weighted by the posterior probability of each inferred haplotype pair; the results were similar between the two prediction rates. Running time was recorded to measure the computational efficiency of the implemented algorithms. All programs were run under their default or recommended settings on a dual Pentium III 800 MHz computer with 2 GB RAM.
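Under the definitions above, the two measures can be computed as follows (a sketch; the dictionary representation of frequencies and the unordered-pair comparison are our own conventions):

```python
def similarity_index(true_freqs, est_freqs):
    """Similarity index I_F = 1 - 0.5 * sum_j |theta_j - theta_hat_j|,
    taken over the union of observed haplotypes (1 = identical)."""
    haps = set(true_freqs) | set(est_freqs)
    return 1.0 - 0.5 * sum(abs(true_freqs.get(h, 0.0) - est_freqs.get(h, 0.0))
                           for h in haps)

def prediction_rate(true_diplotypes, inferred_diplotypes):
    """Fraction of individuals whose inferred haplotype pair matches the
    true pair, ignoring the order of the two haplotypes."""
    hits = sum(1 for t, e in zip(true_diplotypes, inferred_diplotypes)
               if sorted(t) == sorted(e))
    return hits / len(true_diplotypes)
```

Taking the union of haplotype keys handles haplotypes estimated at zero frequency or absent from one of the two distributions.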

Genotype data on the X-chromosome from the International HapMap project
All four programs inferred haplotypes with high accuracy from the genotypes on the X-chromosome in the 500 selected European haplotype blocks, although HAPLOTYPER failed to return results for 18 blocks and PHASE performed poorly on one block, with a similarity index of 0.29 and a prediction rate of 0.28. The mean similarity index and mean prediction rate are 0.99, and the medians are 1.0, for all programs (Table 1). All four programs performed similarly on the 500 African haplotype blocks. In general, the performance of all programs is affected by the percentage of heterozygous individuals, since heterozygosity at multiple loci creates uncertainty about individual haplotypes. To investigate the impact of this factor, we examined its relationship with the similarity index and prediction rate in the analysis of the 500 European haplotype blocks (Figure 1). All programs tended to perform better on data with a lower percentage of uncertainty, but even with a high percentage of uncertainty in the genotype data, all programs still performed with high accuracy. Another factor is the LD of the SNPs within blocks, which was investigated in a recent paper [23]. Figure 2 shows the relationship between performance and the LD of the haplotype blocks. Since our focus is on evaluating program performance within haplotype blocks, the SNPs are in high LD; with high LD, all programs performed well, apart from PHASE's poor performance on the one block noted above. Multi-locus LD is measured using the formulation derived in [38].

Simulated Data
In the first set of simulations, we randomly selected three series of SNPs with low LD. For each selected series, we estimated haplotype frequencies from the 60 haplotypes of the 30 European mothers. Based on these frequencies, we randomly drew haplotypes to form the genotypes of 100, 150, 200, 250, or 300 individuals. For each sample size, we generated 100 replicate data sets and analyzed each with all four programs. Table 3 shows the average performance of each program over the 100 replicates; the similarity index and prediction rate from analyzing the original data are presented in the first row of each block. PHASE is clearly superior to the other programs on these performance indices at small sample sizes, but as the sample size increases, the other programs, especially PL-EM and HPlus, perform as well as PHASE and sometimes (e.g., in the second selected block) outperform it on the prediction rate.
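The genotype-generation step of this first simulation design can be sketched as below, assuming haplotypes coded as 0/1 tuples and unphased genotypes as per-SNP minor-allele counts (names are illustrative):

```python
import random

def simulate_genotypes(hap_freqs, n, seed=0):
    """Draw n unphased genotypes by sampling haplotype pairs from the
    given frequency dict (haplotype tuple -> frequency)."""
    rng = random.Random(seed)
    haps = list(hap_freqs)
    weights = [hap_freqs[h] for h in haps]
    genotypes = []
    for _ in range(n):
        h1 = rng.choices(haps, weights=weights)[0]
        h2 = rng.choices(haps, weights=weights)[0]
        # Summing per-SNP alleles of the two haplotypes discards phase,
        # yielding an unphased 0/1/2 genotype.
        genotypes.append(tuple(a + b for a, b in zip(h1, h2)))
    return genotypes
```

The replicate data sets are then obtained by varying the seed for each of the 100 repetitions at a given sample size.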
We also used Hudson's coalescent program to generate phase-resolved haplotype data. We generated data sets with a population mutation parameter of 4 (θ = 4N_e μ) and sample sizes of 100, 150, 200, and 250. For each sample size, we repeated the simulation 100 times. Table 4 lists the average performance indices for each program and sample size. For all sample sizes, PHASE consistently performed better than the other programs on both the similarity index and the prediction rate. This result supports the notion that when the modeling assumption is valid, PHASE is more efficient than the other methods (empirical Bayesian or empirical). However, the differences between PHASE and the others become less marked at larger sample sizes. This was expected, because the gain by PHASE due to the coalescent assumption diminishes, and the likelihood methods approach their full efficiency, as the sample size increases. We also conducted simulation studies with different coalescent model parameters, and the results (not shown) are largely comparable to those in Table 4.
In both Tables 3 and 4, average running times are recorded on the far right for comparison purposes. For all simulations, PHASE requires much more computational time than others and HPlus requires the least computational time among the four.

Discussion
The key difference between the Bayesian and empirical methods compared in this paper is the use of priors (the approximate coalescent prior in PHASE and the Dirichlet prior in HAPLOTYPER). If the prior approximates the real haplotype data well, Bayesian methods gain some efficiency from it; on the other hand, efficiency may be lost under a wrong prior. The influence of the prior is non-negligible when the sample size is small. In that case, because real haplotypes tend to coalesce, PHASE, using the approximate coalescent prior, is likely to produce more efficient estimates, while HAPLOTYPER, using the Dirichlet prior, may produce less efficient estimates than the empirical methods. We observed this phenomenon when inferring haplotypes from genotypes simulated with Hudson's coalescent program [25]. However, the superior performance of PHASE diminishes as the sample size increases (Table 4). For genotype data with low LD, PHASE would gain some efficiency from the approximate coalescent prior when the sample size is small, such as 30 (investigated here) or 104 (investigated by Marroni et al. [23]). However, when the sample size increases to 150 or larger, PL-EM and HPlus, implementing empirical methods, can perform as well as PHASE (Table 3).

Figure 1. The relationship between the performances of haplotyping methods and the percentage of individuals with uncertain haplotypes. The plots show the performances (similarity index and prediction rate) of the empirical methods (PL-EM and HPlus) and the Bayesian methods (PHASE and HAPLOTYPER) on the 500 randomly selected haplotype blocks from the 30 European mothers' X-chromosome genotypes.

Figure 2. The relationship between the performances of haplotyping methods and the linkage disequilibrium (LD) of the haplotypes within blocks. The plots show the performances (similarity index and prediction rate) of the same methods on the same 500 haplotype blocks.
Recently, Kimmel and Shamir [39] developed a new likelihood method to infer haplotypes and identify haplotype blocks. Their likelihood uses not only the haplotype frequency parameters but also parameters for the probability of observing a variant allele at each locus on each haplotype, and they estimated the haplotype frequencies with an EM algorithm. Whether this new likelihood is better than the one used by most methods deserves debate. In terms of performance, Kimmel and Shamir [39] reported that PHASE performed slightly better under its default settings but was a hundred times slower than GERBIL.

Conclusion
The recent advent of genotyping technologies is rapidly transforming genetic association studies by providing more SNPs (more than 500,000) on arrays and by reducing the cost of genotyping individual samples (around $500-1000 per sample). The next generation of genome-wide studies will likely use several hundreds or thousands of SNPs on hundreds or thousands of individuals. To gain both statistical and computational efficiency, haplotype-based analyses will be used increasingly, especially for regions with high LD. With such massive data, as we show in this paper, empirical methods such as EM and EE can infer haplotypes as accurately as a time-consuming method such as the model-based Bayesian method that PHASE implements, and they can be easily incorporated into downstream haplotype-based analyses. The empirical methods have already been used in many haplotype-based association methods [32,34,40].