- Research article
- Open Access
Estimating haplotype frequencies in pooled DNA samples when there is genotyping error
© Quade et al; licensee BioMed Central Ltd. 2005
- Received: 03 September 2004
- Accepted: 19 May 2005
- Published: 19 May 2005
Maximum likelihood estimates of haplotype frequencies can be obtained from pooled DNA using the expectation maximization (EM) algorithm. Through simulation, we investigate the effect of genotyping error on the accuracy of haplotype frequency estimates obtained using this algorithm. We explore model parameters including allele frequency, inter-marker linkage disequilibrium (LD), genotyping error rate, and pool size.
Pool sizes of 2, 5, and 10 individuals achieved comparable levels of accuracy in the estimation procedure. Common marker allele frequencies and no inter-marker LD result in less accurate estimates. This pattern is observed regardless of the amount of genotyping error simulated.
Genotyping error slightly decreases the accuracy of haplotype frequency estimates. However, the EM algorithm performs well even in the presence of genotyping error. Overall, pools of 2, 5, and 10 individuals yield similar accuracy of the haplotype frequency estimates, while reducing costs due to genotyping.
- Linkage Disequilibrium
- Similarity Index
- Pool Sample
- Pool Size
- Haplotype Frequency
Association studies offer several advantages over linkage analysis for mapping susceptibility loci in complex diseases. They may be more powerful than linkage analysis for loci with a small effect, since the excess sharing across families is expected to be greater than the excess sharing within a family (identity-by-descent (IBD)). In addition, association studies are expected to provide greater precision in pinpointing the location of susceptibility loci. Finally, association studies do not require the collection of groups of relatives or extended pedigrees, which can be challenging, particularly for late-onset diseases.
However, even for association studies, the large sample sizes necessary to study the genetics of complex disease appear unavoidable, so recent interest has focused on methods to reduce the cost. One approach is to use diallelic nucleotide bases, or single nucleotide polymorphisms (SNPs), to help identify susceptibility genes. SNPs are abundantly available in the human genome (approximately 1 per kb of DNA), providing a plentiful source in the genome from which to choose. Additionally, SNP genotyping can be completely automated, and recent technologies have decreased the time necessary to perform the genotyping (as reviewed by Syvanen 2001). As a result, SNPs are relatively easy, fast, and inexpensive to genotype compared to other existing technologies, such as microsatellite markers (e.g. [5, 6]). A second approach to reduce the cost of genotyping is to use DNA pooling, where equal amounts of DNA from each of a group of individuals are combined and then genotyping is performed on the pool instead of on each individual's DNA separately. This procedure has the potential to substantially reduce the genotyping costs, since, if the pools are formed from k individuals, the genotyping costs will be reduced to (100/k)% of the cost of genotyping each individual.
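As a rough illustration of the (100/k)% figure, the sketch below counts genotyping assays per SNP for the pool sizes and sample sizes considered in this study. It assumes, for simplicity, that one assay is run per pool per SNP and that N is a multiple of k; the function name is hypothetical and is not part of the original analysis.

```python
def pooled_genotyping_cost(n_individuals, pool_size):
    """Return (number of pools, cost as a percentage of individual genotyping).

    Assumes one genotyping assay per pool and N divisible by k.
    """
    if n_individuals % pool_size != 0:
        raise ValueError("this sketch assumes N is a multiple of k")
    n_pools = n_individuals // pool_size
    return n_pools, 100.0 / pool_size


for n in (500, 1000):
    for k in (1, 2, 5, 10):
        pools, pct = pooled_genotyping_cost(n, k)
        print(f"N={n}, k={k}: {pools} pools, {pct:.0f}% of individual cost")
```

For example, N = 1000 individuals in pools of k = 10 require 100 assays per SNP, or 10% of the cost of individual genotyping.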
Unfortunately, SNPs are relatively uninformative individually, i.e. more than one is required to obtain an amount of information equivalent to more informative markers, such as microsatellites. One way to increase the information from SNPs is to use haplotypes constructed from multiple SNPs, which is more powerful for detecting an association than using all SNPs individually. The expectation-maximization (EM) algorithm has been implemented to obtain maximum likelihood estimates of haplotype frequencies for pooled data [8, 9]. These studies show that the algorithm provides accurate estimates of the haplotype frequencies when no genotyping error is present.
More realistically, genotyping errors do occur, which can have implications for the accuracy of haplotype frequency estimates from pooled samples. In this paper we investigate the effect of genotyping error on 2-SNP haplotype frequency estimates obtained using the EM algorithm for pooled data. We show that the algorithm performs well even in the presence of genotyping error, compared to estimates obtained when there is no genotyping error present.
To evaluate the performance of the EM algorithm, the following parameters were examined: number of individuals per pool, sample size, marker allele frequency, and strength of inter-marker linkage disequilibrium. All parameters were evaluated for scenarios with and without genotyping error.
Number of individuals (k) per pool
Marker allele frequency
Interest continues to increase in association analysis for complex genetic traits; however, this study design is still not without shortcomings. Therefore, we evaluated the effects of genotyping error on the estimates of haplotype frequencies when pooling DNA for association studies and, more specifically, we assessed the benefits (if any) to be gained from it. Additionally, we investigated the effects of pool size, marker allele frequencies, and LD on the accuracy of the haplotype frequency estimates.
We have shown that accuracy of the haplotype frequency estimates decreases as the level of genotyping error increases. However, this decrease is small and, even in the presence of genotyping error, the estimates of the haplotype frequencies are accurate. Ideally, it would be most beneficial to design studies with a large number of individuals per pool to minimize the genotyping costs. We observed, under all genotyping error levels, that pools of size 2, 5, and 10 all achieve about the same level of accuracy. This suggests that pool sizes of 10 individuals could be used to obtain accurate estimates. Using a larger number of individuals per pool and still obtaining the same level of accuracy allows for an even greater reduction in the cost of genotyping compared to situations that utilize a smaller number of individuals per pool. Additionally, we observed that a sample size of 500 is just as accurate as a sample size of 1000 for the models simulated. Therefore this further supports the notion that less genotyping can be performed and yet the same level of accuracy obtained for haplotype frequency estimates.
We observed that marker allele frequencies and the amount of LD can affect the accuracy of the haplotype frequency estimates. In the absence of genotyping error, more accurate estimates are obtained when a rare allele is present in the pool and/or when LD is stronger. The same pattern was observed in the presence of genotyping error. For individually genotyped unrelated samples, Kirk and Cardon (2002) evaluated the EM algorithm for estimating haplotype frequencies in the presence of genotyping errors using a much smaller sample size of 50. These authors observed that, in situations of high LD and/or when rare alleles are present, the EM algorithm offered a high degree of accuracy even in the presence of genotyping errors. In situations with low LD and/or common alleles, the EM algorithm performs more poorly for both individual and pooled designs.
To date, the most common pooling strategy has been to create one large pool for each condition (e.g., case and control status), and to compare allele frequencies among pools (e.g. [11, 12]). This strategy would certainly result in the greatest efficiency in genotyping, but at the cost that individual haplotype frequencies cannot be estimated. Under this strategy, Le Hellard et al. (2002) evaluated several quantitative SNP genotyping methods for pooled samples and compared the true allele frequencies, obtained by genotyping each sample individually, to those estimated from the pooled sample. Although errors are present when estimating the allele frequencies from pooled samples, pooling provided reasonably accurate estimates, even for these very large pools. In a comprehensive review of DNA pooling, Sham (2002)  concludes that pooling can be considered both cost and time effective. It remains to be determined whether the cost efficiency gained by forming large pools to reduce the amount of genotyping outweighs the statistical efficiency gained by performing haplotype analysis using smaller pools.
In this analysis we chose to introduce error into genotypes because we were evaluating pooled samples. However, several other types of error models are available for individually genotyped samples (e.g., [15, 16]). Under a different error model, the conclusions reached in this analysis might differ. Ultimately we would like to account for the genotyping error when estimating haplotype frequencies for pooled samples, just as Zou & Zhao (2003) have done for individually genotyped samples.
To assess the accuracy of our results we chose the similarity index because this measure sums across all haplotype frequencies. However, we could have chosen to evaluate the estimated haplotype frequencies individually, as in Zou & Zhao (2003). Evaluating haplotypes using a different measure could result in different conclusions. For example, Zou & Zhao (2003) report four estimated haplotype frequencies of 0.366, 0.126, 0.132, and 0.376 where the true frequencies are 0.4, 0.1, 0.1, and 0.4, respectively. Based on this, the authors note the highest change in haplotype frequency estimates to be 30% (from an estimated frequency of 0.132 where the true frequency is 0.1). However, for this example the similarity index, which takes all four haplotypes into account, is 0.94.
For this analysis we only chose to evaluate two-marker loci; however, our method can be extended to accommodate many marker loci. For individual samples, as the number of markers increases there is a loss of accuracy in the haplotype frequency estimates. It is possible that this loss of accuracy could be even more severe for pooled samples.
Genotyping error may have an impact on the detection of false positive or false negative signals in genetic association studies, or on the sample size needed to detect an association when using DNA pools. Gordon et al. (2002) quantify the effects that individual genotyping errors have on power and required sample size for case-control genetic association studies. They report that genotyping errors increase the likelihood of missing a real effect. Similarly, Zou & Zhao (2004) evaluate the impact of genotyping errors on false discovery rates for individual genotyping and the impact of measurement errors or pool formation errors for pooled genotyping. They report that genotyping errors can lead to a higher rate of false positives for individual genotyping and even higher measurement errors for pooled samples.
Here we only consider the accuracy of the EM algorithm for pooled samples to estimate the haplotype frequencies in the presence of genotyping error and do not evaluate the sample size necessary to detect an association. Therefore, even though we find pool sizes as large as 10 and a sample size of 500 to be efficient for estimating haplotype frequencies, we cannot comment on the effect of genotyping errors on the ability to find false positive or negative associations.
When using the EM algorithm for pooled samples, we found that genotyping error slightly decreases the accuracy of haplotype frequency estimates. However, the EM algorithm still performs well even in the presence of genotyping error. Overall, pools of 2, 5, and 10 individuals yield similar accuracy of the haplotype frequency estimates, likewise for sample sizes of 500 and 1000 individuals. Therefore, we can conclude that the overall amount of genotyping can be reduced by using 10 individuals per pool with sample sizes as small as 500 individuals.
Data were simulated both with and without genotyping error for each pool size (k) under 198 genetic models using combinations of different allele frequencies at each locus (0.01, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 0.99) and strength of LD between the markers (between 0 and 0.25). For simplicity, we only considered the case with two loci and two alleles, $A_i$ and $B_i$ for locus i, but this method can be extended to accommodate more marker loci. The maximum LD, $\delta_{\max}$, was calculated as the smaller of $p_A q_B$ or $q_A p_B$, where $p_A$ and $p_B$ represent allele frequencies ranging from 0.01–0.99 and $q = 1 - p$. For each combination of allele frequencies, 3 situations were simulated with $\delta$ equal to 0, $\delta_{\max}/2$, and $\delta_{\max}$. Throughout our study, we assume Hardy-Weinberg equilibrium. We considered pool sizes (k) of 1, 2, 5, and 10 individuals, and total sample sizes (N) of 500 and 1000 individuals.
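The mapping from allele frequencies and LD to the four two-locus haplotype frequencies can be sketched as follows. This is a minimal illustration using the standard two-locus parameterization, not the authors' simulation code. The labels A/a for the alleles at the first SNP and B/b for the alleles at the second SNP, the function name, and the `ld_fraction` argument (0, 0.5, or 1 for the three LD settings) are assumptions introduced here.

```python
def haplotype_frequencies(p_a, p_b, ld_fraction):
    """Two-locus haplotype frequencies for allele frequencies p_a, p_b.

    ld_fraction scales delta between 0 and delta_max, where
    delta_max = min(p_a * q_b, q_a * p_b).
    """
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    delta_max = min(p_a * q_b, q_a * p_b)
    delta = ld_fraction * delta_max
    return {
        "AB": p_a * p_b + delta,
        "Ab": p_a * q_b - delta,
        "aB": q_a * p_b - delta,
        "ab": q_a * q_b + delta,
    }


# The three LD settings for one pair of allele frequencies.
for fraction in (0.0, 0.5, 1.0):
    print(fraction, haplotype_frequencies(0.2, 0.5, fraction))
```

At `ld_fraction = 1` one of the two "repulsion" haplotypes (Ab or aB) has frequency zero, which is why $\delta_{\max}$ is the smaller of the two products.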
Table: Conditional probabilities of observed genotypes, given the true genotype of pool j, for k = 2 individuals with SNP locus AB in pool j. Rows: observed unordered genotypes (number of A alleles in pool j); column: true unordered genotype AABB.
Based on the value of α we introduce error into genotypes by simulating three levels of genotyping error: no genotyping error ($\sigma^2 = 0$); intermediate genotyping error ($\sigma^2 = 0.01$), which corresponds to a realistic level of genotyping error based on the results of Le Hellard et al. (2002); and the maximum possible genotyping error given our error distribution ($\sigma^2 = 0.50$–$5.18$, depending on the value of k).
Estimation of haplotype frequencies via the EM algorithm for pooled samples
Wang et al. (2003) and Ito et al. (2003) independently developed EM algorithms to estimate haplotype frequencies from pooled data. Both obtain the estimates by maximum likelihood, and this approach is identical to the one we used when analyzing the data under no genotyping error. To investigate whether a global maximum was found, four sets of starting values were used for k = 1, 2, and 5 individuals per pool, and two sets for k = 10, to determine whether they reach the same maximum likelihood estimate. The results we present are those from whichever starting values gave the largest likelihood.
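A minimal sketch of an EM iteration of this kind, for two SNPs and no genotyping error, is given below. It is an illustration under stated assumptions rather than the implementation used here or by Wang et al. or Ito et al. It assumes each pool of k individuals is summarized by a pair (a, b): the number of A alleles at the first SNP and the number of B alleles at the second SNP among the pool's 2k chromosomes. The haplotype labels (AB, Ab, aB, ab) and all function names are hypothetical.

```python
from math import factorial, log

HAPLOTYPES = ("AB", "Ab", "aB", "ab")


def config_weight(counts, freqs):
    """Multinomial probability of one haplotype-count configuration."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    weight = float(coef)
    for c, f in zip(counts, freqs):
        weight *= f ** c
    return weight


def compatible_configs(a, b, m):
    """Haplotype counts (AB, Ab, aB, ab) among m chromosomes that give
    a copies of allele A and b copies of allele B."""
    return [(x, a - x, b - x, m - a - b + x)
            for x in range(max(0, a + b - m), min(a, b) + 1)]


def log_likelihood(pools, k, freqs):
    """Error-free log-likelihood of the pooled allele counts."""
    m = 2 * k
    return sum(
        log(sum(config_weight(c, freqs) for c in compatible_configs(a, b, m)))
        for a, b in pools)


def em_pooled(pools, k, start, tol=1e-10, max_iter=10000):
    """EM estimate of the four haplotype frequencies from pooled counts."""
    m = 2 * k
    freqs = list(start)
    for _ in range(max_iter):
        expected = [0.0] * 4
        for a, b in pools:
            # E-step: posterior weight of each compatible configuration.
            weighted = [(c, config_weight(c, freqs))
                        for c in compatible_configs(a, b, m)]
            total = sum(w for _, w in weighted)
            for counts, w in weighted:
                for i in range(4):
                    expected[i] += counts[i] * w / total
        # M-step: expected haplotype counts divided by total chromosomes.
        new = [e / (m * len(pools)) for e in expected]
        if max(abs(n - o) for n, o in zip(new, freqs)) < tol:
            return new
        freqs = new
    return freqs


def best_of_starts(pools, k, starts):
    """Run EM from several starting values; keep the highest likelihood."""
    fits = [em_pooled(pools, k, s) for s in starts]
    return max(fits, key=lambda f: log_likelihood(pools, k, f))


pools = [(2, 2), (1, 3), (4, 0), (0, 1)]  # toy data for k = 2
starts = [(0.25, 0.25, 0.25, 0.25), (0.4, 0.1, 0.1, 0.4)]
estimate = best_of_starts(pools, 2, starts)
print(dict(zip(HAPLOTYPES, estimate)))
```

Given the allele counts, a single number (the count of AB haplotypes) fixes the whole configuration, so the E-step enumerates at most 2k + 1 configurations per pool. Starting values should be strictly positive so that every compatible configuration has non-zero weight.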
Evaluation of haplotype frequency estimates
We compared the estimated haplotype frequencies to the true haplotype frequencies using the similarity index ($I_F$). If $\hat{p}_i$ is the estimated haplotype frequency for haplotype i, h is the total number of haplotypes, and $p_i$ is the true haplotype frequency, then $I_F$ is defined as $I_F = 1 - \frac{1}{2}\sum_{i=1}^{h} |\hat{p}_i - p_i|$. The similarity index takes on values between 0 and 1 and is close to 0 when none of the estimated haplotype frequencies are close to the true haplotype frequencies, and 1 when all of the estimated haplotype frequencies equal the true haplotype frequencies.
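The calculation can be sketched in a few lines. The example assumes the usual form of the index, one minus half the sum of absolute differences between estimated and true frequencies, which reproduces the value of 0.94 quoted earlier for the Zou & Zhao (2003) frequencies; the function name is hypothetical.

```python
def similarity_index(estimated, true):
    """Similarity index: 1 - 0.5 * sum(|estimated_i - true_i|)."""
    if len(estimated) != len(true):
        raise ValueError("frequency vectors must have the same length")
    return 1.0 - 0.5 * sum(abs(e - t) for e, t in zip(estimated, true))


estimated = [0.366, 0.126, 0.132, 0.376]
true = [0.4, 0.1, 0.1, 0.4]
print(round(similarity_index(estimated, true), 2))  # 0.94
```

The largest single discrepancy here (0.132 versus 0.1) is sizeable in relative terms, yet the index stays high because it aggregates absolute differences across all four haplotypes.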
The authors thank two anonymous reviewers for their insightful comments. This work was partially supported by grant HG01577 from the National Human Genome Research Institute, grant GM2835 from the National Institute of General Medical Sciences, and grant RR03655 from the National Center for Research Resources.
- Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273: 1516-1517.
- Nowotny P, Kwon JM, Goate AM: SNP analysis to dissect human traits. Curr Opin Neurobiol. 2001, 11: 637-641. 10.1016/S0959-4388(00)00261-0.
- Wang DG, Fan JB, Siao CJ, Berno A, Young P, Sapolsky R, et al: Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science. 1998, 280: 1077-1082. 10.1126/science.280.5366.1077.
- Syvanen AC: Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat Rev Genet. 2001, 2: 930-942. 10.1038/35103535.
- Perlin MW, Lancia G, Ng SK: Toward fully automated genotyping: genotyping microsatellite markers by deconvolution. Am J Hum Genet. 1995, 57: 1199-1210.
- Deloukas P, Schuler GD, Gyapay G, Beasley EM, Soderlund C, Rodriguez-Tome P, et al: A physical map of 30,000 human genes. Science. 1998, 282: 744-746. 10.1126/science.282.5389.744.
- Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet. 2000, 67: 947-959. 10.1086/303069.
- Wang S, Kidd KK, Zhao H: On the use of DNA pooling to estimate haplotype frequencies. Genet Epidemiol. 2003, 24: 74-82. 10.1002/gepi.10195.
- Ito T, Chiku S, Inoue E, Tomita M, Morisaki T, Morisaki H, et al: Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am J Hum Genet. 2003, 72: 384-398. 10.1086/346116.
- Kirk K, Cardon L: The impact of genotyping error on haplotype reconstruction and frequency estimation. European Journal of Human Genetics. 2002, 10: 616-622. 10.1038/sj.ejhg.5200855.
- Breen G, Harold D, Ralston S, Shaw D, St Clair D: Determining SNP allele frequencies in DNA pools. Biotechniques. 2000, 28: 464-466, 468, 470.
- Norton N, Williams NM, Williams HJ, Spurlock G, Kirov G, Morris DW, et al: Universal, robust, highly quantitative SNP allele frequency measurement in DNA pools. Hum Genet. 2002, 110: 471-478. 10.1007/s00439-002-0706-6.
- Le Hellard S, Ballereau SJ, Visscher PM, Torrance HS, Pinson J, Morris SW, et al: SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Res. 2002, 30: e74. 10.1093/nar/gnf070.
- Sham P, Bader JS, Craig I, O'Donovan M, Owen M: DNA Pooling: a tool for large-scale association studies. Nat Rev Genet. 2002, 3: 862-871. 10.1038/nrg930.
- Douglas J, Skol A, Boehnke M: Probability of Detection of Genotyping Errors and Mutations as Inheritance Inconsistencies in Nuclear-Family Data. Am J Hum Genet. 2002, 70: 487-495. 10.1086/338919.
- Zou G, Zhao H: Haplotype Frequency Estimation in the Presence of Genotyping Errors. Hum Hered. 2003, 56: 131-138. 10.1159/000073741.
- Gordon D, Finch SJ, Nothnagel M, Ott J: Power and Sample Size Calculations for Case-Control Genetic Association Tests when Errors Are Present: Application to Single Nucleotide Polymorphisms. Hum Hered. 2002, 54: 22-33. 10.1159/000066696.
- Zou G, Zhao H: The Impacts of Errors in Individual Genotyping and DNA Pooling on Association Studies. Genetic Epidemiology. 2004, 26: 1-10.1002/gepi.10277.
- Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol. 1995, 12: 921-927.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.