We investigated the impact of genotyping errors on the performance of the Mantel Statistic Using Haplotype Sharing and the haplotype-based score test, haplo.score. Both are haplotype-based tests that have been applied to case-control data in population-based candidate gene association studies. Haplo.score is based on genotypes within a generalized linear model framework and accounts for the uncertainty of haplotype phase in its calculations. In contrast, the Mantel Statistic Using Haplotype Sharing needs the complete phase information of the individuals under study, i.e. the corresponding haplotype pair for each individual. To make the two haplotype-based methods more comparable, we used the individuals' haplotype pairs determined via the EM algorithm as implemented in R (haplo.em) [31]. This algorithm is also used by haplo.score and provides the same haplotype frequency distribution as incorporated in the haplo.score procedure.

Haplo.score tests the global hypothesis of whether any of the examined haplotypes is associated with disease, whereas the Mantel Statistic Using Haplotype Sharing is a pointwise test that only incorporates the information of neighbouring markers. Hence, the comparison of these two methods might be hampered by the fact that they test different null hypotheses. We therefore additionally investigated the pointwise Armitage trend test as a third association test statistic. It has been shown that the trend test achieves the greatest power, compared to other Chi-Squared tests, when there is no prior knowledge of the underlying disease model or in the presence of deviation from HWE [29]. Since the disease model in this simulation study is known to be recessive, the 2 × 2 (1 df) Chi-Squared test for independence based on the count of minor-allele homozygotes (genotype coding 0, 0, 1) should be the most powerful test [34]. However, in the presence of genotyping errors, the advantage in power of this 1 df Chi-Squared test over the Armitage trend test, for a known recessive disease model, could not be confirmed [29].
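To make the relation between the two pointwise Chi-Squared tests concrete, the following minimal sketch computes the Armitage trend statistic for arbitrary genotype scores. With scores (0, 1, 2) it gives the usual trend test, and with scores (0, 0, 1) it reduces to the uncorrected 2 × 2 Chi-Squared test under recessive coding. The function names and the genotype counts are illustrative assumptions and are not taken from the study or from any package used in it.

```python
import math

def trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic (1 df) for genotype counts.

    cases, controls: counts ordered as (major homozygote, heterozygote,
    minor homozygote). weights: scores assigned to the three genotypes.
    Assumes the marker is polymorphic, otherwise the denominator is zero.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    numerator = n * (n * sum_wr - n_cases * sum_wn) ** 2
    denominator = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    return numerator / denominator

def p_value_1df(chi2):
    """Upper tail probability of a Chi-Squared variable with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical genotype counts, not data from the study.
cases = (50, 35, 15)
controls = (55, 40, 5)

additive = trend_chi2(cases, controls)              # Armitage trend test
recessive = trend_chi2(cases, controls, (0, 0, 1))  # recessive 2 x 2 test
print(f"trend:     chi2 = {additive:.3f}, p = {p_value_1df(additive):.4f}")
print(f"recessive: chi2 = {recessive:.3f}, p = {p_value_1df(recessive):.4f}")
```

Using one function for both codings makes explicit that the two tests differ only in the scores assigned to the genotypes, so any difference in power comes from how well those scores match the true disease model.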

We find that in the presence of genotyping errors with a mean error rate per locus of 0.2%, the type I error rate and the empirical power of all three statistics are not affected at all. The type I error rate is highly inflated only for high and differential genotyping error rates (8% and 15.6%). The magnitude of the increase in type I error rate depends on the sample size: the type I error rate is more inflated with a larger than with a smaller sample size. This was previously reported by Moskvina et al. [25] and can be explained by the fact that differential errors are systematic errors ([35], p. 116). In the presence of high differential error rates, the two haplotype-based approaches were more sensitive, i.e. they showed clearly higher type I error rates than the Armitage trend test.
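The dependence on sample size can be illustrated with a small deterministic sketch. It evaluates the trend statistic at the expected genotype counts when there is no true association but a fraction of heterozygous cases is miscalled. The error model (heterozygotes miscalled as major homozygotes, in cases only), the minor allele frequency and the function names are assumptions chosen for illustration and do not reproduce the error models simulated in this study.

```python
def trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic for (possibly fractional) counts."""
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    numerator = n * (n * sum_wr - n_cases * sum_wn) ** 2
    return numerator / (n_cases * n_controls * (n * sum_w2n - sum_wn ** 2))

def expected_chi2(n_per_group, maf, miscall_rate):
    """Trend statistic at the expected counts under the null hypothesis.

    Controls follow HWE proportions. In cases, a fraction miscall_rate of
    heterozygotes is recorded as major homozygotes (hypothetical model).
    """
    p, q = 1.0 - maf, maf
    het = 2.0 * p * q
    control_freq = (p * p, het, q * q)
    case_freq = (p * p + miscall_rate * het, (1.0 - miscall_rate) * het, q * q)
    cases = [n_per_group * f for f in case_freq]
    controls = [n_per_group * f for f in control_freq]
    return trend_chi2(cases, controls)

# The same systematic bias produces a statistic that grows with sample size.
for n in (250, 500, 1000, 2000, 4000):
    print(n, round(expected_chi2(n, maf=0.2, miscall_rate=0.08), 3))
```

Because the bias in the case genotype frequencies is fixed while the sampling variance shrinks as the sample grows, the statistic at the expected counts scales linearly with the number of individuals. This is one way to see why a systematic error inflates the type I error rate more strongly in larger samples.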

Genotyping errors affect the number of haplotypes and shift the haplotype distribution towards an increased number of rare haplotypes. The number of rare haplotypes increases with higher genotyping error rates. The size of the study sample is also positively correlated with the number of additional haplotypes due to genotyping errors. Our results indicate that this large number of rare haplotypes is the reason for the inflation of the type I error rate of the Mantel Statistic Using Haplotype Sharing, since the statistic is based on all haplotypes, including the many rare ones. Our results agree with previously reported observations of an inflated type I error rate in the presence of undetectable or sample-specific errors (differential errors) [6–8, 29].
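The mechanism by which errors create rare haplotypes can be sketched with a toy simulation. As a deliberate simplification, the sketch flips alleles directly on known haplotypes, whereas in this study errors were introduced at the genotype level and haplotypes were then reconstructed with the EM algorithm. The four base haplotypes, the error rates, the 1% rarity threshold and all function names are hypothetical.

```python
import random
from collections import Counter

def add_allele_errors(haplotypes, error_rate, rng):
    """Flip each 0/1 allele independently with probability error_rate."""
    return [
        tuple(allele ^ 1 if rng.random() < error_rate else allele for allele in hap)
        for hap in haplotypes
    ]

def haplotype_summary(haplotypes, rare_below=0.01):
    """Return (number of distinct haplotypes, number with frequency < rare_below)."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    rare = sum(1 for c in counts.values() if c / total < rare_below)
    return len(counts), rare

# Four hypothetical common haplotypes over eight biallelic markers.
BASE = [
    (0, 0, 0, 0, 0, 0, 0, 0),
    (1, 1, 0, 0, 1, 0, 1, 0),
    (0, 1, 1, 0, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 0, 1, 1),
]

def simulate(n_haplotypes, error_rate, seed=1):
    """Draw an equal mix of the base haplotypes, add errors and summarise."""
    rng = random.Random(seed)
    sample = [BASE[i % len(BASE)] for i in range(n_haplotypes)]
    return haplotype_summary(add_allele_errors(sample, error_rate, rng))

for n in (1000, 4000):
    for rate in (0.0, 0.002, 0.02):
        distinct, rare = simulate(n, rate)
        print(f"n={n}, error rate={rate}: {distinct} distinct, {rare} rare")
```

Without errors only the four base haplotypes are observed. Each erroneous allele call turns a common haplotype into a variant that is usually seen only a few times, so the number of distinct haplotypes grows with the error rate, and a larger sample offers more opportunities for such variants to appear.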

The gain in power observed for all three association test statistics under high and differential genotyping error rates is consistent with the inflated type I error rate described above. We also observe a loss in power for nondifferential genotyping errors, as reported by Heid et al. [28]. On the other hand, the observation of Moskvina et al. [25] that the type I error rate of a haplotype-based association statistic is highly inflated even in the presence of a small genotyping error rate of less than 1% cannot be confirmed with this simulation. Moskvina et al. [25] drew this conclusion for markers in high LD with a relatively low minor allele frequency, whereas the markers we examined comprise haplotypes in two blocks of high LD and have MAFs between 0.028 and 0.45. Nevertheless, we are able to confirm the effect of sample size on the type I error rate in the presence of genotyping errors that Moskvina et al. [25] reported.

There has been a lively discussion on whether excluding markers that are not in HWE from the data analysis is an appropriate way to deal with genotyping errors [19–21]. Our results support the criticism of this approach, showing that the proportion of genotyping errors detected by testing for deviation from HWE can be quite low. In particular, for common alleles, deviation from HWE is not a sufficient indicator of genotyping errors. We should point out that the chosen cut-off of p < 0.05 to indicate significant deviation from HWE is already very strict. Choosing a less stringent cut-off (i.e. a smaller p-value threshold), as often suggested and applied in practice, would further decrease the number of genotyping errors detected. Differential errors were simulated to occur either only in cases or at different markers in cases and controls, as in most real situations. A test for deviation from HWE applied to controls only is therefore not appropriate for detecting such differential errors. Hence, the exclusion of markers not in HWE does not substantially reduce the inflated type I error rate. Furthermore, the exclusion of markers leads to a general loss in power, since markers truly associated with disease may also be eliminated.
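The limited sensitivity of HWE testing for common alleles can be illustrated with a minimal sketch of the 1 df Chi-Squared goodness-of-fit test for HWE. The genotype counts and the function name are hypothetical. The second call assumes that 4 of 200 heterozygotes (2%) were miscalled, two as each homozygote, which is an illustrative error pattern rather than one of the error models used in this study.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-Squared goodness-of-fit test (1 df) for Hardy-Weinberg equilibrium.

    Returns (chi2, p_value). A monomorphic marker carries no information
    about HWE and is reported as (0.0, 1.0).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele a
    q = 1.0 - p
    if p in (0.0, 1.0):
        return 0.0, 1.0
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail probability of a Chi-Squared variable with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical marker with allele frequency 0.5, exactly in HWE.
print(hwe_chi2(100, 200, 100))

# The same marker after 2% of heterozygotes were miscalled as homozygotes.
print(hwe_chi2(102, 196, 102))
```

In this example the miscalled genotypes shift the counts only slightly relative to their HWE expectations, so the test statistic stays far below the 5% critical value of 3.84 and the marker would not be flagged.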

We show that in the presence of a realistic level of genotyping errors (a mean error rate per locus of 0.2%), all three examined methods for testing association in candidate regions perform well. The Mantel Statistic Using Haplotype Sharing and the Armitage trend test hold their pointwise nominal significance level of 5%, and haplo.score holds its global nominal significance level of 5%. The power to detect the putative disease locus or a haplotype-specific association remained high, at 89%–94%.