Influence of genotyping error in linkage mapping for complex traits – an analytic study

Background Despite the current trend towards large epidemiological studies of unrelated individuals, linkage studies in families are still widely used as tools for disease gene mapping. The use of single-nucleotide-polymorphism (SNP) array technology in genotyping of family data has the potential to provide more informative linkage data. Nevertheless, SNP array data are not immune to genotyping error, which, as has been suggested in the past, could dramatically affect the evidence for linkage, especially in selective designs such as affected sib pair (ASP) designs. The influence of genotyping error on selective designs for continuous traits has not been assessed yet. Results We use the identity-by-descent (IBD) regression-based paradigm for linkage testing to analytically quantify the effect of simple genotyping error models under specific selection schemes for sibling pairs. We show, for example, that in extremely concordant (EC) designs, genotyping error leads to decreased power whereas it leads to increased type I error in extremely discordant (ED) designs. Perhaps surprisingly, the effect of genotyping error on inference is most severe in designs where selection is least extreme. We suggest a genomic control for genotyping errors via a simple modification of the intercept in the regression for linkage. Conclusion This study extends earlier findings: genotyping error can substantially affect type I error and power in selective designs for continuous traits. Designs involving both EC and ED sib pairs are fairly immune to genotyping error. When those designs are not feasible, the simple genomic control strategy that we suggest offers the potential to deliver more robust inference, especially if genotyping is carried out by SNP array technology.


Background
Linkage analysis of family data has been extensively used in the past in the search for genetic determinants. Nowadays, investigators favor large epidemiological studies of unrelated individuals; however, several family datasets are currently being re-analyzed and/or pooled (e.g. [1]). The persistence of interest in linkage is partly triggered by the advent of single-nucleotide-polymorphism (SNP) array genotyping technology in the field; indeed, SNP arrays hold the promise of more reliable linkage maps [2,3]. Although less prone to genotyping error than microsatellites when viewed as singlepoint markers, SNP arrays rely heavily on multipoint algorithms for accurate determination of the identical by descent (IBD) status of alleles. The gain in singlepoint reliability might therefore be cancelled out by the propagation of errors across the many SNPs required to infer IBD status.
In the search for genetic determinants of complex traits by linkage, the use of selective designs appears to be an efficient way to gain adequate power for detection of typically small gene effects. A few authors have shown by simulation that the impact of genotyping error on evidence for linkage could be particularly severe in affected sib-pair (ASP) designs [4][5][6], virtually masking most of the evidence for linkage. The impact of error on quantitative traits appears to be less dramatic in random samples; however, it is unclear whether the same dramatic power losses hold in selected samples.
A method of choice is now emerging for the analysis of quantitative traits arising from selected sib pairs. This method is essentially a regression through the origin of excess identical by descent (IBD) sharing on a function of the trait value, whose slope is an estimate of the linkage parameter. It was first proposed by Sham et al. [7] and turns out to be equivalent to a score test [8]. In a numerical comparison of methods for selected samples, Skatkiewicz et al. [9] and Cuenco et al. [10] showed that this method had good properties in finite samples for extreme proband ascertained sib-pair and discordant sib-pair designs. Using simple genotyping error models (population frequency error model and false homozygosity model), we show analytically what effects such error-generating processes (occurring at rate $\varepsilon$ per sib pair) induce for an idealized fully informative marker. They result in a reduction of the slope estimate (i.e. of the estimated linkage parameter) by a factor $1 - \varepsilon/2$, whether sib pairs are selected or not. Since the genotyping error rate $\varepsilon$ is typically small, this effect on the linkage test is minimal. In addition to this slope effect, the regression's intercept is modified and this may have a much more sizable effect on the test for linkage, depending on the sampling scheme used to select sib pairs. Surprisingly, this simple result allows us to predict that in extremely concordant (EC) sib pair designs and in ASP designs, the effect of genotyping error will be milder as the selection becomes more extreme. In extremely discordant (ED) designs, the effect can in theory be either increased type I error or decreased power depending on the definition of discordance, the genotyping error rate and the true linkage effect; in practice however, for small quantitative trait locus (QTL) effects, the result will be an increased type I error.
We argue that the basic error generating mechanisms assumed provide reasonable approximations of real-life situations.
In the next section, we first describe some common error-generating processes and quantify their effect on IBD sharing in an idealized situation where marker information is complete. We then briefly sketch the inverse regression approach to linkage, show analytically what the effect of genotyping error is on this regression, and quantify the subsequent bias, power and type I error in common selective designs. We argue that under certain assumptions regarding the error model, one can easily implement a linkage test that incorporates a genomic control for genotyping error. Finally, we discuss some assumptions made in our study and the practical relevance of our findings. In particular, we argue that our results generalize to situations where marker information is incomplete and that the smaller error rates observed in SNP chip arrays compared to microsatellites offer no protection against bias in analysis.

Genotyping error models
We consider two mechanisms for the generation of errors in marker data, namely the population frequency error model and the false homozygosity model. In those two models, we consider a single marker with m alleles and further assume that a maximum of one allelic error per sib pair can be made and that this happens with probability $\varepsilon$. This restriction to 'one error per sib pair' is just a first order approximation, for small $\varepsilon$, of a process where all four alleles would be allowed to be independently erroneous and does not restrict the generalizability of our results.
The population frequency error model re-assigns the erroneous allele (chosen at random among the four forming the sib-pair genotype) to one of the possible m alleles with probability equal to population allele frequency. One mathematical advantage of this model is that the marginal distribution of alleles and genotypes is unaltered. The false homozygosity model keeps homozygotes unchanged but reassigns heterozygotes to homozygotes with alleles equal to one of the two original alleles chosen according to probabilities proportional to population allele frequencies.
To our knowledge, false homozygosity is a common type of error: fairly rare alleles go unreported in samples. The population frequency error model provides an approximation to a process whereby alleles are misread. Errors at the two alleles of a marker's genotype might be correlated; we do not consider this type of process in detail here, although the effect on linkage will be qualitatively the same as in the two other models. We refer the reader to Sobel et al. [11] for a detailed exposé on genotyping error mechanisms. Note that the two models that we have chosen have been used in the past in order to identify potential genotyping errors [4,11].
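The two error mechanisms described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the representation of a sib-pair genotype as a pair of allele tuples, and the choice of the affected sibling at random in the false homozygosity model are all hypothetical and are not taken from the original study.

```python
import random

def population_frequency_error(pair, freqs, rng):
    """Re-assign one of the four alleles (chosen at random) to an allele
    drawn according to the population allele frequencies in `freqs`."""
    alleles = [pair[0][0], pair[0][1], pair[1][0], pair[1][1]]
    k = rng.randrange(4)  # position of the erroneous allele
    names = list(freqs)
    alleles[k] = rng.choices(names, weights=[freqs[a] for a in names])[0]
    return ((alleles[0], alleles[1]), (alleles[2], alleles[3]))

def false_homozygosity_error(pair, freqs, rng):
    """Turn a heterozygous sibling into a homozygote for one of its two
    alleles, chosen with probability proportional to allele frequency.
    Assumption of this sketch: the sibling is picked at random."""
    sib = rng.randrange(2)
    a, b = pair[sib]
    if a == b:  # homozygotes are left unchanged
        return pair
    kept = rng.choices([a, b], weights=[freqs[a], freqs[b]])[0]
    new_genotype = (kept, kept)
    return (new_genotype, pair[1]) if sib == 0 else (pair[0], new_genotype)

def apply_error(pair, freqs, eps, model, rng):
    """Apply at most one error per sib pair, with probability eps."""
    return model(pair, freqs, rng) if rng.random() < eps else pair

rng = random.Random(1)
freqs = {"A": 0.5, "B": 0.3, "C": 0.2}   # hypothetical allele frequencies
pair = (("A", "B"), ("A", "C"))          # sib pair sharing allele A
print(apply_error(pair, freqs, 0.01, population_frequency_error, rng))
```

Restricting each call to a single altered allele or genotype mirrors the 'one error per sib pair' approximation made in the text.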

Impact on IBD sharing
Let us denote by $\pi$ the proportion of alleles shared identical by descent (IBD) at a certain locus by two siblings. Tests for linkage are based on the IBD sharing distribution and, although errors as described earlier are made at the genotype level (the true genotype $G$ is read as $\hat{G}$), the effect of errors on linkage is entirely mediated via the distortion of the IBD distribution (the true IBD status $\pi$ of two siblings may be incorrectly inferred as $\hat{\pi}$). We are therefore interested in deriving the probability distribution $P(\hat{\pi} \mid \pi)$. This is done by conditioning on both the true and observed genotypes as follows:
$$P(\hat{\pi} \mid \pi) = \sum_{G} \sum_{\hat{G}} P(\hat{\pi} \mid \hat{G})\, P(\hat{G} \mid G)\, P(G \mid \pi)$$
Let us consider the case of complete information. This can be conceptualized by means of an idealized marker whose number of alleles is infinite; in particular, identity by state (IBS) status is equivalent to IBD status. The unordered genotypes of a sib pair can be partitioned into seven exclusive classes denoted ii/ii, ii/ij, ii/jj, ii/jk, ij/ij, ij/ik and ij/kl depending on the number of homozygous sibs in the pair and the number of distinct alleles in the sib-pair genotype. Sharing 0 alleles IBD corresponds to a sib-pair genotype of the ij/kl class. Should an error occur according to the population frequency error model, one of the four alleles would be transformed into yet another type (since the number of alleles is infinite, the probability that the new allele is read as one of i, j, k or l tends to 0); the sib-pair genotype therefore remains in the ij/kl class and the observed IBD status $\hat{\pi}$ is still 0. For the same starting genotype, an error according to the false homozygosity model produces an ii/jk class and $\hat{\pi}$ also equals 0. Therefore $P(\hat{\pi} = 0 \mid \pi = 0) = 1$ whatever the genotyping error mechanism considered previously. The same line of reasoning (for a pair of the ij/ik class, only an error affecting one of the two copies of the shared allele i destroys the sharing, whereas any error in a pair of the ij/ij class removes one shared allele) leads to $P(\hat{\pi} = 0.5 \mid \pi = 0.5) = 1 - \varepsilon/2$, $P(\hat{\pi} = 0 \mid \pi = 0.5) = \varepsilon/2$, $P(\hat{\pi} = 1 \mid \pi = 1) = 1 - \varepsilon$ and $P(\hat{\pi} = 0.5 \mid \pi = 1) = \varepsilon$.
Those results can be summarized by the transition matrix below, where the (i, j) element is equal to $P(\hat{\pi} = (j-1)/2 \mid \pi = (i-1)/2)$:
$$\begin{pmatrix} 1 & 0 & 0 \\ \varepsilon/2 & 1 - \varepsilon/2 & 0 \\ 0 & \varepsilon & 1 - \varepsilon \end{pmatrix}$$
The overall effect of genotyping error is thus to reduce the observed IBD sharing: indeed $E(\hat{\pi} \mid \pi) = (1 - \varepsilon/2)\pi$ and $E(\hat{\pi}) = 1/2 - \varepsilon/4$, while the variance is practically unchanged since $\mathrm{var}(\hat{\pi}) = 1/8 - \varepsilon^2/16$. In selected samples of extremely concordant sib pairs (EC) where linkage is evidenced by an excess in IBD sharing, it therefore seems logical to expect a decrease in power. Conversely, in selected samples of extremely discordant sib pairs (ED) where linkage is evidenced by a reduction in IBD sharing, the test might lead to increased type I error. In the next subsection, we formally quantify this bias in selective sampling schemes for quantitative traits under the usual assumption of a normal variance components model.
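As a quick check of the moments quoted above, the following sketch computes the mean and variance of the observed IBD sharing exactly, using rational arithmetic. It assumes the transition probabilities P(0 | 0.5) = ε/2 and P(0.5 | 1) = ε for an error rate ε per sib pair (the values consistent with an expected sharing of 1/2 − ε/4) and the usual null IBD distribution (1/4, 1/2, 1/4); the function names are hypothetical.

```python
from fractions import Fraction

def transition_matrix(eps):
    """Rows: true IBD sharing 0, 1/2, 1; columns: observed sharing 0, 1/2, 1."""
    half = eps / 2
    return [
        [1, 0, 0],
        [half, 1 - half, 0],
        [0, eps, 1 - eps],
    ]

def observed_moments(eps):
    """Mean and variance of observed IBD sharing at an unlinked locus."""
    states = [Fraction(0), Fraction(1, 2), Fraction(1)]
    prior = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
    T = transition_matrix(eps)
    # Marginal distribution of the observed IBD status
    p_obs = [sum(prior[i] * T[i][j] for i in range(3)) for j in range(3)]
    mean = sum(p * s for p, s in zip(p_obs, states))
    var = sum(p * s * s for p, s in zip(p_obs, states)) - mean ** 2
    return mean, var

eps = Fraction(1, 100)
mean, var = observed_moments(eps)
print(float(mean), float(var))
```

Under these assumptions the mean equals 1/2 − ε/4 and the variance equals 1/8 − ε²/16, so a 1% error rate moves the mean sharing to 0.4975 while leaving the variance essentially at 1/8.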

Impact on linkage testing
Regression-based linkage testing
We assume that the sib pair phenotypic data $x = (x_1, x_2)'$ have been adjusted for any relevant covariates (e.g. sex, age, country, ...) and have been standardized so that the (known) population mean, variance and sib-sib correlation are 0, 1 and $\rho$ respectively. Under the additive variance components model, $x$ given IBD information $\pi$ follows a bivariate normal distribution with zero mean and variance-covariance matrix given by
$$\Sigma = \begin{pmatrix} 1 & \rho + \gamma(\pi - 1/2) \\ \rho + \gamma(\pi - 1/2) & 1 \end{pmatrix}$$
where $\gamma \ge 0$ denotes the proportion of total variance explained by the putative locus. Under this model, an optimal testing strategy first advocated in [7] (and sometimes referred to as the optimal Haseman-Elston regression) is to regress (through the origin) excess IBD sharing $\pi - 1/2$ on the following C function of the trait values:
$$C(x, \rho) = \frac{(1 + \rho^2)\, x_1 x_2 - \rho\,(x_1^2 + x_2^2) + \rho\,(1 - \rho^2)}{(1 - \rho^2)^2}$$
This test turns out to be a score test for the linkage parameter $\gamma$ [8] and is based upon the following approximate relation, which is valid for small locus effects [12]:
$$E(\pi - 1/2 \mid x) \approx \frac{\gamma}{8}\, C(x, \rho) \qquad (2)$$
Fisher's information $I = \sum_i C(x_i, \rho)^2 / 8$, which depends on sample size and study design, therefore controls power. In the design phase of a study, $I$ should be used as a criterion to differentiate between alternative designs rather than sample size only [12,13].
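The regression through the origin described above can be sketched as follows. The closed form of C is taken here to be the score of the bivariate normal log-likelihood with respect to the sib-sib correlation, and the estimator uses the null variance 1/8 of IBD sharing for a fully informative marker; both, like the function names, are assumptions of this sketch rather than a transcription of the authors' code.

```python
import math

def c_score(x1, x2, rho):
    """C function of standardized sib-pair trait values (x1, x2)."""
    num = (1 + rho**2) * x1 * x2 - rho * (x1**2 + x2**2) + rho * (1 - rho**2)
    return num / (1 - rho**2) ** 2

def linkage_regression(pi_values, c_values):
    """Regress excess IBD sharing (pi - 1/2) on C/8 through the origin.

    Returns the slope estimate of gamma, Fisher's information and the
    corresponding standard normal test statistic."""
    sxy = sum(c * (p - 0.5) for p, c in zip(pi_values, c_values))
    scc = sum(c * c for c in c_values)
    gamma_hat = 8.0 * sxy / scc
    info = scc / 8.0
    z = gamma_hat * math.sqrt(info)
    return gamma_hat, info, z

# Toy example with two sib pairs (values chosen only for illustration)
traits = [(1.8, 2.1), (1.9, -2.0)]
rho = 0.3
c_values = [c_score(x1, x2, rho) for x1, x2 in traits]
print(linkage_regression([1.0, 0.0], c_values))
```

In this toy input the concordant pair has a positive C and shares both alleles, while the discordant pair has a negative C and shares none, so both pairs push the slope estimate upwards.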

Impact of genotyping error on regression
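By conditioning on the true IBD sharing values, we can compute $P(\hat{\pi} \mid x, \gamma, \varepsilon) = \sum_{\pi} P(\hat{\pi} \mid \pi)\, P(\pi \mid x, \gamma)$, using the transition probabilities $P(\hat{\pi} \mid \pi)$ derived earlier, while the $P(\pi \mid x, \gamma)$'s are given in [12]. This permits computation of the new regression line in presence of genotyping error as
$$E(\hat{\pi} - 1/2 \mid x) \approx -\frac{\varepsilon}{4} + \left(1 - \frac{\varepsilon}{2}\right) \frac{\gamma}{8}\, C \qquad (3)$$
As mentioned earlier, the corresponding variance under the null hypothesis is only slightly altered. The effect of genotyping error is thus to shrink the regression line by a factor $1 - \varepsilon/2$ and to shift the intercept by $-\varepsilon/4$. If we ignore genotyping error, i.e. we estimate $\gamma$ using $\hat{\gamma} = 8 \sum_i C_i (\hat{\pi}_i - 1/2) / \sum_i C_i^2$, this results in a biased estimator with $E(\hat{\gamma}) = (1 - \varepsilon/2)\gamma - 2\varepsilon A$, where $A = \sum_i C_i / \sum_i C_i^2$.
The resulting test statistic would then have power equal to
$$1 - \Phi\left( z_{1-\alpha} - \sqrt{I}\, \left[ (1 - \varepsilon/2)\gamma - 2\varepsilon A \right] \right) \qquad (4)$$
where $\Phi$ is the standard normal distribution function, $z_{1-\alpha}$ its $(1-\alpha)$ quantile, $\varepsilon$ the genotyping error rate and $I$ Fisher's information. Note that taking $\gamma = 0$ in this formula gives the type I error rate. Since $\sqrt{I}$ increases with sample size, the impact of genotyping error on both power and type I error will be larger as the sample size increases. In terms of Y versus X regression, the intuition is that the regression through the origin is not affected by a general shift in the Y-variable (IBD sharing) if the X-variable (C variable) has average 0, or takes values far away from 0. The further away the X-variable C is from 0, the smaller A, hence the smaller the bias.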
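To make the dependence of power and type I error on the error rate concrete, here is a small numerical sketch. It assumes a one-sided normal test in which the expected test statistic is the square root of Fisher's information times the expected slope estimate, with the expected slope written as (1 − ε/2)γ − 2εA for error rate ε and design quantity A; this form, the function name and the value A = ±0.5 used below are assumptions for illustration, not values from the study's tables.

```python
import math
from statistics import NormalDist

def power_with_error(gamma, eps, A, info, alpha=1e-4):
    """Approximate power of the one-sided linkage test under genotyping error.

    gamma: proportion of variance explained by the locus (0 gives type I error)
    eps:   genotyping error rate per sib pair
    A:     design-specific quantity (assumed here to be sum(C) / sum(C^2))
    info:  Fisher's information of the design
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)
    expected_z = math.sqrt(info) * ((1 - eps / 2) * gamma - 2 * eps * A)
    return 1 - nd.cdf(z_crit - expected_z)

# No genotyping error: information 2500 and a QTL explaining 10% of variance
print(power_with_error(0.10, 0.0, 0.0, 2500))
# Type I error with a 1% error rate and a hypothetical negative A
print(power_with_error(0.0, 0.01, -0.5, 2500))
```

With no error, the first call returns a power close to 90% at a pointwise level of 10^-4, in line with the calibration used in the text; a negative A inflates the type I error and a positive A lowers the power.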

Bias and impact on power and type I error
Since $E(\hat{\gamma}) - \gamma = -(\varepsilon/2)\gamma - 2\varepsilon A$ and $\gamma$ is typically small, the distortion of the usual linkage test in presence of genotyping error heavily depends on the design-specific quantity A. Unfortunately, there is little intuition about the distribution of C (hence about the distribution of A) in the whole population or in a selected sample. Nevertheless, Monte Carlo simulations can be used to determine the characteristics of the C and A distributions in the whole population or for a specific ascertainment scheme. In random samples and under the variance components model, C is a score function, hence $E(C) = 0$ and its sample mean will be close to 0; one can also check that its distribution is negatively skewed (unless $\rho = 0$).
The result is that the bias will be small for random samples. The same finding would hold for any ascertainment scheme where the mean of C is 0. An optimal selection scheme [12] that would select sib pairs based on Fisher's information (i.e. such that $|C| \ge C_0$) does not warrant that the mean of C is 0 because of the skewness of C. In EC designs (both siblings have trait values either larger than a positive threshold or smaller than a negative threshold), the mean of C tends to be positive, while it tends to be negative in ED designs (one sibling's trait value is larger than a positive threshold while the other sibling's trait value is smaller than a negative threshold). The linkage test will therefore have reduced power in EC designs and increased type I error in ED designs.
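The Monte Carlo approach mentioned above can be sketched as follows. The sketch draws sib-pair trait values from a standard bivariate normal distribution with correlation ρ, applies a symmetric threshold to define EC and ED pairs, and returns the ratio of the sum of C to the sum of squared C as the design quantity A. The form of C, the definition of A as that ratio, the threshold of 1 standard deviation and the function names are all assumptions made for illustration.

```python
import math
import random

def c_score(x1, x2, rho):
    """C function of standardized sib-pair trait values (assumed form)."""
    num = (1 + rho**2) * x1 * x2 - rho * (x1**2 + x2**2) + rho * (1 - rho**2)
    return num / (1 - rho**2) ** 2

def simulate_A(design, rho=0.3, threshold=1.0, n_draws=200_000, seed=0):
    """Monte Carlo estimate of A = sum(C) / sum(C^2) for a sampling design.

    design: "random", "EC" (both sibs beyond the same threshold) or
            "ED" (sibs beyond opposite thresholds)."""
    rng = random.Random(seed)
    scale = math.sqrt(1 - rho**2)
    sum_c = 0.0
    sum_c2 = 0.0
    for _ in range(n_draws):
        x1 = rng.gauss(0, 1)
        x2 = rho * x1 + scale * rng.gauss(0, 1)  # correlation rho with x1
        concordant = (x1 > threshold and x2 > threshold) or \
                     (x1 < -threshold and x2 < -threshold)
        discordant = (x1 > threshold and x2 < -threshold) or \
                     (x1 < -threshold and x2 > threshold)
        keep = (design == "random"
                or (design == "EC" and concordant)
                or (design == "ED" and discordant))
        if keep:
            c = c_score(x1, x2, rho)
            sum_c += c
            sum_c2 += c * c
    return sum_c / sum_c2

for design in ("random", "EC", "ED"):
    print(design, simulate_A(design))
```

Under these assumptions the estimate is close to 0 for random samples, positive for EC pairs and negative for ED pairs, which is the sign pattern the argument in the text relies on.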
In the left-hand side of Table 1, we have computed the values of A and for the three selective schemes considered.
The designs are indexed by the sib-sib correlation $\rho$ and the degree of selection. One obvious way to correct for the shift in the intercept induced by genotyping error would be to leave the regression unconstrained; this would correct for most of the bias. Unfortunately, in selected designs where the variance of C is reduced, this results in a substantial loss of efficiency.
In Table 2, we report the power and type I error for realistic genotyping error rates [14] equal to 0.005 and 0.01 for the same designs as in Table 1. The equivalent sample size used corresponds to samples with Fisher's information equal to 2500, which provides 90% power to detect a QTL explaining 10% of the total variance in absence of genotyping error (pointwise nominal error rate = $10^{-4}$). The most visible impact is on type I error rates in ED designs, where it is up to 7 times its nominal value. The design that combines EC and ED sib pairs appears to be fairly immune to genotyping error, while EC designs do not incur power losses greater than 20%. Finally, those computations confirm the intuition expressed earlier that the effect of genotyping error is less severe in more extreme selection schemes.

Genomic control for genotyping error
As we have seen in previous sections, the main effect of genotyping error is to modify the intercept in the regression used to test for linkage. Although an unconstrained regression would correct most of the bias due to genotyping error, the inefficiency of this strategy makes it impractical. In order to obtain an efficient and robust inference, it therefore seems natural to try and constrain the regression through its correct origin a. In this section, we propose a completely data-driven strategy for doing this.
At any position, the sample mean IBD sharing has variance $1/(8n)$ where n is the number of sib pairs available. If we knew that the position is unlinked or if the sample of sib pairs was random, then the deviation of this mean from 1/2 would provide an estimate of the intercept a in the linkage regression.
Unfortunately, detection of a position-specific intercept corresponding to typical error rates would require a sample size of order $10^4$, a number that is almost never attained in practice.
In selected samples, we can use a trimmed version of the mean of y: for example, a 20%-trimmed mean of the $(y_t)_t$ series (i.e. the mean of the $y_t$ values after removing the 20% lowest and 20% highest values) will provide a robust genomic estimate of a. Because $a \le 0$ and the mean of C is positive and negative in EC designs and ED designs respectively, the estimate of a could be refined by trimming off only the 20% highest and lowest $y_t$ values respectively before taking the mean. Of course, how much we trim is arbitrary, but 20% can safely be taken as a conservative value for oligogenic traits (indeed, a 3500 cM genome contains approximately 70 quasi-independent loci, so a 20% trimming of $y_t$ values discards 14 positions, including all active gene positions if there are fewer than 14 genes, from the sample used to estimate the intercept a). An ad-hoc implementation of the concept of genomic control is then to plug in the estimate of the intercept into the linkage regression (3). Since most of the bias in the inference is due to the intercept mis-specification, the precise estimate obtained by pooling across the genome will eliminate it. The implicit assumption that we make in this genomic control approach is that the regression intercept is the same at all positions; this will be challenged in the next section.
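The trimming and plug-in steps can be sketched as below. The sketch assumes that each genome-wide value is the sample mean of the observed IBD sharing minus 1/2 at one position, and that the corrected estimate is obtained by subtracting the estimated intercept from the excess sharing before regressing through the origin on C/8. Both assumptions, as well as the function names and toy numbers, are illustrative and not taken from the original study.

```python
def trimmed_mean(values, lower=0.2, upper=0.2):
    """Mean after dropping the `lower` fraction of smallest values and the
    `upper` fraction of largest values."""
    ordered = sorted(values)
    n = len(ordered)
    start = int(n * lower)
    stop = n - int(n * upper)
    kept = ordered[start:stop]
    return sum(kept) / len(kept)

def corrected_gamma(pi_hat, c_values, intercept):
    """Slope of the regression through the corrected origin: excess sharing
    (pi_hat - 1/2 - intercept) regressed on C/8 (assumed form)."""
    sxy = sum(c * (p - 0.5 - intercept) for p, c in zip(pi_hat, c_values))
    scc = sum(c * c for c in c_values)
    return 8.0 * sxy / scc

# Hypothetical genome-wide series of mean excess sharing, one value per position
y = [-0.004, -0.002, -0.003, 0.020, -0.001, -0.003, -0.002, 0.015, -0.004, -0.002]
a_two_sided = trimmed_mean(y)            # trims both tails
a_ec = trimmed_mean(y, lower=0.0)        # EC design: trim only the highest values
print(a_two_sided, a_ec)
print(corrected_gamma([1.0, 0.5], [2.0, 1.0], a_ec))
```

Trimming only the upper tail in an EC design removes the positions most likely to carry a true linkage signal while keeping all the positions that inform the negative intercept.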

Discussion
Under two basic error models, we were able to predict quantitatively the consequences of genotyping error on inference in linkage analysis. In the idealized situation of complete IBD information, both error models have the same impact on linkage analysis. As we have seen, the effect is due to a decrease in IBD sharing. Conversely, an error process which would increase IBD sharing would produce opposite results. The true error processes involved in practice are complicated mixtures of the models alluded to here. In our experience however, it seems that processes which lower IBD sharing are predominant. Because genotyping error tends to decrease the estimated number of alleles shared IBD, the effect on evidence for linkage is opposite in EC (reduced power) and ED (increased type I error) designs; it can be dramatic in typical designs and, paradoxically, less severe for more extreme ascertainment schemes. By analogy, for a dichotomous trait, this means that the effect of genotyping error is less severe in ASP designs for rare diseases than for common diseases. Remarkably, in designs combining both ED and EC pairs, like the extreme discordant and concordant (EDAC) designs, the competing effects of genotyping error tend to cancel each other out. We have considered here only three types of basic selection schemes; however, the approach can be straightforwardly applied to any arbitrary selection scheme. Under the widely accepted variance components model, the important quantity which determines bias, type I error and power is A, and it can be easily estimated by Monte Carlo simulations. Note that the bias is proportional to the error rate, so that Equation (4) can easily be adapted to different error rates than those considered in Table 2.
Our study used an idealized model where IBD information is assumed to be complete. In practice, IBD is uncertain and it is inferred using marker data and multipoint algorithms as implemented in publicly available software [16,17]; the general effect is to shrink the IBD estimate towards 0.5. The linkage regression (2) is changed into $E(\hat{\pi} - 1/2 \mid x) \approx \gamma\, \mathrm{var}(\hat{\pi})\, C$, where $\mathrm{var}(\hat{\pi})$ can be either estimated from the data or by simulations. The effect of genotyping error is again mediated via the shift of the intercept in this regression, but no general formula can be obtained because it depends in a very complex manner on the whole marker map configuration. Nevertheless, we can quantify this shift under realistic scenarios and compare it to its theoretical value when IBD information is complete. We simulated two different marker maps in 1 million sib pairs without parents and quantified by how much IBD sharing was reduced on average under the population frequency error model (error rate = 0.01). The microsatellite map (MS) had 13 equifrequent ten-allele markers (heterozygosity = 90%) located 10 cM apart (spanning the 0-120 cM chromosomal region) and the SNP map had 41 equifrequent SNPs (heterozygosity = 50%) spanning the 50-70 cM chromosomal region (this smaller region was chosen to keep simulation time acceptable). The resulting average IBD sharing for an error rate of 0.01 was measured every 2 cM in the 50-70 cM region; it ranged from 0.4974 to 0.4976 in the MS map and from 0.4945 to 0.4955 in the SNP map. For these two maps, which mimic the two most widespread genotyping paradigms nowadays, those simulations confirm results derived under the complete marker information assumption, with a reduction in IBD sharing from 0.5 to 0.5 - 0.01/4 = 0.4975. Our results therefore appear to be applicable to real-life situations where IBD information is incomplete.
The genomic-control strategy that we have proposed, although triggered by the specific issue of genotyping error, potentially offers a general robust method for carrying out linkage analysis. It is nonetheless important to recognize its limitations. Firstly, if the trait is highly polygenic with contributing genes scattered across the genome, the high correlation between linkage positions will make it impossible to estimate the IBD sharing at null positions. The genomic control strategy should therefore only be considered with oligogenic traits. Secondly, the concept of genomic control relies on the assumption that the genotyping error rates are similar across markers. For markers with a similar degree of polymorphism (number of alleles and frequencies), this assumption might be acceptable. In a multipoint setting, an additional assumption required to ensure the validity of a genomic control strategy is that inter-marker distances be approximately equal. With microsatellite markers, both these assumptions might fail, resulting in differences in the IBD sharing reduction across markers. The 'regression-based linkage testing' view allows one to qualitatively assess how deviation from these assumptions will impact linkage testing. For example, in ASP or EC designs, wrongly assuming that IBD is uniformly reduced across markers will result in inflated type I error at marker positions with low genotyping error rate compared to other markers. The advent of SNP chips in linkage mapping holds the promise of regular marker maps with less variable information content than in classical microsatellite maps [2,3]. The many SNPs used are likely to be subject to similar genotyping error processes; this makes the critical assumption of the genomic control strategy all the more plausible.
Alternatives to this genomic-control strategy are possible. They also consist in constraining the linkage regression through a new origin, as in the ad-hoc method, but the estimation procedure can be adapted to suit particular circumstances. Firstly, in random samples, the assumption regarding exchangeability of positions might be relaxed. Indeed, the reduction in IBD sharing at each position may be used as an estimate of the position-specific intercept (a study sufficiently powered to detect linkage in random samples should have a huge sample size, which would ensure sufficient precision of the position-specific intercepts). However, it must be stressed that the advantage of using a genomic control in random samples is limited because the impact of genotyping error is small in such designs. Secondly, one could use previous lab data to estimate by how much IBD sharing deviates from its expected value; this could also be done at each position separately provided sufficient data are available. In practice, such data might not be available or they might not faithfully reflect current error mechanisms.
Elston et al. [18] have pointed out that the implicit assumption made in ASP designs, that randomly sampled sib pairs share half of their alleles IBD, might not hold in practice and have argued for including discordant pairs in such studies. The genomic control approach suggested here may be an alternative solution to this issue. Finally we note that, although we have only considered designs involving sib pairs, the approach naturally extends to other types of relative pairs.

Conclusion
Under realistic genotyping error scenarios, power losses observed in extremely concordant designs are modest but the effect on type I error in extremely discordant designs can be dramatic. Our analytic approach provides some understanding of the differences in influence of genotyping errors across study designs. The advent of SNP arrays does not eliminate the impact of genotyping errors but it makes genomic control a feasible option with the potential to deliver more robust inference in linkage analysis data subject to genotyping errors or other mechanisms distorting the IBD signal.