Precision and type I error rate in the presence of genotype errors and missing parental data: a comparison between the original transmission disequilibrium test (TDT) and TDTae statistics
BMC Genetics volume 6, Article number: S150 (2005)
Two factors impacting robustness of the original transmission disequilibrium test (TDT) are: i) missing parental genotypes and ii) undetected genotype errors. While it is known that independently these factors can inflate false-positive rates for the original TDT, no study has considered either the joint impact of these factors on false-positive rates or the precision score of TDT statistics regarding these factors. By precision score, we mean the absolute difference between disease gene position and the position of markers whose TDT statistic exceeds some threshold.
We apply our transmission disequilibrium test allowing for errors (TDTae) and the original TDT to phenotype and modified single-nucleotide polymorphism genotype simulation data from Genetic Analysis Workshop 14. We modify genotype data by randomly introducing genotype errors and removing a percentage of parental genotype data. We compute empirical distributions of each statistic's precision score for a chromosome harboring a simulated disease locus. We also consider inflation in type I error by studying markers on a chromosome harboring no disease locus.
The TDTae shows median precision scores of approximately 13 cM, 2 cM, 0 cM, and 0 cM at the 5%, 1%, 0.1%, and 0.01% significance levels, respectively. By contrast, the original TDT shows median precision scores of approximately 23 cM, 21 cM, 15 cM, and 7 cM at the corresponding significance levels. For null chromosomes, the original TDT falsely rejects the null hypothesis for 28.8%, 14.8%, 5.4%, and 1.7% of markers at the 5%, 1%, 0.1%, and 0.01% significance levels, respectively, while the TDTae maintains the correct false-positive rate.
Because missing parental genotypes and undetected genotype errors are unknown to the investigator, but are expected to be increasingly prevalent in multilocus datasets, we strongly recommend TDTae methods as a standard procedure, particularly where stricter significance levels are required.
One of the most widely used family-based tests of linkage in the presence of association is the original transmission disequilibrium test (TDT) statistic. There are two principal limitations regarding the robustness of this original statistic: i) missing parental genotype data; and ii) undetected genotyping errors.
It has been shown [2–4] that both factors may cause an increase in the type I error rate of the statistic, thereby inflating the false-positive rate among the reported linkages. However, no studies to date have quantified the impact that both factors jointly have on inflation of type I error for the original TDT statistic. We designed the TDTae statistic [4, 5] to address these factors. The TDTae is a likelihood ratio test of linkage in the presence of association for general pedigrees. Simulation studies suggest that the TDTae statistic is robust (in terms of maintaining correct type I error) to the presence of these factors. The TDTae maximizes the likelihood of the data over the genotypic relative risk parameters (R1, R2), population genotype frequencies, and error model parameters under the null hypothesis that the genotypic relative risks are equal to 1 (R1 = R2 = 1) and under the alternative hypothesis that at least one of the genotypic relative risks is not equal to 1 (R1 ≠ 1 or R2 ≠ 1).
The genotype relative risks for a di-allelic locus with wild-type allele + and disease allele d are defined as:
R1 = Pr(affected|+d)/ Pr(affected|++)
R2 = Pr(affected|dd)/ Pr(affected|++).
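The two ratios above can be illustrated with a minimal Python sketch. The function names and the penetrance values in the usage note are hypothetical and chosen only for illustration; they are not taken from the GAW14 generating model.

```python
def genotypic_relative_risks(pen_wt_wt, pen_wt_d, pen_d_d):
    """Return (R1, R2) from the penetrances Pr(affected | genotype).

    pen_wt_wt: Pr(affected | ++), the reference genotype
    pen_wt_d:  Pr(affected | +d)
    pen_d_d:   Pr(affected | dd)
    """
    if pen_wt_wt <= 0:
        raise ValueError("Pr(affected | ++) must be positive")
    r1 = pen_wt_d / pen_wt_wt
    r2 = pen_d_d / pen_wt_wt
    return r1, r2


def is_null_model(r1, r2, tol=1e-12):
    """True when R1 = R2 = 1, i.e. the null hypothesis described in the text."""
    return abs(r1 - 1.0) <= tol and abs(r2 - 1.0) <= tol
```

With assumed penetrances of 0.05, 0.10, and 0.20 for ++, +d, and dd, this gives R1 = 2 and R2 = 4, which falls under the alternative hypothesis; equal penetrances for all three genotypes give R1 = R2 = 1.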
An important, but overlooked, question concerns the robustness of the precision of TDT statistics in the presence of missing parental genotype data and undetected genotyping error. From this point forward, we shall define precision score as the absolute distance between the location of the disease gene and that marker whose TDT statistic is significant at some pre-specified significance level threshold. While this question has been addressed previously in linkage studies [8, 9] (for factors such as sibling relative risk and locus heterogeneity), it has not, to date, been considered for TDT statistics, particularly in the presence of the two above-mentioned factors.
The datasets we considered for all analyses were the simulated trait data sets from Genetic Analysis Workshop 14 (GAW14). We defined as affected those individuals whose affection status was coded as affected (a value of 2 in the affection status column of the phenotypic data file) and for whom phenotype E was present (phenotype E = 1 in the phenotypic data file).
We considered only the three subpopulations consisting of nuclear families. That is, we excluded the New York pedigrees from our analyses, because our TDTae method required significantly more computational time to perform analysis of even one replicate of the New York data. We had knowledge of the true model at the time of the analysis.
Genotype data modification
All replicates were modified by randomly removing 10% of the parental genotypes and introducing errors into the remaining genotypes using the Sobel-Papp-Lange (SPL) error model with the penetrances specified in Table 1. In this table, coded genotype 1 refers to the 11 homozygote, coded genotype 2 to the 12 heterozygote, and coded genotype 3 to the 22 homozygote. Using this table, we observe, for example, that the homozygote 11 had a 1% probability of being misclassified as the heterozygote 12, and vice versa.
We performed both type I error and precision score analyses by computing the original TDT and the TDTae statistics. The basic unit considered for the analysis within the nuclear families was a trio with two genotyped parents and at least one affected child. For the original TDT, those trios with only one genotyped parent were ignored. Also, the original TDT statistic analyzed only those trios that showed Mendelian consistency for a given marker, while the TDTae statistic analyzed all trios with an affected child. An example of a family analyzed by each statistic is provided in Figure 1.
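The modification step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program used in the study: the `ERROR_MODEL` values are hypothetical except for the 1% exchange between genotypes 11 and 12 mentioned in the text, and each parental genotype is removed independently with probability 0.10, which removes 10% only in expectation.

```python
import random

# Hypothetical misclassification probabilities Pr(observed | true genotype).
# Only the 1% exchange between coded genotypes 1 (11) and 2 (12) comes from
# the text; every other entry is an assumption made for this sketch.
ERROR_MODEL = {
    1: {1: 0.99, 2: 0.01, 3: 0.00},
    2: {1: 0.01, 2: 0.98, 3: 0.01},
    3: {1: 0.00, 2: 0.01, 3: 0.99},
}


def modify_genotypes(genotypes, is_parent, missing_rate=0.10, rng=None):
    """Return a modified copy of `genotypes` (coded 1, 2, 3).

    Parental genotypes are set to None with probability `missing_rate`;
    every remaining genotype is redrawn from its row of ERROR_MODEL.
    """
    rng = rng or random.Random()
    modified = []
    for genotype, parent in zip(genotypes, is_parent):
        if parent and rng.random() < missing_rate:
            modified.append(None)  # parental genotype removed
            continue
        row = ERROR_MODEL[genotype]
        observed = rng.choices(list(row), weights=list(row.values()))[0]
        modified.append(observed)
    return modified
```

Passing a seeded `random.Random` instance as `rng` makes a modified replicate reproducible.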
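For readers unfamiliar with the original TDT, the following Python sketch computes the standard McNemar-type statistic (b - c)^2 / (b + c) from trios, applying the two filters described above: trios with an ungenotyped parent are ignored, and Mendelian-inconsistent trios are skipped. The function name and the trio representation are assumptions for this illustration; the sketch does not implement the TDTae likelihood ratio test.

```python
import math


def tdt_statistic(trios):
    """Original TDT for one di-allelic marker.

    trios: iterable of (father, mother, affected_child) genotypes coded
           1 = 11, 2 = 12, 3 = 22, with None for a missing genotype.
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    b = c = 0  # allele 2 / allele 1 transmissions from heterozygous parents
    for father, mother, child in trios:
        if father is None or mother is None or child is None:
            continue  # the original TDT ignores incompletely genotyped trios
        parents = (father, mother)
        n_het = sum(1 for g in parents if g == 2)
        n_hom2 = sum(1 for g in parents if g == 3)
        # Copies of allele 2 in the child that must come from heterozygotes.
        from_het = (child - 1) - n_hom2
        if from_het < 0 or from_het > n_het:
            continue  # Mendelian inconsistency: trio not analyzed
        b += from_het
        c += n_het - from_het
    if b + c == 0:
        return 0.0, 1.0  # no informative transmissions
    chi_square = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi_square / 2))  # chi-square tail, 1 df
    return chi_square, p_value
```

For example, 15 transmissions and 5 non-transmissions of allele 2 from heterozygous parents give a statistic of (15 - 5)^2 / 20 = 5.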
After computing both statistics for all markers in each of the 100 replicates for either the type I error or precision score study, we selected the subset of markers over all the replicates with p-values less than 0.05 (5%), 0.01 (1%), 0.001 (0.1%), and 0.0001 (0.01%).
Empirical type I error rate
We modified the simulated dataset of 95 single-nucleotide polymorphism (SNP) markers from chromosome 7, which does not harbor any disease locus, as specified above (Genotype data modification) to compute the empirical type I error rate in the presence of the aforementioned factors. We defined the empirical type I error rate for each statistic (TDT or TDTae) at a given threshold (5%, 1%, 0.1%, and 0.01%) as the proportion of the 9,500 markers (95 markers in each of 100 replicates) that showed p-values less than 0.05, 0.01, 0.001, or 0.0001, respectively.
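The definition above amounts to pooling all marker p-values across replicates and counting the fraction below each threshold. A minimal Python sketch, with a hypothetical function name and input layout (one list of p-values per replicate):

```python
THRESHOLDS = (0.05, 0.01, 0.001, 0.0001)


def empirical_type1_error(p_values_by_replicate, thresholds=THRESHOLDS):
    """Proportion of pooled null-marker p-values below each threshold.

    p_values_by_replicate: one sequence of marker p-values per replicate,
    e.g. 100 replicates of 95 markers each, giving 9,500 pooled tests.
    """
    pooled = [p for replicate in p_values_by_replicate for p in replicate]
    if not pooled:
        raise ValueError("no p-values supplied")
    return {t: sum(p < t for p in pooled) / len(pooled) for t in thresholds}
```

For a well-calibrated statistic applied to null markers, each returned proportion should be close to its threshold.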
Precision score study
For the precision score study we considered the SNP trait locus at the end of chromosome 3 and 49 SNPs to one side of it (an average intermarker distance of 3 cM; see Figure 2). As above, we modified the genotype data (Genotype data modification) and analyzed the 100 replicates, each one consisting of 50 SNP markers on chromosome 3 for all nuclear families across the three subpopulations.
To determine the precision score, we first computed the distance (in marker units) from the trait locus to a marker that showed a significant p-value for a given statistic at a given threshold significance level. The distance is the absolute difference between the trait locus position and the marker's position. For example, if marker 35 in a replicate had a p-value less than the threshold for a given statistic, its distance to the trait locus is |50 - 35| = 15, and therefore its precision score for that significance level is 15.
We computed the distribution of the precision score for all significance levels with each statistic by considering various percentiles (minimum, first quartile, median, third quartile and maximum).
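The precision score and its summary can be sketched in Python as follows. The function names are hypothetical, and the quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption; the text does not state which convention was used.

```python
import statistics


def precision_scores(significant_markers, trait_position=50):
    """Absolute distance, in marker units, from each significant marker
    to the trait locus (marker 50 in the chromosome 3 design)."""
    return [abs(trait_position - marker) for marker in significant_markers]


def five_number_summary(scores):
    """Minimum, first quartile, median, third quartile, and maximum."""
    if len(scores) < 2:
        raise ValueError("need at least two precision scores")
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"min": min(scores), "q1": q1, "median": median,
            "q3": q3, "max": max(scores)}
```

For example, significant markers 35, 50, and 48 yield precision scores 15, 0, and 2, with a median of 2.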
Empirical type I error rates
Table 2 shows the results of the empirical type I error rates for each statistic. The TDT and TDTae columns report the proportion of markers, over all replicates, for which the p-values were less than the value x/100, where x is the significance level in percent. The values reported in parentheses are the lower and upper end points of the 95% confidence intervals computed using the method implemented in the BINOM program (http://linkage.rockefeller.edu).
Table 2 shows that the original TDT has appreciable inflation in type I error rates at all significance levels. Furthermore, the relative inflation increases as the significance level becomes more stringent. For example, there is an approximately 6-fold increase in the type I error at the 5% significance level (28.8/5), while there is a 170-fold increase at the 0.01% significance level (1.7/0.01).
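As an illustration of how a 95% confidence interval for such a proportion can be obtained, the sketch below uses the Wilson score interval. This is a substituted technique chosen for its simplicity: the method implemented in the BINOM program is not described in the text, so the intervals produced here need not match those in Table 2.

```python
import math


def wilson_interval(successes, trials, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default: 95%).

    Note: an assumed stand-in, not necessarily the BINOM program's method.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 9,500 pooled tests, the interval around an observed rate near 28.8% is less than two percentage points wide, so inflation of the size reported for the original TDT cannot be attributed to sampling variation.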
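The fold increases quoted above are the ratio of the observed to the nominal rejection rate. A short Python sketch using the four original TDT rates reported in this paper (the dictionary layout and function name are illustrative only):

```python
# Nominal significance level (%) -> observed rejection rate (%) for the
# original TDT on the null chromosome, as reported in the text.
OBSERVED_TDT_RATES = {5.0: 28.8, 1.0: 14.8, 0.1: 5.4, 0.01: 1.7}


def fold_inflation(observed_by_nominal):
    """Ratio of observed to nominal type I error rate at each level."""
    return {nominal: observed / nominal
            for nominal, observed in observed_by_nominal.items()}
```

The resulting ratios are approximately 5.8, 14.8, 54, and 170 at the 5%, 1%, 0.1%, and 0.01% levels, which makes the growth of the inflation with stringency explicit.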
It is important to note that this inflation is for data with relatively small genotype error rates. We suspect that there is a compounding effect of the type I error inflation for the original TDT when a dataset contains both genotype errors and missing parental genotypes.
Precision score study
Based on the median results for each distribution (Table 2; 1% significance level), half of the significant markers for the original TDT were located at a distance of at least 23, 21, 15, or 7 units from the trait locus at the 5%, 1%, 0.1%, and 0.01% significance levels, respectively. By contrast, at least half of the significant markers for the TDTae were at distance no more than 13, 2, 0, or 0 units from the trait locus.
The results of our analysis on the simulated data suggest that when the alternative hypothesis is true, the TDTae statistic may be a more precise indicator of the trait location than the original TDT statistic in the presence of missing parental data and genotyping errors.
Regarding the empirical type I error rate, we have shown that the TDTae statistic is able to maintain proper type I error rate in the presence of errors for these simulated datasets. The original TDT statistic shows a highly inflated false-positive rate when there are missing parental genotypes and random genotyping errors in the dataset. The results of our simulations suggest that the inflation in type I error increases as the significance level becomes more stringent.
These results have significant consequences for TDT analyses with large numbers of markers, for example, studies using microarray technologies. Many of the genotype errors will not be detected, potentially inflating type I error for the original TDT statistic. Because: i) more stringent significance levels are needed to correct for the multiple testing issue; and ii) we observe (Table 2) that type I error inflation is more severe as the significance level becomes more stringent, we strongly recommend that researchers performing TDT analyses on large numbers of markers use methods [13, 14] like the TDTae that incorporate genotype errors into the analysis. Software for our method is available at: ftp://linkage.rockefeller.edu/software/tdtae2/.
We strongly recommend that researchers apply TDTae methods when missing parental genotypes or undetected genotype errors are present and when the number of markers is large. We reason that when more markers are tested, more stringent significance levels are required to correct for multiple testing issues. However, the false-positive rate increases disproportionately for original TDT methods as the significance level becomes more stringent, while our work here suggests that the TDTae maintains proper type I error rates in the presence of missing parental genotype data and genotype errors, even for more stringent significance levels.
GAW14: Genetic Analysis Workshop 14
TDTae: Transmission disequilibrium test allowing for errors
Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.
Curtis D, Sham PC: A note on the application of the transmission disequilibrium test when a parent is missing. Am J Hum Genet. 1995, 56: 811-812.
Mitchell AA, Cutler DJ, Chakravarti A: Undetected genotyping errors cause apparent overtransmission of common alleles in the transmission/disequilibrium test. Am J Hum Genet. 2003, 72: 598-610. 10.1086/368203.
Gordon D, Heath SC, Liu X, Ott J: A transmission/disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. Am J Hum Genet. 2001, 69: 371-380. 10.1086/321981.
Gordon D, Haynes C, Johnnidis C, Patel SB, Bowcock AM, Ott J: A transmission disequilibrium test for general pedigrees that is robust to the presence of random genotyping errors and any number of untyped parents. Eur J Hum Genet. 2004, 12: 752-761. 10.1038/sj.ejhg.5201219.
Weinberg CR: Allowing for missing parents in genetic studies of case-parent triads. Am J Hum Genet. 1999, 64: 1186-1193. 10.1086/302337.
Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet. 1993, 53: 1114-1126.
Cordell HJ: Sample size requirements to control for stochastic variation in magnitude and location of allele-sharing linkage statistics in affected sibling pairs. Ann Hum Genet. 2001, 65: 491-502. 10.1046/j.1469-1809.2001.6550491.x.
Finch SJ, Chen CH, Gordon D, Mendell NR: A study comparing the precision of the maximum heterogeneity LOD statistic to two model free linkage methods. Genet Epidemiol. 2001, 21: 315-325. 10.1002/gepi.1037.
Sobel E, Papp JC, Lange K: Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet. 2002, 70: 496-508. 10.1086/338920.
Matsuzaki H, Loi H, Dong S, Tsai YY, Fang J, Law J, Di X, Liu WM, Yang G, Liu G, Huang J, Kennedy GC, Ryder TB, Marcus GA, Walsh PS, Shriver MD, Puck JM, Jones KW, Mei R: Parallel genotyping of over 10,000 SNPs using a one-primer assay on a high-density oligonucleotide array. Genome Res. 2004, 14: 414-425. 10.1101/gr.2014904.
Gordon D, Heath SC, Ott J: True pedigree errors more frequent than apparent errors for single nucleotide polymorphisms. Hum Hered. 1999, 49: 65-70. 10.1159/000022846.
Bernardinelli L, Berzuini C, Seaman S, Holmans P: Bayesian trio models for association in the presence of genotyping errors. Genet Epidemiol. 2004, 26: 70-80. 10.1002/gepi.10291.
Morris RW, Kaplan NL: Testing for association with a case-parents design in the presence of genotyping errors. Genet Epidemiol. 2004, 26: 142-154. 10.1002/gepi.10297.
Grant acknowledgements: NIH-K01-HG00055, NIH-MH44292
SB performed all statistical analyses and wrote the majority of the manuscript. CH developed the computer programs to introduce the errors and remove parental genotype data. MAL developed the ideas for the precision study and wrote a portion of the Results section. DG proposed the research for the GAW14 dataset, supervised all the research, and reviewed all versions of the manuscript for scientific content and grammar.
Barral, S., Haynes, C., Levenstien, M.A. et al. Precision and type I error rate in the presence of genotype errors and missing parental data: a comparison between the original transmission disequilibrium test (TDT) and TDTae statistics. BMC Genet 6, S150 (2005) doi:10.1186/1471-2156-6-S1-S150
- Parental Genotype
- Genotype Error
- Transmission Disequilibrium Test
- Genetic Analysis Workshop
- Precision Score