Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Impact of nonignorable missingness on genetic tests of linkage and/or association using case-parent trios
 Chao-Yu Guo^{1, 4},
 Jing Cui^{2} and
 L Adrienne Cupples^{3}
DOI: 10.1186/1471-2156-6-S1-S90
© Guo et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
The transmission/disequilibrium test was introduced to test for linkage disequilibrium between a marker and a putative disease locus using case-parent trios. However, parental genotypes may be incomplete in such a study. When parental information is nonrandomly missing, due, for example, to death from the disease under study, the impact on type I error and power under dominant and recessive disease models has been reported. In this paper, we examine nonignorable missingness by assigning missing values to the genotypes of affected parents. We used unrelated case-parent trios in the Genetic Analysis Workshop 14 simulated data for the Danacaa population. Our computer simulations revealed that the type I error of these tests using incomplete trios was not inflated over the nominal level under either recessive or dominant disease models. However, the power of these tests appears to be inflated over the complete-information case due to an excess of heterozygous parents in dyads.
Background
When parental genotypes are missing at random (MAR) in case-parent trio studies, Clayton [1] and Weinberg [2] suggested a partial-score test and a likelihood ratio test, respectively, to deal with such data. Under the same situation, Sun et al. [3] introduced the 1-TDT, a transmission/disequilibrium test (TDT)-type test based on a set of noniterative estimates of the genotype relative risk (GRR) [4]. Recently, the expectation maximization-haplotype relative risk (EM-HRR) proposed by Guo et al. [5] extended the HRR test [6] to accommodate trios with one or no parental genotypes, and it outperforms the 1-TDT in a homogeneous population. However, these tests may be invalid when the MAR assumption is violated, that is, when missingness is nonignorable because, for example, the pattern of missing parental genotypes is related to the disease under study.
To assure a valid test for association between a marker and a putative disease locus under nonignorable missingness (NIM), Allen et al. [7] introduced a testing procedure based on the joint likelihood of the genotypes of the proband and the observed parents, conditioning on the proband's phenotype and parental missingness pattern. Still, the validity of their method under population stratification is not guaranteed, because it depends on whether the missingness model is suitably specified or not. Therefore, Chen [8] proposed another TDT-type approach based on the conditional likelihood of the proband's genotype given the number and, if any, genotypes of available parents, as well as the proband's phenotype, to assure the validity of testing for association between a candidate gene and a disease.
The cost of accounting for NIM is a loss of power under MAR: as indicated by Allen et al. [7], their procedure is less powerful than the 1-TDT in that setting. Their results also suggested that, under NIM, the 1-TDT performs better than the tests proposed by Clayton [1] and Weinberg [2], because the type I error of the 1-TDT is less inflated over the nominal level. In addition, the 1-TDT is a valid test if the NIM is a result of population stratification, while the methods of Clayton [1] and Weinberg [2] are not. Hence, the 1-TDT is preferred among those tests for incomplete trios that require the MAR assumption. Because the relative performance of the 1-TDT and EM-HRR under NIM is unknown, we examined the performance of the two tests using Genetic Analysis Workshop 14 (GAW14) simulated data.
Methods
EM-HRR
First consider a diallelic marker with alleles B_{1} and B_{2}. The observed count for each type of trio is indexed by k, i, and j, where k = 0, 1, or 2 is the total number of B_{1} alleles transmitted to the offspring, and i, j = 0, 1, or 2 are the total numbers of B_{1} alleles carried by the father and mother, respectively. Note that the superscript * is used when the parental genotype is missing. Curtis and Sham [9] showed that bias in estimating the probability of transmission of certain alleles is introduced if dyad families in which a heterozygous affected child has a heterozygous parent are excluded. For simplicity, the same notation is used for these dyad families whether the father or the mother is missing, because we assume no difference according to the sex of the parent. To avoid such bias, Guo et al. [5] applied the EM algorithm to estimate, among these families, the proportion of heterozygous parents transmitting B_{1} and not B_{2} and the proportion transmitting B_{2} and not B_{1}. The details of the EM procedure are available in Guo et al. [5].
The HRR compares parental marker alleles transmitted to an affected child to those not transmitted. One feature of the HRR for dealing with such trio-type family data is that the affected children's genotypes are always known (assuming no genotyping failure) due to ascertainment procedures in which data from an affected individual are collected first and then those of his/her parents. Hence, in the case group, the two transmitted alleles of all affected children are known and can be used in the analysis, even when both parents' genotypes are not available.
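For complete trios, the four allele counts that the HRR compares can be tallied directly from the genotypes. The following is a minimal sketch under stated assumptions: genotypes are coded as the number of B_{1} alleles (0, 1, or 2), and the function name `hrr_counts_complete` is illustrative rather than taken from the original analysis.

```python
def hrr_counts_complete(trios):
    """Tally HRR allele counts over complete trios.

    Each trio is (father, mother, child), with every genotype coded as
    the number of B1 alleles carried (0, 1, or 2). This coding is an
    assumption made for illustration.
    Returns (U, V, W, X): transmitted B1, nontransmitted B1,
    transmitted B2, and nontransmitted B2 allele counts.
    """
    U = V = W = X = 0
    for father, mother, child in trios:
        # The child receives one allele from each parent, which bounds
        # the number of B1 alleles the child can carry.
        fewest = (father == 2) + (mother == 2)
        most = (father >= 1) + (mother >= 1)
        if not fewest <= child <= most:
            raise ValueError(f"Mendelian inconsistency: {(father, mother, child)}")
        parental_b1 = father + mother          # B1 among the 4 parental alleles
        U += child                             # the child's 2 alleles are the transmitted ones
        W += 2 - child
        V += parental_b1 - child               # parental B1 alleles not passed on
        X += (4 - parental_b1) - (2 - child)   # parental B2 alleles not passed on
    return U, V, W, X
```

Each complete trio contributes exactly two transmitted and two nontransmitted alleles, so the four counts always sum to four times the number of trios.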
Let U_{i}, V_{i}, W_{i}, and X_{i} represent the total number of transmitted B_{1} alleles, nontransmitted B_{1} alleles, transmitted B_{2} alleles, and nontransmitted B_{2} alleles from type i families, where i = 1 for complete trios, 2 for dyads (trios with one parental genotype available), and 3 for monads (trios without parental genotypes). Note that only two of these counts require the EM estimates; the rest can be uniquely determined without the EM algorithm. Both V_{3} and X_{3} are 0, because no parental genotypes are available to infer which alleles are not transmitted.
The EM-HRR is defined as [(U_{1} + U_{2}) × (X_{1} + X_{2})] / [(V_{1} + V_{2}) × (W_{1} + W_{2})] if type 3 families are excluded. If all ascertained families are used for analysis regardless of whether one or both parents are missing, the EM-HRR becomes [(U_{1} + U_{2} + U_{3}) × (X_{1} + X_{2})] / [(V_{1} + V_{2}) × (W_{1} + W_{2} + W_{3})]. Under the null hypothesis of no linkage or no association, the EM-HRR is expected to be 1 and the test statistic follows a central chi-square distribution with 1 degree of freedom. Note that the approximation to Var(EM-HRR) takes one form if the type 3 families are excluded and another if all three types of families are used.
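The ratio defined above can be sketched in a few lines once the per-type counts are available. The example below is illustrative only: the helper names are hypothetical, the counts are assumed to be already computed (including any EM-estimated dyad counts), and the variance is a stand-in. It uses the standard large-sample variance of a log odds ratio (Woolf's formula), not the Var(EM-HRR) approximation of the original method.

```python
import math

def pool_counts(by_type, include_monads=False):
    """Sum (U, V, W, X) over family types 1 and 2, optionally type 3.

    by_type maps family type (1 = complete trios, 2 = dyads, 3 = monads)
    to a (U, V, W, X) tuple. Monads carry V = X = 0, so pooling them
    only adds to the transmitted counts U and W.
    """
    types = (1, 2, 3) if include_monads else (1, 2)
    return tuple(sum(by_type[t][k] for t in types) for k in range(4))

def em_hrr(U, V, W, X):
    """Ratio (U * X) / (V * W) of pooled allele counts."""
    return (U * X) / (V * W)

def wald_test(U, V, W, X):
    """1-df chi-square test of ratio = 1 on the log scale.

    ASSUMPTION: Woolf's log odds ratio variance is used as a stand-in;
    it is not the variance approximation of the EM-HRR itself.
    All four counts must be positive.
    """
    log_ratio = math.log(em_hrr(U, V, W, X))
    var_log = 1 / U + 1 / V + 1 / W + 1 / X
    chi2 = log_ratio ** 2 / var_log
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Because some dyad counts come from EM estimates, the inputs may be non-integer; the functions above accept floats for that reason.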
Simulations
One affected child was randomly selected in each nuclear family in order to maintain the independence among ascertained trios. Both dominant and recessive disease models were examined, since we used traits "b" (dominant) and "l" (recessive) for ascertainment. For trait b, SNPs C01R0052 and C01R0001 were used in power and type I error simulations, respectively. Similarly, SNPs C09R0765 and C09R0850 were used for trait l (several loci were examined with similar results but not shown here). Based on resampling of the 100 replicates provided, 10 replicates in the Danacaa (DA) population were randomly selected for each simulation. A total of 1,000 simulations were conducted for power and type I error comparisons.
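The resampling design above reduces to two small steps: drawing replicates for each simulation, and reporting the percentage of simulations in which a test rejects. The sketch below is a hedged illustration, not the original simulation code. It assumes sampling without replacement and a nominal level of 0.05, neither of which is stated explicitly here, and the function names are hypothetical.

```python
import random

def draw_replicates(rng, n_available=100, n_draw=10):
    """Pick 1-based replicate indices for one simulation.

    ASSUMPTION: replicates are drawn without replacement.
    """
    return sorted(rng.sample(range(1, n_available + 1), n_draw))

def rejection_rate_percent(p_values, alpha=0.05):
    """Percentage of simulations in which the test rejects at level alpha.

    At a locus with no effect this estimates the type I error;
    at a locus with an effect it estimates power.
    ASSUMPTION: alpha = 0.05 is the nominal level.
    """
    rejections = sum(1 for p in p_values if p < alpha)
    return 100.0 * rejections / len(p_values)
```

In use, one p-value per simulation would be collected for each test and passed to `rejection_rate_percent`, giving one percentage per test and locus.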
The TDT and HRR were first applied to the complete trios. To illustrate the impact of NIM, we examined the extreme case by assigning parental genotypes to be missing if the parents were affected. We first determined the missing rate for parents in the NIM simulations (there was no difference in sex-specific missing rates), then generated a MAR dataset with an equal amount of missing data by randomly assigning parental genotypes to be missing according to that rate. The 1-TDT and EM-HRR were both applied to the subset of complete trios and dyads, but only the EM-HRR can accommodate monads under NIM and MAR.
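The two missingness mechanisms described above can be sketched as follows. This is an illustration under assumptions, not the original code: parental genotypes are held in a flat list, `None` marks a missing genotype, and the MAR mask hides exactly as many genotypes as the NIM mask so that the amounts of missing data match. The function names are hypothetical.

```python
import random

def mask_nim(genotypes, affected):
    """Nonignorable missingness: hide the genotype of every affected parent."""
    return [None if is_affected else g
            for g, is_affected in zip(genotypes, affected)]

def mask_mar(genotypes, n_missing, rng):
    """Missing at random: hide n_missing genotypes chosen uniformly.

    ASSUMPTION: the count is matched exactly to the NIM dataset,
    rather than masking each parent independently at the NIM rate.
    """
    hidden = set(rng.sample(range(len(genotypes)), n_missing))
    return [None if i in hidden else g for i, g in enumerate(genotypes)]
```

A matched pair of datasets would be built by calling `mask_nim` first, counting the `None` entries, and passing that count to `mask_mar`.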
Results
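Table 1. Recessive trait (L): 120 trios on average. Power was evaluated at locus C09R0765 and type I error at locus C09R0850. TDT_{trad} and HRR_{trad} use the complete data, so a single value is reported for each.

| Test | Power %, NIM | Power %, MAR | Type I error %, NIM | Type I error %, MAR |
|---|---|---|---|---|
| TDT_{trad} | 34.2 | | 3.1 | |
| HRR_{trad} | 34.2 | | 3.0 | |
| TDT_{comp} | 33.7 | 27.1 | 4.7 | 3.1 |
| HRR_{comp} | 33.6 | 27.1 | 5.4 | 3.0 |
| 1-TDT | 39.5 | 31.7 | 4.5 | 2.9 |
| EM-HRR_{dyads} | 44.2 | 31.9 | 3.9 | 3.1 |
| EM-HRR_{all} | 42.3 | 32.1 | 3.3 | 3.3 |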
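Table 2. Dominant trait (B): 750 trios on average. Power was evaluated at locus C01R0052 and type I error at locus C01R0001. TDT_{trad} and HRR_{trad} use the complete data, so a single value is reported for each.

| Test | Power %, NIM | Power %, MAR | Type I error %, NIM | Type I error %, MAR |
|---|---|---|---|---|
| TDT_{trad} | 16.2 | | 5.5 | |
| HRR_{trad} | 16.4 | | 5.3 | |
| TDT_{comp} | 11.5 | 9.8 | 4.6 | 4.7 |
| HRR_{comp} | 11.4 | 10.0 | 4.4 | 4.6 |
| 1-TDT | 20.4 | 12.5 | 5.1 | 4.8 |
| EM-HRR_{dyads} | 23.5 | 14.1 | 4.3 | 4.4 |
| EM-HRR_{all} | 22.8 | 14.2 | 4.4 | 4.6 |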
As reported by Ewens et al. [11], the TDT and HRR perform similarly (comparable power in TDT_{trad} and HRR_{trad}, or TDT_{comp} and HRR_{comp}) in detecting linkage disequilibrium (LD) between a marker and a putative disease locus in a homogeneous population. In all the situations simulated, TDT_{comp} and HRR_{comp} have the lowest power, and the difference between TDT_{trad} and TDT_{comp} or HRR_{trad} and HRR_{comp} is the loss of power due to exclusion of incomplete trios. The increase from TDT_{comp} to 1-TDT or HRR_{comp} to EM-HRR_{dyads} represents a gain of power by including dyads. The difference between EM-HRR_{dyads} and EM-HRR_{all} indicates the gain or loss of power from additionally utilizing monads, which the 1-TDT cannot use. Because the transmitted alleles are always present in the HRR statistic regardless of whether one or two parental genotypes are missing, EM-HRR_{dyads} and EM-HRR_{all} are more powerful than the 1-TDT under both dominant and recessive disease models, regardless of MAR or NIM.
Under NIM, the probability distributions of monads changed more than those of dyads, which added noise to the EM-HRR statistic. As a consequence, we observed a loss of power in the EM-HRR due to the utilization of monads. Therefore, EM-HRR_{all} is more powerful than EM-HRR_{dyads} under MAR, but their performances are reversed when the missing pattern is informative.
When parental genotypes are missing nonrandomly due to a recessive disease, only homozygous parents with two copies of the disease allele will be missing, assuming there are no phenocopies. Therefore, the subset of complete trios or dyads has more heterozygous (informative) parents than under MAR. One can see that, under NIM, the loss of power from TDT_{trad} to TDT_{comp} or HRR_{trad} to HRR_{comp} is smaller than in the MAR situation. In addition, the EM procedure yields more informative transmissions because of the excess of heterozygous (informative) parents. Therefore, the power of the EM-HRR using the subset of complete trios and dyads is higher than that of the HRR using the complete dataset (Table 1). The results are similar under a dominant disease model, as seen in Table 2.
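The enrichment of heterozygous parents under NIM can be illustrated with a small calculation at the disease locus itself. The sketch below rests on simplifying assumptions that are not part of the simulations: a recessive disease with full penetrance and no phenocopies, random mating with Hardy-Weinberg proportions, and disease-allele frequency q. Under those assumptions, a parent who transmitted the disease allele D to an affected child is DD with probability q and Dd with probability 1 - q. At a marker that is only in LD with the disease locus, the effect would be weaker.

```python
def observed_het_fraction(q, mechanism):
    """Fraction of heterozygotes among parents with observed genotypes.

    q is the disease-allele frequency (0 < q < 1). All modeling
    choices here are illustrative assumptions.
    """
    # Genotype distribution of a parent of an affected child.
    prior = {"DD": q, "Dd": 1.0 - q}
    if mechanism == "NIM":
        # Every DD parent is affected, so every DD genotype is hidden.
        keep = {"DD": 0.0, "Dd": 1.0}
    elif mechanism == "MAR":
        # Same overall missing rate (q), but independent of genotype.
        keep = {"DD": 1.0 - q, "Dd": 1.0 - q}
    else:
        raise ValueError("mechanism must be 'NIM' or 'MAR'")
    observed = {g: prior[g] * keep[g] for g in prior}
    return observed["Dd"] / sum(observed.values())
```

For example, with q = 0.2 the observed parents are all heterozygous under NIM, while under MAR the heterozygous fraction stays at 1 - q = 0.8, even though the same share of parental genotypes is missing in both cases.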
Allen et al. [7] and Chen [8] showed that type I errors of MAR tests were inflated over the nominal level. Although our simulation results did not match theirs, we see informative changes in the type I errors. When parental genotypes are missing nonrandomly due to a dominant disease, parents with two copies of the normal allele are not affected, assuming no phenocopies. Hence, these types of parents will be more likely to be in the subset of complete trios and dyads. It is evident then that the loss of power from TDT_{trad} to TDT_{comp} or HRR_{trad} to HRR_{comp} is greater than the loss under a recessive disease model. This phenomenon also affects the type I error. Hence, the type I errors of HRR_{comp} and TDT_{comp} are smaller than those of TDT_{trad} and HRR_{trad} (Table 2), but for a recessive disease the results are reversed (Table 1), because the heterozygous parents are more likely to be in the subset of complete trios and dyads.
Conclusion
The HRR was the first family-based test for LD between a marker and a putative disease locus. Because the TDT performs better than the HRR under extreme admixture, the HRR is not as popular as the TDT. Due to the data structure of the HRR, the transmitted alleles are always present regardless of the absence of one or both parents. Therefore, the EM-HRR is more powerful than the 1-TDT when the population is under Hardy-Weinberg equilibrium or slightly admixed. Because there is no admixture in the DA population, we found that the EM-HRR is the more powerful test when parental genotypes are missing randomly, and the superiority of the HRR remains despite the impact of NIM. Because the use of affected children without parental genotypes does not improve the power of the EM-HRR with NIM, we recommend using the EM-HRR with the subsets of complete trios and one-parent data for testing LD between a marker and a putative disease locus when the missing data pattern is unknown.
Under a different mechanism of NIM and no phenocopies in the simulated Danacaa population, our results do not match those of the 1-TDT with inflated type I error reported by Allen et al. [7] and Chen [8]. Instead, we observed a different performance of MAR tests under NIM. Although it is easier to observe that the 1-TDT and EM-HRR_{dyads} are more powerful than TDT_{trad} and HRR_{trad} when their type I errors are inflated over the nominal level, our results suggest that in the GAW14 simulated data, parents with different genotypes are equally likely to be diseased under the null hypothesis, and that differential missing rates occur only under the alternative hypothesis.
Abbreviations
 EM: Expectation maximization
 GAW14: Genetic Analysis Workshop 14
 GRR: Genotype relative risk
 HRR: Haplotype relative risk
 LD: Linkage disequilibrium
 MAR: Missing at random
 NIM: Nonignorable missingness
 TDT: Transmission/disequilibrium test
References
 Clayton D: A generalization of the transmission/disequilibrium test for uncertain-haplotype transmission. Am J Hum Genet. 1999, 65: 1170-1177. 10.1086/302577.
 Weinberg CR: Allowing for missing parents in genetic studies of case-parent triads. Am J Hum Genet. 1999, 64: 1186-1193. 10.1086/302337.
 Sun F, Flanders W, Yang Q, Khoury J: Transmission disequilibrium test (TDT) when only one parent is available: The 1-TDT. Am J Epidemiol. 1999, 150: 97-104.
 Schaid DJ, Sommer SS: Genotype risk ratio: Methods for design and analysis of candidate-gene association studies. Am J Hum Genet. 1994, 55 (2): 402-9.
 Guo CY, DeStefano AL, Lunetta KL, Dupuis J, Cupples LA: Expectation maximization algorithm based haplotype relative risk (EM-HRR): Test of linkage disequilibrium using incomplete case-parent trios. Hum Hered. 2005, 59 (3): 125-35. 10.1159/000085571.
 Falk CT, Rubinstein P: Haplotype relative risks: An easy reliable way to construct a proper control sample for risk calculations. Ann Hum Genet. 1987, 51: 227-233.
 Allen AS, Rathouz PJ, Satten GA: Informative missingness in genetic association studies: case-parent designs. Am J Hum Genet. 2003, 72: 671-680. 10.1086/368276.
 Chen YH: New approach to association testing in case-parent designs under informative parental missingness. Genet Epidemiol. 2004, 27: 131-140. 10.1002/gepi.20004.
 Curtis DR, Sham PC: A note on the application of the transmission disequilibrium test when a parent is missing. Am J Hum Genet. 1995, 56: 811-812.
 Ewens WJ, Spielman RS: The transmission/disequilibrium test: history, subdivision and admixture. Am J Hum Genet. 1995, 57: 455-464.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.