Bootstrap calibration of TRANSMIT for informative missingness of parental genotype data
BMC Genetics volume 4, Article number: S39 (2003)
Abstract
Informative missingness of parental genotype data occurs when the genotype of a parent influences the probability of the parent's genotype data being observed. Informative missingness can occur in a number of plausible ways and can affect both the validity and power of procedures that assume the data are missing at random (MAR). We propose a bootstrap calibration of MAR procedures to account for informative missingness and apply our methodology to refine the approach implemented in the TRANSMIT program. We illustrate this approach by applying it to data on hypertensive probands and their parents who participated in the Framingham Heart Study.
Background
Missing parental genotype data are a common problem in association studies utilizing parental controls and have led to the development of a variety of approaches aimed at extending standard methods such as the transmission/disequilibrium test (TDT) to allow for a parent's genotype to be missing [1–4]. A common assumption made in such approaches [1, 2] is that parental missingness is not related to the underlying, unobserved genotype of the missing parent. This assumption is referred to as noninformative missingness or missingness at random (MAR) and allows reconstruction of the missing parent's genotype using the genotype frequencies among the observed parents and constraints imposed by spouse and offspring genotypes. Informative missingness, on the other hand, occurs when a parent's missingness is related to his or her genotype at the locus of interest. In this case, the distribution of genotypes in missing parents cannot be immediately constructed from the distribution of genotypes among parents in intact trios. Therefore, when informative missingness is present, procedures that assume MAR will tend to reconstruct the genotypes of missing parents incorrectly, leading to biased results.
Informative missingness can occur in several ways. First, the parent's genotype may in fact be associated with the disease of interest, which in turn, if manifest in the parent, may cause or influence missingness. Second, the genotype may be associated with a different disease that results in parental missingness. Finally, informative missingness can arise strictly as an artifact of population stratification. Consider a population comprising a number of subpopulations with varying allele frequencies and assume that the probability of a parent being missing also varies across these subpopulations. Even though there may be no relationship between missingness and parental genotype in any given subpopulation, in the overall population the two may be correlated, resulting in informative missingness. Note that the second and third situations described above affect the null distribution of parental-controlled association tests and can, as a result, lead to invalid inference. The first situation only affects the alternative hypothesis and hence only affects power.
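The stratification argument can be checked with a small calculation. The sketch below is illustrative only: the two subpopulations, their weights, allele frequencies, and missingness probabilities are made-up numbers, and the function name `missingness_by_allele` is hypothetical. For simplicity it considers a single randomly drawn parental allele rather than a full genotype.

```python
def missingness_by_allele(weights, allele_freqs, miss_probs):
    """Pooled P(missing | allele is A) and P(missing | allele is not A).

    Within each subpopulation, missingness is independent of the allele;
    any difference between the two returned values comes from pooling.
    """
    # P(A) and P(A and missing) in the pooled population
    p_a = sum(w * q for w, q in zip(weights, allele_freqs))
    p_a_miss = sum(w * q * m for w, q, m in zip(weights, allele_freqs, miss_probs))
    # P(not A) and P(not A and missing) in the pooled population
    p_not = sum(w * (1 - q) for w, q in zip(weights, allele_freqs))
    p_not_miss = sum(w * (1 - q) * m
                     for w, q, m in zip(weights, allele_freqs, miss_probs))
    return p_a_miss / p_a, p_not_miss / p_not


# Assumed example: two equally sized subpopulations.
weights = [0.5, 0.5]
allele_freqs = [0.1, 0.5]   # frequency of allele A in each subpopulation
miss_probs = [0.2, 0.6]     # probability that a parent is missing

p_miss_given_a, p_miss_given_not_a = missingness_by_allele(
    weights, allele_freqs, miss_probs)
print(p_miss_given_a, p_miss_given_not_a)
```

With these made-up numbers the pooled missingness probability is about 0.53 for A alleles and about 0.34 for non-A alleles, even though the two are independent within each subpopulation.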
Allen et al. [5] developed parental-controlled association tests that are valid when parental genotype data are informatively missing. This approach retains comparable power when the data are MAR but assumes the marker being considered is biallelic. Here we propose a new approach to multiallele association testing that is robust to informative missingness. This approach uses a multiallele extension of the missingness model presented in Allen et al. [5] and a bootstrapping procedure to recalibrate MAR-based methods to account for informative missingness. In the next section we present the missingness model and show how the bootstrap calibration procedure can be used to correct MAR-based TRANSMIT results for informative missingness. We illustrate our methodology by applying it, as well as an unadjusted MAR-based approach, to data on hypertensive probands and their parents who participated in the Framingham Heart Study. We conclude with a discussion of the results of this analysis, including the possibility of informative missingness in this data set.
Methods
A model for informative missingness
Assume a trio-based sampling design in which N individuals with the disease or trait of interest (denoted by D = 1) are sampled and, when possible, their parents are also recruited. At a locus of interest, let the proband genotype be denoted G_o and let G_f (G_m) denote the paternal (maternal) genotype. Let G_p = (G_f, G_m) and assume there are K alleles at the locus of interest. Let R be an indicator of missingness so that R = (1, 1) if neither parent is missing, R = (0, 1) if the father but not the mother is missing, R = (1, 0) if the mother but not the father is missing, and R = (0, 0) if both parents are missing. Let G_p^obs and G_p^mis denote the observed and missing parental genotype information, respectively, so that for R = (1, 0), G_p^obs = G_f and G_p^mis = G_m. If R = (1, 1), then G_p^mis is the empty set and G_p^obs = G_p. Allen et al. [5] show that the conditional likelihood of G_o, given missingness and the offspring being diseased (equation (1)), can be written in terms of a missingness model θ_R(G_p), a parental genotype distribution p_0(G_p), and Pr(G_o | G_p, D = 1), the transmission probabilities conditional on parental genotype given by Schaid and Sommer [6].
Over the entire sample, the likelihood combines the contributions of G_oi and G_pi^obs, the offspring genotype and observed parental information for the ith proband.
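As a rough sketch of how a likelihood of this kind can arise, and not necessarily the exact form of equation (1) in [5], suppose (these are assumptions, not definitions taken from the source) that θ_R(G_p) = Pr(R | G_p, D = 1), that p_0(G_p) = Pr(G_p | D = 1), and that missingness R is independent of the offspring genotype given the parental genotypes and D = 1. Writing G_p^obs and G_p^mis for the observed and missing parts of G_p, Bayes' rule then gives:

```latex
% Assumed definitions (hypothetical; the source's own definitions are not shown):
%   \theta_R(G_p) = \Pr(R \mid G_p, D = 1), \qquad p_0(G_p) = \Pr(G_p \mid D = 1),
%   and R is independent of G_o given (G_p, D = 1).
\begin{align*}
\Pr(G_p \mid R, D = 1)
  &= \frac{\theta_R(G_p)\, p_0(G_p)}
          {\sum_{G_p^{*}} \theta_R(G_p^{*})\, p_0(G_p^{*})}, \\
\Pr(G_o, G_p^{\mathrm{obs}} \mid R, D = 1)
  &= \sum_{G_p^{\mathrm{mis}}} \Pr(G_o \mid G_p, D = 1)\, \Pr(G_p \mid R, D = 1).
\end{align*}
```

Under these assumptions the sum over G_p^mis runs over all values of the missing parental genotypes that are compatible with the observed data; when neither parent is missing it reduces to a single term.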
In order to specify equation (1) fully, we need a model for θ_R(G_p) and p_0(G_p) (we will only be optimizing this likelihood under the null, so Pr(G_o | G_p, D = 1) is purely combinatoric). We assumed random mating and allowed for departures from Hardy-Weinberg equilibrium by estimating the value of the fixation index F [7]. Specifically, we assume p_0(G_p) = p_0(G_f) p_0(G_m), where the single-parent genotype probability p_0(G) depends on F and on the allele frequencies π_k and π_l of the alleles k and l that make up genotype G.
We arrived at a log-linear model for θ_R(G_p) whose covariates are the number of copies of allele k in the father's (mother's) genotype. This model was found to be both well identified and rich enough to handle a variety of realistic missing data scenarios, allowing for differential maternal and paternal effects on an individual's missingness as well as an effect due to his or her spouse.
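To make the role of the fixation index concrete, the following sketch computes genotype probabilities for a K-allele locus using the textbook parameterization (homozygote kk: π_k² + F π_k (1 − π_k); heterozygote kl: 2 π_k π_l (1 − F)). This form is an assumption here, since the expression used in the original equation is not shown; the function names and example frequencies are hypothetical.

```python
from itertools import combinations_with_replacement


def genotype_probs(allele_freqs, F):
    """Probabilities of unordered genotypes (k, l), k <= l, with fixation index F.

    Assumed textbook form: F = 0 gives Hardy-Weinberg proportions, and
    F > 0 shifts probability from heterozygotes to homozygotes.
    """
    probs = {}
    n_alleles = len(allele_freqs)
    for k, l in combinations_with_replacement(range(n_alleles), 2):
        pk, pl = allele_freqs[k], allele_freqs[l]
        if k == l:
            probs[(k, l)] = pk * pk + F * pk * (1 - pk)
        else:
            probs[(k, l)] = 2 * pk * pl * (1 - F)
    return probs


def parental_pair_prob(g_father, g_mother, allele_freqs, F):
    """p_0(G_p) = p_0(G_f) * p_0(G_m): spouse genotypes treated as independent."""
    probs = genotype_probs(allele_freqs, F)
    return probs[g_father] * probs[g_mother]


# Assumed example: three alleles and a mild departure from Hardy-Weinberg.
freqs = [0.2, 0.3, 0.5]
probs = genotype_probs(freqs, F=0.1)
print(sum(probs.values()))                      # sums to 1 up to rounding
print(parental_pair_prob((0, 1), (2, 2), freqs, F=0.1))
```

Whatever the value of F, the genotype probabilities sum to one, so F can be estimated alongside the allele frequencies without any further constraint.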
Bootstrap calibration of MAR procedures
As mentioned above, tests derived from procedures assuming MAR will have improper size when informative missingness holds. In particular, our simulations show that Clayton's approach as implemented in the TRANSMIT program [2] can result in greatly inflated type I error rates when presented with plausible informative missingness scenarios. We propose a bootstrap calibration of MAR-based tests using the informative missingness model presented above. The procedure, applied to the TRANSMIT approach of Clayton [2], is as follows.
First, we fit the informative missingness model by maximizing the conditional likelihood (equation (1)) under the null hypothesis of no transmission disequilibrium to obtain parameter estimates. Note that under the null, Pr(G_o | G_p, D = 1) is made up of known constants and need not be estimated.
With these parameters, we performed the following procedure:

1. We sampled parental genotypes from Pr(G_m, G_f | D = 1, R), evaluated at the fitted parameter estimates.
2. Given the full set of parental data, we imputed offspring genotypes by randomly sampling one allele from each parent's genotype.
3. We calculated the test statistic T for the null hypothesis of no association between disease and marker (or a particular allele) via TRANSMIT, using the imputed offspring data and the sampled parental data with the originally unobserved parent (if any) set to missing.
Steps 1–3 were repeated until B replicates had been obtained (we used B = 999). The 100 × (1 - α)th percentile of the empirical distribution of the test statistics {T_1, ..., T_B} was taken as the critical value of an α-level test calibrated for informative missingness. A test statistic t obtained from TRANSMIT applied to the original data can then be compared with this critical value to determine the significance of the results.
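The resampling loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_parents` stands in for a draw from the fitted missingness model and `statistic` stands in for a call to TRANSMIT (both are hypothetical callables supplied by the user), the pattern coding (1 = observed, father first) is assumed, and the p-value with the +1 correction is a common convention added here alongside the critical value described above.

```python
import math
import random


def sample_offspring(father, mother, rng):
    """Step 2: null transmission, one randomly chosen allele from each parent."""
    return tuple(sorted((rng.choice(father), rng.choice(mother))))


def bootstrap_null_stats(patterns, sample_parents, statistic, n_boot=999, seed=0):
    """Repeat steps 1-3 n_boot times and return the null statistics.

    patterns: one (father_observed, mother_observed) pair per family,
              with 1 = observed and 0 = missing (assumed coding).
    sample_parents(pattern, rng) -> (father_genotype, mother_genotype)
    statistic(list of (child, father_or_None, mother_or_None)) -> float
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        replicate = []
        for pattern in patterns:
            father, mother = sample_parents(pattern, rng)        # step 1
            child = sample_offspring(father, mother, rng)        # step 2
            replicate.append((child,
                              father if pattern[0] else None,    # re-mask the
                              mother if pattern[1] else None))   # missing parent
        stats.append(statistic(replicate))                       # step 3
    return stats


def critical_value(null_stats, alpha):
    """Empirical 100 * (1 - alpha)th percentile of the bootstrap statistics."""
    ordered = sorted(null_stats)
    rank = math.ceil((1 - alpha) * len(ordered))   # 1-based order statistic
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


def bootstrap_p_value(observed, null_stats):
    """Share of replicates at least as large as the observed statistic."""
    exceed = sum(1 for t in null_stats if t >= observed)
    return (exceed + 1) / (len(null_stats) + 1)
```

In use, the statistic computed on the original data would be compared with `critical_value(stats, alpha)`, or equivalently summarized by `bootstrap_p_value`.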
Example data
We applied this approach to 224 nuclear families extracted from the Framingham Heart Study pedigree data provided by Genetic Analysis Workshop 13 (GAW13). Individuals were selected based on the presence of both phenotypic and genotypic information. Individuals were classified as hypertensive if they had hypertension at any exam or were taking medication for hypertension. Of the nuclear families with at least one affected offspring, 77 had complete parental data, 96 had only maternal data, and 51 had only paternal data. We excluded probands without any parental genotype data because they are likely to contribute little information. For families with more than one affected offspring, we randomly selected one proband. We focused on the chromosome 17q21-q23 region, which had been linked to hypertension (as a quantitative trait) in previous studies [8, 9] and contains markers GATA25A04 and ATC6A06 [8] as well as GATA49C09 [9]. Rare alleles were pooled with nearest repeats to maintain stable estimates. We tested for association between alleles at these markers and the hypertension phenotype using both the bootstrap recalibration procedure (adjusted for informative missingness) and the unadjusted (i.e., without the bootstrapping procedure) MAR-based results from TRANSMIT. The results of our analysis are presented in Table 1.
Results
The bootstrap-calibrated and MAR-based inferences corresponded well on marker GATA25A04. Results on markers GATA49C09 and ATC6A06 showed more discrepancies. Quantitative differences were evident at many of the alleles on these markers, especially the combined alleles 158 and 166 of marker GATA49C09. Although the differences for any given allele were marginal for ATC6A06, differences between the two procedures' overall chi-square tests at each marker were more substantial. An analysis of intact trios supported the conclusions of the bootstrap-calibrated inferences, finding no association with combined alleles 158 and 166 of marker GATA49C09 (results not shown). The intact trio analysis is valid under certain types of informative missingness, though at a loss of power relative to our bootstrap calibration approach.
Discussion
The discrepancies seen in this analysis between the MAR-based and the bootstrap-calibrated tests may be due to the presence of informative missingness at markers GATA49C09 and ATC6A06 in this data set. This conclusion is supported by the intact trio analysis. In addition, the differences between the MAR-based and bootstrap-calibrated p-values observed were consistent with those documented in simulations [5]. In these simulations, informative missingness causes MAR-based procedures to yield smaller-than-warranted p-values, leading to an increased type I error rate. Moreover, informative missingness is certainly plausible in this region due to its close proximity to a number of cancer genes, including BRCA1. Further data including a denser marker set in this region will be helpful in confirming this possibility.
On the surface, it may appear that the lack of parental genotype information would make the problem of informative missingness intractable, or worse, that modelling informative missingness could lead to biased results through the introduction of unverifiable assumptions. However, there is, in fact, sufficient information in the way of constraints imposed by spouse and offspring genotypes to make estimation of the effect of genotype on missingness not only tractable but more robust than the standard MAR-based analysis. Simulations suggest that even very mild informative missingness can have an enormous impact on the size of MAR-based tests [5]. The bootstrap calibration approach proposed here protects against this inflation with minimal impact on power.
References
1. Weinberg CR: Allowing for missing parents in genetic studies of case-parent triads. Am J Hum Genet. 1999, 64: 1186-1193. 10.1086/302337.
2. Clayton DA: Generalization of the transmission/disequilibrium test for uncertain haplotype transmission. Am J Hum Genet. 1999, 65: 1170-1177. 10.1086/302577.
3. Sun FZ, Flanders WD, Yang QH, Khoury MJ: A new method for estimating the risk ratio in studies using case-parental control design. Am J Epidemiol. 1998, 148: 902-909.
4. Sun FZ, Flanders WD, Yang QH, Khoury MJ: Transmission disequilibrium test (TDT) when only one parent is available: the 1-TDT. Am J Epidemiol. 1999, 150: 97-104.
5. Allen AS, Rathouz PJ, Satten GA: Informative missingness in genetic association studies: case-parent designs. Am J Hum Genet. 2003, 72: 671-680. 10.1086/368276.
6. Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet. 1993, 53: 1114-1126.
7. Hartl DL, Clark AG: Principles of Population Genetics. Sunderland, MA, Sinauer Associates. 1997, 3rd ed.
8. Levy D, DeStefano AL, Larson MG, O'Donnell CJ, Lifton RP, Gavras H, Cupples A, Myers RH: Evidence for a gene influencing blood pressure on chromosome 17. Hypertension. 2000, 36: 477-483.
9. O'Donnell CJ, Lindpainter K, Larson MG, Rao VS, Ordovas JM, Schaefer EJ, Myers RH, Levy D: Evidence for association and genetic linkage of the angiotensin-converting enzyme locus with hypertension and blood pressure in men but not women in the Framingham Heart Study. Circulation. 1998, 97: 1766-1772.
Allen, A.S., Collins, J.S., Rathouz, P.J. et al. Bootstrap calibration of TRANSMIT for informative missingness of parental genotype data. BMC Genet 4, S39 (2003). doi:10.1186/1471-2156-4-S1-S39
Keywords
 Framingham Heart Study
 Parental Genotype
 Parental Data
 Affected Offspring
 Offspring Genotype