Bootstrap calibration of TRANSMIT for informative missingness of parental genotype data

Informative missingness of parental genotype data occurs when the genotype of a parent influences the probability of the parent's genotype data being observed. Informative missingness can occur in a number of plausible ways and can affect both the validity and power of procedures that assume the data are missing at random (MAR). We propose a bootstrap calibration of MAR procedures to account for informative missingness and apply our methodology to refine the approach implemented in the TRANSMIT program. We illustrate this approach by applying it to data on hypertensive probands and their parents who participated in the Framingham Heart Study.


Background
Missing parental genotype data is a common problem in association studies utilizing parental controls and has led to the development of a variety of approaches aimed at extending standard methods such as the transmission-disequilibrium test (TDT) to allow for a parent's genotype to be missing [1-4]. A common assumption made in such approaches [1,2] is that parental missingness is not related to the underlying, unobserved genotype of the missing parent. This assumption is referred to as noninformative missingness or missingness at random (MAR) and allows reconstruction of the missing parent's genotype using the genotype frequencies among the observed parents and constraints imposed by spouse and offspring genotypes. Informative missingness, on the other hand, occurs when a parent's missingness is related to his or her genotype at the locus of interest. In this case, the distribution of genotypes in missing parents cannot be immediately constructed from the distribution of genotypes among parents in intact trios. Therefore, when informative missingness is present, procedures that assume MAR will tend to reconstruct the genotypes of missing parents incorrectly, leading to biased results.
Informative missingness can occur in several ways. First, the parent's genotype may in fact be associated with the disease of interest which in turn, if manifest in the parent, may cause or influence missingness. Second, the genotype may be associated with a different disease that results in parental missingness. Finally, informative missingness can arise strictly as an artifact of population stratification. Consider a population comprising a number of subpopulations with varying allele frequencies and assume that the probability of a parent being missing also varies across these subpopulations. Even though there may be no relationship between missingness and parental genotype in any given subpopulation, in the overall population, the two may be correlated, resulting in informative missingness. Note that the second and third situations described above affect the null distribution of parental controlled association tests and can, as a result, lead to invalid inference. The first situation only affects the alternative hypothesis and hence only affects power.
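The stratification argument can be checked with a small exact calculation. The sketch below is illustrative only: the two subpopulations, their allele frequencies, and their missingness probabilities are hypothetical numbers, and Hardy-Weinberg proportions are assumed within each subpopulation. Missingness is independent of genotype inside each stratum, yet the pooled probability of being missing varies with genotype.

```python
from fractions import Fraction


def genotype_freqs(p):
    """Hardy-Weinberg frequencies of carrying 0, 1, or 2 copies of an allele with frequency p."""
    q = 1 - p
    return {0: q * q, 1: 2 * p * q, 2: p * p}


def missing_prob_by_genotype(strata):
    """Pr(parent missing | genotype) in the pooled population.

    strata is a list of (weight, allele_freq, missing_prob) tuples; within each
    stratum, missingness is independent of genotype by construction.
    """
    result = {}
    for g in (0, 1, 2):
        joint = sum(w * genotype_freqs(p)[g] * m for w, p, m in strata)
        marginal = sum(w * genotype_freqs(p)[g] for w, p, m in strata)
        result[g] = joint / marginal
    return result


# Hypothetical strata: (population share, allele frequency, Pr(missing)).
strata = [
    (Fraction(1, 2), Fraction(1, 10), Fraction(1, 10)),
    (Fraction(1, 2), Fraction(1, 2), Fraction(2, 5)),
]
pooled = missing_prob_by_genotype(strata)
for copies in (0, 1, 2):
    print(copies, float(pooled[copies]))
```

With these numbers the pooled missingness probability increases with the number of copies of the allele, whereas a single stratum on its own returns the same probability for every genotype.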
Allen et al. [5] developed parental controlled association tests that are valid when parental genotype data are informatively missing. This approach retains comparable power when the data are MAR but assumes the marker being considered is bi-allelic. Here we propose a new approach to multiallele association testing that is robust to informative missingness. This approach uses a multiallele extension of the missingness model presented in Allen et al. [5] and a bootstrapping procedure to recalibrate MAR-based methods to account for informative missingness. In the next section we present the missingness model and how the bootstrap calibration procedure can be used to correct MAR-based TRANSMIT results for informative missingness. We illustrate our methodology by applying it, as well as an unadjusted MAR-based approach, to data on hypertensive probands and their parents who participated in the Framingham Heart Study. We conclude with a discussion of the results of this analysis including the possibility of informative missingness in this data set.

A model for informative missingness
Assume a trio-based sampling design in which N individuals with the disease or trait of interest (denoted by D = 1) are sampled and, when possible, their parents are also recruited. At a locus of interest, let the proband genotype be denoted $G_o$ and let $G_f$ ($G_m$) denote the paternal (maternal) genotype. Let $G_p = (G_f, G_m)$ and assume there are K alleles at the locus of interest. Let R be an indicator of missingness so that R = (1, 1) if neither parent is missing, R = (0, 1) if the father but not the mother is missing, R = (1, 0) if the mother but not the father is missing, and R = (0, 0) if both parents are missing. Let $G_p^{\mathrm{obs}}$ and $G_p^{\mathrm{mis}}$ denote the observed and missing parental genotype information, respectively; when neither parent is missing, $G_p^{\mathrm{mis}}$ is the empty set and $G_p^{\mathrm{obs}} = G_p$. With $\theta_R(G_p)$ and $p_0(G_p)$ defined as in Allen et al. [5], the conditional likelihood of $G_o$, given missingness and the offspring being diseased, can be written as equation (1), in which $\Pr(G_o \mid G_p, D = 1)$ are the transmission probabilities conditional on parental genotype given by Schaid and Sommer [6]. Over the entire sample, the likelihood is the product of these contributions over probands, where $G_{oi}$ and $G_{pi}^{\mathrm{obs}}$ are the offspring genotype and observed parental information for the i-th proband.
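Under the null hypothesis of no transmission disequilibrium, the transmission probabilities reduce to Mendelian probabilities, which can be enumerated directly. The following minimal sketch assumes genotypes are coded as unordered pairs of allele labels; the function name and the coding are illustrative and are not taken from TRANSMIT or from Schaid and Sommer [6].

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def mendelian_offspring_dist(father, mother):
    """Pr(offspring genotype | parental genotypes) under Mendelian transmission.

    Each genotype is a pair of allele labels, e.g. (1, 2). Each parent transmits
    one of its two alleles with probability 1/2, giving four equally likely
    combinations that are tallied by unordered offspring genotype.
    """
    counts = Counter()
    for paternal_allele, maternal_allele in product(father, mother):
        counts[tuple(sorted((paternal_allele, maternal_allele)))] += 1
    return {genotype: Fraction(n, 4) for genotype, n in counts.items()}


# Example with a three-allele marker (allele labels are hypothetical).
print(mendelian_offspring_dist((1, 2), (2, 3)))
```

For two heterozygous (1, 2) parents the function returns probabilities 1/4, 1/2, and 1/4 for genotypes (1, 1), (1, 2), and (2, 2).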
In order to specify equation (1) fully, we need a model for $\theta_R(G_p)$ and $p_0(G_p)$ (we will only be optimizing this likelihood under the null, so $\Pr(G_o \mid G_p, D = 1)$ is purely combinatoric). We assumed assortative mating and allowed for departures from Hardy-Weinberg equilibrium by estimating the value of the fixation index F [7]. Specifically, we wrote $p_0(G_p)$ in terms of genotype frequencies that depend on F, where genotype G consists of alleles k and l with allele frequencies $\pi_k$ and $\pi_l$, respectively. We arrived at a log-linear model for $\theta_R(G_p)$. Specifically, we take $\log \theta_R(G_p)$ to be linear in the number of copies of each allele k in the father's (mother's) genotype. This model was found to be both well identified and rich enough to handle a variety of realistic missing data scenarios, allowing for differential maternal and paternal effects on an individual's missingness as well as an effect due to his or her spouse.
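For readers unfamiliar with the fixation index, the sketch below shows the standard textbook parameterization of genotype frequencies in terms of allele frequencies and F. It is an assumption that the model used here takes exactly this form; the function names and example frequencies are hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def genotype_frequency(k, l, allele_freqs, F):
    """Frequency of the unordered genotype (k, l) with fixation index F.

    Standard parameterization (assumed here, not confirmed by the text):
      homozygote   k == l: pi_k**2 + F * pi_k * (1 - pi_k)
      heterozygote k != l: 2 * pi_k * pi_l * (1 - F)
    F = 0 recovers Hardy-Weinberg proportions.
    """
    pk, pl = allele_freqs[k], allele_freqs[l]
    if k == l:
        return pk * pk + F * pk * (1 - pk)
    return 2 * pk * pl * (1 - F)


def all_genotype_frequencies(allele_freqs, F):
    """Map every unordered genotype to its frequency."""
    alleles = sorted(allele_freqs)
    return {
        (k, l): genotype_frequency(k, l, allele_freqs, F)
        for k, l in combinations_with_replacement(alleles, 2)
    }


# Hypothetical three-allele marker with a small departure from Hardy-Weinberg.
freqs = {1: Fraction(1, 2), 2: Fraction(3, 10), 3: Fraction(1, 5)}
table = all_genotype_frequencies(freqs, Fraction(1, 20))
print({g: float(f) for g, f in table.items()})
```

The genotype frequencies sum to one for any F, and a positive F shifts probability from heterozygotes to homozygotes.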

Bootstrap calibration of MAR procedures
As mentioned above, tests derived from procedures assuming MAR will have improper size when informative missingness holds. In particular, our simulations show that Clayton's approach as implemented in the TRANSMIT program [2] can result in greatly inflated type I error rates when presented with plausible informative missingness scenarios. We propose a bootstrap calibration of MAR-based tests using the informative missingness model presented above. The procedure, applied to the TRANSMIT approach of Clayton [2], is as follows.
First, we fit the informative missingness model by maximizing the conditional likelihood (equation (1)) under the null hypothesis of no transmission disequilibrium to obtain estimates $\hat{\Phi}$ of the allele frequencies $\pi$ and the missingness parameters $\gamma$. Note that under the null, $\Pr(G_o \mid G_p, D = 1)$ is made up of known constants and need not be estimated.
With these parameters, we performed the following procedure:
1. We sampled parental data from $\Pr(G_m, G_f \mid D = 1, R; \hat{\Phi})$, where $\hat{\Phi}$ denotes the fitted parameter estimates.
2. Given the full set of parental data, we imputed offspring data by randomly sampling an allele from each parent's genotype.
3. We calculated the test statistic T testing the null hypothesis of no association between disease and marker (or a particular allele) via TRANSMIT, using the imputed offspring data and the sampled parental data with the originally unobserved parent (if any) set to missing.
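Step 2 and the masking in step 3 can be sketched as follows. This is a minimal illustration, assuming genotypes are coded as pairs of allele labels and that R is ordered (father, mother) with 0 marking a missing parent, as in the definition of R above. The function names are hypothetical, and neither the sampling of parental genotypes in step 1 nor the call to TRANSMIT is shown.

```python
import random


def impute_offspring(father, mother, rng):
    """Draw one allele at random from each parent; return the unordered offspring genotype."""
    return tuple(sorted((rng.choice(father), rng.choice(mother))))


def mask_missing(father, mother, R):
    """Set a parent's genotype to None when the matching entry of R is 0 (parent unobserved)."""
    observed_father = father if R[0] == 1 else None
    observed_mother = mother if R[1] == 1 else None
    return observed_father, observed_mother


rng = random.Random(13)  # fixed seed so the sketch is reproducible

# Hypothetical sampled parental genotypes with each family's original missingness pattern.
families = [
    {"father": (1, 2), "mother": (2, 3), "R": (0, 1)},
    {"father": (1, 1), "mother": (2, 2), "R": (1, 1)},
]

replicate = []
for fam in families:
    child = impute_offspring(fam["father"], fam["mother"], rng)
    father_obs, mother_obs = mask_missing(fam["father"], fam["mother"], fam["R"])
    replicate.append({"offspring": child, "father": father_obs, "mother": mother_obs})

print(replicate)
```

Each bootstrap replicate therefore has the same missingness pattern as the original data, so the MAR-based statistic is evaluated under the same conditions in which it was originally computed.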
Steps 1-3 were repeated until B replicates had been obtained (we used B = 999). The $100 \times (1 - \alpha)$th percentile of the empirical distribution of the test statistics $\{T_1, \ldots, T_B\}$ was taken as the critical value of an α-level test calibrated for informative missingness. A test statistic t obtained from TRANSMIT applied to the original data can then be compared with this critical value to determine the significance of the results.
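The calibration step itself is a percentile lookup. The sketch below makes two assumptions that the text does not specify: the percentile is taken as the order statistic of rank ceil((1 - α)(B + 1)), which is a whole number when B = 999 and α = 0.05, and a bootstrap p-value is computed with the common (1 + count)/(B + 1) convention. The function names are hypothetical.

```python
import math


def bootstrap_critical_value(stats, alpha):
    """Critical value of an alpha-level test from B bootstrap statistics.

    Uses the order statistic of rank ceil((1 - alpha) * (B + 1)), capped at B.
    This percentile convention is an assumption of the sketch.
    """
    ordered = sorted(stats)
    B = len(ordered)
    rank = math.ceil(round((1 - alpha) * (B + 1), 9))  # round() guards against float noise
    rank = min(max(rank, 1), B)
    return ordered[rank - 1]


def bootstrap_p_value(stats, t):
    """Proportion of bootstrap statistics at least as large as t, with the +1 correction."""
    exceed = sum(1 for s in stats if s >= t)
    return (1 + exceed) / (len(stats) + 1)


# Toy bootstrap distribution: the integers 1..999 stand in for T_1, ..., T_B.
toy_stats = list(range(1, 1000))
critical = bootstrap_critical_value(toy_stats, 0.05)
p_value = bootstrap_p_value(toy_stats, 990)
print(critical, p_value)
```

In practice the list of statistics would hold the B values of T returned by TRANSMIT on the bootstrap replicates, and t would be the statistic from the original data.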

Example data
We applied this approach to 224 nuclear families extracted from the Framingham Heart Study pedigree data provided by Genetic Analysis Workshop 13 (GAW13). Individuals were selected based on the presence of both phenotypic and genotypic information. Individuals were given a hypertensive phenotype if they had hypertension at any exam or were taking medication for hypertension. Of nuclear families with at least one affected offspring, 77 had complete parental data; 96 had only maternal data; and 51 had only paternal data. We excluded probands without any parental genotype data because they are likely to contribute little information. For families with more than one affected offspring we randomly selected one proband. We focused on the chromosome 17q21-q23 region, which had been linked to hypertension (as a quantitative trait) in previous studies [8,9], containing markers GATA25A04 and ATC6A06 [8] as well as GATA49C09 [9]. Rare alleles were pooled with nearest repeats to maintain stable estimates. We tested for association between alleles at these markers and the hypertension phenotype using both the bootstrap recalibration procedure (adjusted for informative missingness) and the unadjusted (i.e., without the bootstrapping procedure) MAR-based results from TRANSMIT. The results of our analysis are presented in Table 1.

Results
The bootstrap-calibrated and MAR-based inferences corresponded well on marker GATA25A04. Results on markers GATA49C09 and ATC6A06 showed more discrepancies. Quantitative differences were evident at many of the alleles on these markers, especially the combined alleles 158 and 166 of marker GATA49C09. Though these effects for any given allele were marginal for ATC6A06, differences between the two procedures' overall chi-square tests at each marker were more substantial. An analysis of intact trios supported the conclusions of the bootstrap-calibrated inferences, finding no association with combined alleles 158 and 166 of marker GATA49C09 (results not shown). The intact trio analysis is valid under certain types of informative missingness, though at a loss of power relative to our bootstrap calibration approach.

Discussion
The discrepancies seen in this analysis between the MAR-based and the bootstrap-calibrated tests may be due to the presence of informative missingness at markers GATA49C09 and ATC6A06 in this data set. This conclusion is supported by the intact trio analysis. In addition, the differences between the MAR-based and bootstrap-calibrated p-values observed were consistent with those documented in simulations [5]. In these simulations, informative missingness causes MAR-based procedures to yield smaller-than-warranted p-values, leading to an increased type I error rate. Moreover, informative missingness is certainly plausible in this region due to its close proximity to a number of cancer genes, including BRCA1. Further data including a denser marker set in this region will be helpful in confirming this possibility.
On the surface, it may appear that the lack of parental genotype information would make the problem of informative missingness intractable, or worse, that modelling informative missingness could lead to biased results through the introduction of unverifiable assumptions. However, there is, in fact, sufficient information in the way of constraints imposed by spouse and offspring genotypes to make estimation of the effect of genotype on missingness not only tractable but more robust than the standard MAR-based analysis. Simulations suggest that even very mild informative missingness can have an enormous impact on the size of MAR-based tests [5]. The bootstrap calibration approach proposed here protects against this inflation with minimal impact on power.