  • Methodology article
  • Open access

Rare variant association analysis in case-parents studies by allowing for missing parental genotypes

Abstract

Background

The development of next-generation sequencing technologies has facilitated the identification of rare variants. Family-based design is commonly used to effectively control for population admixture and substructure, which is more prominent for rare variants. Case-parents studies, as typical strategies in family-based design, are widely used in rare variant-disease association analysis. Current methods in case-parents studies are based on complete case-parents data; however, parental genotypes may be missing in case-parents trios, and removing these data may lead to a loss in statistical power. The present study focuses on testing for rare variant-disease association in case-parents study by allowing for missing parental genotypes.

Results

In this report, we extended the collapsing method for rare variant association analysis in case-parents studies to allow for missing parental genotypes, and investigated the performance of two methods: one based on the difference of genotypes between affected offspring and their corresponding “complements” in case-parent trios, and one based on the transmission/disequilibrium test (TDT) framework. Using simulations, we showed that, compared with methods that use only complete case-parents data, the proposed strategy allowing for missing parental genotypes, or even adding unrelated affected individuals, can greatly improve the statistical power while remaining unaffected by population stratification.

Conclusions

We conclude that adding case-parents data with missing parental genotypes to a complete case-parents data set can greatly improve the power of our strategy for detecting rare variant-disease association.

Background

The development of next-generation sequencing technologies has facilitated association studies of rare variants (minor allele frequency (MAF) < 1%). Family-based design, as an important strategy in genetic studies (especially for rare variants) for human complex diseases, has some advantages over population-based design [1, 2]. The most prominent advantage is that many family-based association methods can effectively control for population admixture and substructure which is more prominent for rare variants and thus avoid spurious associations due to population admixture or substructure [3, 4]. Moreover, family-based design can be used to study complex mechanisms, such as parent-of-origin effects and maternally mediated genetic effects, which are difficult to detect with unrelated individuals in population-based design [5]. Case-parents study, as a typical strategy in family-based design, is widely used in rare variant-disease association analysis. For example, combined multivariate and collapsing (CMC) [6], weighted sum statistic (WSS) [7, 8], variable threshold (VT) [9], and the burden of rare variants (BRV) have all been extended into the transmission/disequilibrium test (TDT) [10] framework [11]. Another commonly used method in case-parents study is to treat nontransmitted genotypes of parents to affected offspring as control (also called pseudocontrols or complements) of affected offspring [5, 12, 13]. For example, investigators can construct a difference vector by comparing the genotypes of affected offspring with their corresponding “complements” and use the collapsing method [6, 7] to detect rare causal variants.

A problem with case-parents studies is that, in practice, not all parental genotypes (one or both) are available. For example, parents may have died, especially for older patients with late-onset diseases such as Alzheimer disease and hypertension, or parents may decline to participate in clinical research. It is often difficult to recruit large enough samples for a case-parents study, especially for rare diseases, and thus the sample size is generally small. Discarding families with one or both parental genotypes missing can lead to a loss of statistical power. Statistical methods for case-parents studies that allow for missing parental genotypes have been widely developed for common variant-disease association analysis [14, 15]. However, few works discuss rare variant-disease association in case-parents studies when parental genotypes are missing. Because missing both parental genotypes corresponds to case-only data (that is, unrelated affected individuals), allowing for trios with one missing parental genotype or for case-only data increases the sample size of a case-parents study and thus may enhance statistical power for rare variant association analysis. Therefore, it is useful to develop statistical approaches for case-parents studies that allow for missing parental genotypes when testing rare variant-disease association.

In this report, we extend the collapsing method for rare variant association analysis to case-parents studies by using the genotype difference between affected offspring and their corresponding “complements” in case-parents trios, and by using the TDT framework. Our strategy allows one or both parental genotypes to be missing (the latter corresponding to case only). We develop our strategy in homogeneous populations. Through simulation studies, we investigate the performance of the proposed method in a homogeneous population as well as in populations with population stratification under three scenarios: complete case-parents data mixed with trios missing one parental genotype, complete case-parents data mixed with trios missing both parental genotypes, and complete case-parents data mixed with trios missing one or both parental genotypes.

Methods

In this study, all datasets were publicly available and no research requiring ethics approval was conducted.

Notation

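Consider a data set \( \Omega =\left\{{\Omega}_0,{\Omega}_{\mathrm{I}},{\Omega}_{\mathrm{II}}\right\} \) from a homogeneous population that consists of three types of case-parents trios, with the genotype of the affected offspring known in each family. \( {\Omega}_0 \), \( {\Omega}_{\mathrm{I}} \), and \( {\Omega}_{\mathrm{II}} \) denote the trios with 0, 1, and 2 missing parental genotypes, respectively. We consider three combinations of them: \( {\Omega}_{0+\mathrm{I}}=\left\{{\Omega}_0,{\Omega}_{\mathrm{I}}\right\} \), \( {\Omega}_{0+\mathrm{II}}=\left\{{\Omega}_0,{\Omega}_{\mathrm{II}}\right\} \), and \( {\Omega}_{0+\mathrm{I}+\mathrm{II}}=\left\{{\Omega}_0,{\Omega}_{\mathrm{I}},{\Omega}_{\mathrm{II}}\right\} \). \( {\Omega}_{0+\mathrm{I}} \) is the data set consisting of complete case-parents trios, in which the genotype of every member is known (type \( {\Omega}_0 \)), and trios with one missing parental genotype (type \( {\Omega}_{\mathrm{I}} \)). \( {\Omega}_{0+\mathrm{II}} \) is the data set consisting of type \( {\Omega}_0 \) trios and type \( {\Omega}_{\mathrm{II}} \) trios, in which the genotypes of both parents are missing. \( {\Omega}_{0+\mathrm{I}+\mathrm{II}} \) includes trios of all three types. We assume that N case-parents trios are sampled, with \( {N}_0 \), \( {N}_{\mathrm{I}} \), and \( {N}_{\mathrm{II}} \) trios of type \( {\Omega}_0 \), \( {\Omega}_{\mathrm{I}} \), and \( {\Omega}_{\mathrm{II}} \), respectively (\( N={N}_0+{N}_{\mathrm{I}}+{N}_{\mathrm{II}} \)). Let \( {G}_O \) be the minor allele count carried by the affected offspring, and let \( \left\{{G}_F,{G}_M\right\} \) be the minor allele counts carried by the parents in a case-parents trio. The curly braces indicate an unordered set rather than an ordered pair. For example, \( \left\{{G}_F,{G}_M\right\}=\left\{1,2\right\} \) means \( {G}_F=1,{G}_M=2 \) or \( {G}_F=2,{G}_M=1 \). Let the triplet \( \left(\left\{{G}_F,{G}_M\right\},{G}_O\right) \) denote a case-parents trio.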
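To make the notation concrete, the following minimal Python sketch stores one trio at a single variant and classifies it by the number of missing parental genotypes. The names `Trio` and `trio_type` are hypothetical and introduced here only for illustration; they do not come from the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trio:
    """Minor allele counts (0, 1 or 2) at one variant; None marks a missing parent."""
    g_o: int                      # affected offspring, always observed
    g_f: Optional[int] = None     # father
    g_m: Optional[int] = None     # mother


def trio_type(trio: Trio) -> str:
    """Return 'Omega_0', 'Omega_I' or 'Omega_II' by the number of missing parents."""
    n_missing = (trio.g_f is None) + (trio.g_m is None)
    return ("Omega_0", "Omega_I", "Omega_II")[n_missing]


trios = [Trio(1, 0, 1), Trio(2, 1, None), Trio(1)]
counts = {}
for t in trios:
    counts[trio_type(t)] = counts.get(trio_type(t), 0) + 1
# counts now holds N0, NI and NII for this toy sample
```

A trio with both parents missing carries only the offspring genotype, which is why it can equally represent an unrelated affected individual.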

Rare variants association analysis

Let \( x=2{G}_O-{G}_F-{G}_M \) be the paired difference in genotypes between the affected offspring and the complement (pseudo-control). We consider k variants, q of which are causal, in a region of interest, e.g., a gene region. The variants and case-parents trios are indexed by i and j (i = 1, …, k; j = 1, 2, …, N), respectively. We redefine the paired difference \( {\tilde{x}}_{ij} \) for the jth trio at the ith variant as follows:

$$ {\tilde{x}}_{ij}=\left\{\begin{array}{ll}{x}_{ij},& \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_0\\ {}E\left({x}_{ij}|{G}_{Oij},{G}_{Fij}\right)\ \mathrm{or}\ E\left({x}_{ij}|{G}_{Oij},{G}_{Mij}\right),& \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathbf{I}}\\ {}E\left({x}_{ij}|{G}_{Oij}\right),& \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathbf{I}\mathbf{I}}\end{array}\right. $$
(1)

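We can calculate \( E\left(x|\cdot \right) \) under the assumption of random mating (and thus Hardy-Weinberg equilibrium) and the rules of genetic inheritance if \( \left(\left\{{G}_F,{G}_M\right\},{G}_O\right)\in {\Omega}_{\mathrm{I}} \) or \( {\Omega}_{\mathrm{II}} \). For example, when \( \left(\left\{{G}_F,{G}_M\right\},{G}_O\right)\in {\Omega}_{\mathrm{II}} \) and \( {G}_O=2 \), each parent must carry at least one minor allele, so \( P\left\{\left\{{G}_F,{G}_M\right\}=\left\{1,1\right\}|{G}_O=2\right\}={\left(1-\mathrm{MAF}\right)}^2 \), \( P\left\{\left\{{G}_F,{G}_M\right\}=\left\{1,2\right\}|{G}_O=2\right\}=2\,\mathrm{MAF}\left(1-\mathrm{MAF}\right) \), and \( P\left\{\left\{{G}_F,{G}_M\right\}=\left\{2,2\right\}|{G}_O=2\right\}={\mathrm{MAF}}^2 \). It follows that \( P\left\{x=0|{G}_O=2\right\}={\mathrm{MAF}}^2 \), \( P\left\{x=1|{G}_O=2\right\}=2\,\mathrm{MAF}\left(1-\mathrm{MAF}\right) \), and \( P\left\{x=2|{G}_O=2\right\}={\left(1-\mathrm{MAF}\right)}^2 \). Thus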

$$ E\left(x|{G}_O=2\right)=\underset{i=0}{\overset{2}{\Sigma}}P\left\{x=i|{G}_O=2\right\}\cdot i=2\left(1-\mathrm{MAF}\right) $$
(2)

We use the known parental genotypes or the background population of the samples to estimate the MAF. The other conditional expectations \( E\left(x|\cdot \right) \) can be calculated similarly to Eq. (2).
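The calculation behind Eq. (2) can be sketched by brute-force enumeration. The Python sketch below sums over unobserved parental genotypes, weighting each by its Hardy-Weinberg probability and by the Mendelian probability of the observed offspring genotype. The function names (`hwe`, `p_offspring`, `expected_x`) are hypothetical, the code was written for this illustration rather than taken from the additional files, and it assumes Mendelian-consistent input with 0 < MAF < 1.

```python
from itertools import product


def hwe(maf):
    """Genotype probabilities P(G = 0, 1, 2) under Hardy-Weinberg equilibrium."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def p_offspring(g_o, g_f, g_m):
    """Mendelian probability P(G_O = g_o | G_F = g_f, G_M = g_m) at one variant."""
    total = 0.0
    for t_f, t_m in product((0, 1), repeat=2):   # allele transmitted by each parent
        if t_f + t_m == g_o:
            p_f = g_f / 2 if t_f else 1 - g_f / 2
            p_m = g_m / 2 if t_m else 1 - g_m / 2
            total += p_f * p_m
    return total


def expected_x(g_o, maf, g_known=None):
    """E(x | G_O) when g_known is None (both parents missing),
    otherwise E(x | G_O, one observed parental genotype)."""
    prior = hwe(maf)
    first = range(3) if g_known is None else (g_known,)
    num = den = 0.0
    for g_a, g_b in product(first, range(3)):
        # an observed parent is conditioned on, so it gets weight 1
        w_a = prior[g_a] if g_known is None else 1.0
        w = w_a * prior[g_b] * p_offspring(g_o, g_a, g_b)
        num += w * (2 * g_o - g_a - g_b)
        den += w
    return num / den
```

For \( {G}_O=2 \) with both parents missing, `expected_x(2, maf)` returns 2(1 − MAF), in agreement with Eq. (2).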

The collapsing method for rare variants can be directly extended to family-based studies with the difference vectors in case-parents data. We denote this statistic by \( {Z}_C \), which is defined as

$$ {Z}_C=\frac{U}{\sqrt{Var(U)}} $$
(3)

where \( U={\mathbf{1}}^T\overline{X} \), \( \mathbf{1}={\left(1,\cdots, 1\right)}^T \) is the k-dimensional vector of ones, \( \overline{X}=\frac{1}{N}{\left(\sum_{j=1}^N{x}_{1j},\sum_{j=1}^N{x}_{2j},\cdots, \sum_{j=1}^N{x}_{kj}\right)}^T \), \( {\overline{x}}_i=\frac{1}{N}\sum_{r=1}^N{x}_{ir} \), \( {\sigma}_{ij}=\frac{1}{N-1}\sum_{r=1}^N\left({x}_{ir}-{\overline{x}}_i\right)\left({x}_{jr}-{\overline{x}}_j\right) \) is the sample covariance between variants i and j (here both i and j index variants and r indexes trios), and \( Var(U)=\frac{1}{N}\sum_{i,j=1}^k{\sigma}_{ij} \). When considering missing parental genotypes, we substitute \( {\tilde{x}}_{ij} \) for \( {x}_{ij} \) and denote the test statistic by \( {\tilde{Z}}_C \),

$$ {\tilde{Z}}_C=\frac{\tilde{U}}{\sqrt{Var\left(\tilde{U}\right)}} $$
(4)
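As an illustration of Eqs. (3) and (4), the Python sketch below computes the collapsing statistic from a matrix of paired differences, reading \( {\sigma}_{ij} \) as the usual sample covariance between variants i and j. Under that reading, the statistic is the mean of the per-trio collapsed scores divided by its standard error, and the second function checks this equivalence. The function names are hypothetical, and the input layout (`x[j][i]` for trio j and variant i) is an assumption of this sketch.

```python
import math
import statistics


def z_c(x):
    """Collapsing statistic; x[j][i] is the (possibly imputed) difference
    for trio j at variant i."""
    n, k = len(x), len(x[0])
    means = [sum(row[i] for row in x) / n for i in range(k)]
    u = sum(means)                                   # U = 1^T X-bar
    # sum of all entries of the k-by-k sample covariance matrix (sigma_ij)
    cov_sum = sum(
        sum((row[i] - means[i]) * (row[j] - means[j]) for row in x) / (n - 1)
        for i in range(k) for j in range(k)
    )
    return u / math.sqrt(cov_sum / n)                # Var(U) = (1/N) sum_ij sigma_ij


def z_c_collapsed(x):
    """Same statistic from the per-trio collapsed scores s_j = sum_i x_ij."""
    s = [sum(row) for row in x]
    return statistics.mean(s) / math.sqrt(statistics.variance(s) / len(s))


x = [[1, 0, 0], [0, -1, 0], [2, 0, 1], [0, 0, 0], [1, 0, -1]]
z = z_c(x)
```

The same functions apply unchanged when rows for trios with missing parents hold the expected differences instead of observed ones.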

In the TDT framework, let \( {b}_{ij} \) be the number of minor alleles transmitted from heterozygous parents to the affected offspring at variant i in the jth trio, and let \( {c}_{ij} \) be the number of major alleles transmitted from heterozygous parents to the affected offspring at variant i in the jth trio. Let \( {b}_i=\sum_j{b}_{ij} \) be the total number of minor alleles transmitted from heterozygous parents to the affected offspring at the ith variant, and let \( {c}_i=\sum_j{c}_{ij} \) be the corresponding total number of major alleles transmitted. The collapsing method for rare variants in the TDT framework (corresponding to \( {TDT}_{\mathrm{BRV}} \) in He et al. 2014 [11]) is

$$ {TDT}_{\mathrm{BRV}}=\frac{{\left({\Sigma}_i^k{b}_i-{\Sigma}_i^k{c}_i\right)}^2}{\Sigma_i^k{b}_i+{\Sigma}_i^k{c}_i} $$
(5)
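For complete trios, the counts entering Eq. (5) can be obtained from unphased genotypes alone. The Python sketch below is one way to do this, under the assumption of a biallelic variant and Mendelian-consistent genotypes: a parent homozygous for the minor allele always transmits a minor allele, so the remaining minor alleles in the offspring must come from heterozygous parents. The function names and the input layout are hypothetical.

```python
def transmissions(g_o, g_f, g_m):
    """Return (b, c) for one complete trio at one biallelic variant.

    b = minor alleles transmitted by heterozygous parents,
    c = major alleles transmitted by heterozygous parents.
    """
    n_het = (g_f == 1) + (g_m == 1)
    n_hom_minor = (g_f == 2) + (g_m == 2)   # these parents always transmit a minor allele
    b = g_o - n_hom_minor
    c = n_het - b
    if not (0 <= b <= n_het):
        raise ValueError("genotypes are not Mendelian-consistent")
    return b, c


def tdt_brv(trios):
    """Statistic of Eq. (5); trios[j][i] = (g_o, g_f, g_m) for trio j at variant i."""
    b_tot = c_tot = 0
    for trio in trios:
        for g_o, g_f, g_m in trio:
            b, c = transmissions(g_o, g_f, g_m)
            b_tot += b
            c_tot += c
    if b_tot + c_tot == 0:
        raise ValueError("no heterozygous parents: statistic undefined")
    return (b_tot - c_tot) ** 2 / (b_tot + c_tot)


toy = [
    [(1, 0, 1), (0, 0, 0)],
    [(1, 1, 0), (1, 0, 1)],
    [(0, 1, 0), (0, 0, 0)],
    [(2, 1, 1), (0, 0, 0)],
]
stat = tdt_brv(toy)   # 5 minor and 1 major transmissions in this toy sample
```

Note that for a complete trio the paired difference satisfies x = b − c, which links this statistic to the difference-based one.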

When considering missing parental genotypes, we define \( {\tilde{b}}_{ij} \) and \( {\tilde{c}}_{ij} \) as follows:

$$ {\tilde{b}}_{ij}=\left\{\begin{array}{c}\ {b}_{ij},\kern6em \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_0\ \\ {}E\left({b}_{ij}|{G}_{Oij},{G}_{Fij}\right),\kern1.4em \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathbf{I}}\\ {}\ E\left({b}_{ij}|{G}_{Oij}\right),\kern3.5em \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathbf{I}\mathbf{I}}\end{array}\right., $$
$$ {\tilde{c}}_{ij}=\left\{\begin{array}{c}{c}_{ij},\kern6em \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_0\\ {}E\left({c}_{ij}|{G}_{Oij},{G}_{Fij}\right),\kern0.5em \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathbf{I}}\\ {}E\left({c}_{ij}|{G}_{Oij}\right),\kern2.75em \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathbf{I}\mathbf{I}}\end{array}\right. $$

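Additional file 1: Table S1 and Additional file 2: Table S2 present all the expectations \( E\left(x|\cdot \right) \), \( E\left(b|\cdot \right) \), and \( E\left(c|\cdot \right) \) when \( \left(\left\{{G}_F,{G}_M\right\},{G}_O\right)\in {\Omega}_{\mathrm{I}} \) and \( \left(\left\{{G}_F,{G}_M\right\},{G}_O\right)\in {\Omega}_{\mathrm{II}} \), respectively. With \( {\tilde{b}}_i=\sum_j{\tilde{b}}_{ij} \) and \( {\tilde{c}}_i=\sum_j{\tilde{c}}_{ij} \), we substitute \( {\tilde{b}}_{ij} \) and \( {\tilde{c}}_{ij} \) for \( {b}_{ij} \) and \( {c}_{ij} \), and denote the resulting test statistic of the \( {TDT}_{\mathrm{BRV}} \) method by \( {\widetilde{TDT}}_{\mathrm{BRV}} \):

$$ {\widetilde{TDT}}_{\mathrm{BRV}}=\frac{{\left({\Sigma}_i^k{\tilde{b}}_i-{\Sigma}_i^k{\tilde{c}}_i\right)}^2}{\Sigma_i^k{\tilde{b}}_i+{\Sigma}_i^k{\tilde{c}}_i} $$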
(6)
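As a consistency check on these definitions, the expectations for trios with both parents missing can be worked out in closed form. The following is a sketch derived here under the same assumptions as Eq. (2) (random mating, Hardy-Weinberg equilibrium, Mendelian transmission); it is not copied from the additional files. Each of the \( {G}_O \) minor alleles carried by the offspring came from a parent whose other allele is major with probability 1 − MAF, and each of the \( 2-{G}_O \) major alleles came from a parent whose other allele is minor with probability MAF, so

$$ E\left(b|{G}_O\right)={G}_O\left(1-\mathrm{MAF}\right),\kern2em E\left(c|{G}_O\right)=\left(2-{G}_O\right)\mathrm{MAF} $$

Because x = b − c for a complete trio, the difference of the two gives

$$ E\left(x|{G}_O\right)=E\left(b|{G}_O\right)-E\left(c|{G}_O\right)={G}_O-2\,\mathrm{MAF} $$

which reduces to 2(1 − MAF) at \( {G}_O=2 \), in agreement with Eq. (2).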

Results

Simulation setting

To assess the performance of our method, we perform a series of simulation studies under a wide range of parameter values. The simulation parameters include the total number of variants (k), the MAF of each variant, the number (q) and effect sizes (measured by the odds ratio (OR)) of causal variants, and the sample size (N) of case-parents trios, made up of the numbers of trios of type \( {\Omega}_0 \) (\( {N}_0 \)), \( {\Omega}_{\mathrm{I}} \) (\( {N}_{\mathrm{I}} \)), and \( {\Omega}_{\mathrm{II}} \) (\( {N}_{\mathrm{II}} \)). We simulate two populations.

In the first population, 1000 case-parents families are generated and the parameters are chosen as follows: k = 20; q = 0.2k, 0.4k, 0.6k, 0.8k; and MAF drawn from the uniform distribution on (0.001, 0.01) for each variant. Under the null hypothesis of no association, we set OR = 1 for all the variants. Under the alternative hypothesis of association, we set OR = 1 for non-causal variants and consider two scenarios for the effects of causal variants. First, the causal variants have effects in the same positive direction but of different sizes; here we set OR ∈ [1.2, 3] in arithmetic progression. Second, the causal variants have opposite effects; here we set OR ∈ [0.2, 0.9] ∪ [1.2, 3], with half of the causal variants in [0.2, 0.9] and the other half in [1.2, 3], in arithmetic progression.
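The odds-ratio settings described above can be sketched as follows. This Python sketch makes several assumptions that the text does not spell out: the endpoints of each interval are included, the causal variants are placed first among the k variants, and for opposite effects the protective half is listed before the risk half. The function names are hypothetical.

```python
def arithmetic_progression(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive (n == 1 returns [lo])."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + step * i for i in range(n)]


def odds_ratios(k, q, opposite=False):
    """ORs for k variants: the first q are causal, the remaining k - q have OR = 1."""
    if opposite:
        n_protective = q // 2
        causal = (arithmetic_progression(0.2, 0.9, n_protective)
                  + arithmetic_progression(1.2, 3.0, q - n_protective))
    else:
        causal = arithmetic_progression(1.2, 3.0, q)
    return causal + [1.0] * (k - q)


same_direction = odds_ratios(20, 4)               # q = 0.2k
opposite_effects = odds_ratios(20, 8, opposite=True)
```

Setting every OR to 1 instead gives the null-hypothesis configuration.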

In the second population, 500 case-parents families and a number of unaffected individuals are generated (here, 500 unaffected individuals are generated and are used to estimate the MAF when samples come from the second population). The parameter settings are similar to those in the first population except for the ORs of causal variants under the alternative hypothesis: we let the OR of each causal variant in the second population be 0.1 less than that in the first population.

Once the parameter values are chosen, we first generate parental haplotypes based on a latent variable \( Z=\left({Z}_1,\cdots, {Z}_k\right) \) from a multivariate normal distribution with standard normal marginals and the covariance structure described below [16, 17]: if variants i and j are both causal or both non-causal, the correlation is set to \( \mathrm{Corr}\left({Z}_i,{Z}_j\right)={0.4}^{\mid i-j\mid } \); otherwise the correlation is zero. We transform \( {Z}_i \) to 0 (major allele) or 1 (minor allele) according to the MAF of the ith variant and combine two haplotypes to obtain each parent's genotype [16, 17]. Offspring haplotypes are generated from the parental haplotypes assuming no recombination between the variants. The disease status of an offspring is determined by the following logistic model [18]:

$$ P\left( Affected|{G}_{Oij},i=1,\cdots, k\right)=\frac{1}{1+\exp \left(-\gamma \right)}, $$
$$ \gamma =\ln \left(\frac{c}{1-c}\right)+\underset{i=1}{\overset{k}{\Sigma}}\ln \left(\mathrm{O}{\mathrm{R}}_i\right)\cdot {G}_{Oij} $$

where \( {\mathrm{OR}}_i \) is the odds ratio of the ith variant, \( {G}_{Oij} \) is the minor allele count carried by the offspring in the jth trio at the ith variant, and c is the background prevalence of being affected for a subject with no minor alleles. Here, we let c = 0.01 in the first population and c = 0.008 in the second population.
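The data-generating steps above (latent normal haplotypes, transmission without recombination, logistic disease model) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' simulation code: the causal variants are assumed to occupy the first q positions, so that each block is contiguous, and the within-block correlation \( {0.4}^{\mid i-j\mid } \) is produced by a first-order autoregressive recursion rather than by a general multivariate normal sampler. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def latent_block(size, rho, rng):
    """Standard normal values with Corr(Z_i, Z_j) = rho ** |i - j| (AR(1) recursion)."""
    z = []
    for _ in range(size):
        eps = rng.gauss(0.0, 1.0)
        z.append(eps if not z else rho * z[-1] + math.sqrt(1 - rho ** 2) * eps)
    return z


def simulate_haplotype(mafs, q, rng, rho=0.4):
    """One haplotype of 0/1 alleles; the first q variants are assumed causal."""
    # independent blocks give zero correlation between causal and non-causal variants
    z = latent_block(q, rho, rng) + latent_block(len(mafs) - q, rho, rng)
    # minor allele (1) when the latent value falls below the MAF quantile
    return [int(z_i < NormalDist().inv_cdf(maf)) for z_i, maf in zip(z, mafs)]


def disease_probability(genotypes, odds_ratios, baseline):
    """P(affected | genotypes) under the logistic model, with baseline prevalence c."""
    gamma = math.log(baseline / (1 - baseline))
    gamma += sum(math.log(o) * g for o, g in zip(odds_ratios, genotypes))
    return 1 / (1 + math.exp(-gamma))


def simulate_trio(mafs, odds_ratios, q, baseline, rng):
    """Draw two parents and one offspring; return (g_f, g_m, g_o, affected)."""
    father = [simulate_haplotype(mafs, q, rng) for _ in range(2)]
    mother = [simulate_haplotype(mafs, q, rng) for _ in range(2)]
    # each parent transmits one whole haplotype (no recombination)
    from_f, from_m = rng.choice(father), rng.choice(mother)
    g_f = [a + b for a, b in zip(*father)]
    g_m = [a + b for a, b in zip(*mother)]
    g_o = [a + b for a, b in zip(from_f, from_m)]
    affected = rng.random() < disease_probability(g_o, odds_ratios, baseline)
    return g_f, g_m, g_o, affected
```

A case-parents sample would be collected by calling `simulate_trio` repeatedly and keeping only the trios whose offspring is affected.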

The 1000 case-parents trios in the first population are composed of three types of trios: 500 (\( {N}_0 \)) for \( {\Omega}_0 \), 250 for \( {\Omega}_{\mathrm{I}} \) obtained by randomly discarding one parental genotype, and 250 for \( {\Omega}_{\mathrm{II}} \) obtained by discarding both parental genotypes. There are two types of trios among the 500 case-parents trios in the second population: 250 for \( {\Omega}_{\mathrm{I}} \) and 250 for \( {\Omega}_{\mathrm{II}} \). In our analysis, we fix \( {N}_0 \) (= 500) and vary \( {N}_{\mathrm{I}} \) and \( {N}_{\mathrm{II}} \), letting each take the values \( \frac{1}{10}{N}_0 \), \( \frac{1}{5}{N}_0 \), and \( \frac{1}{2}{N}_0 \). We calculate \( {Z}_C \) and \( {TDT}_{\mathrm{BRV}} \) in \( {\Omega}_0 \), and \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \) in \( {\Omega}_{0+\mathrm{I}} \), \( {\Omega}_{0+\mathrm{II}} \), and \( {\Omega}_{0+\mathrm{I}+\mathrm{II}} \). The p-value of each test is estimated by a permutation procedure as follows. We first calculate the statistic from the observed data, and then recalculate it on permuted data obtained by randomly changing the signs (positive or negative) of \( {x}_{ij} \) for \( {\tilde{Z}}_C \) and by randomly permuting the “transmitted” and “not transmitted” labels with equal probability for \( {\widetilde{TDT}}_{\mathrm{BRV}} \). We repeat this process 1000 times, and the p-value is estimated as the proportion of permutation-based statistics that are larger than the data-based statistic. For a given significance level α, the power or type I error rate is estimated as the proportion of 1000 replicates in which the null hypothesis is rejected, that is, in which the p-value is ≤ α.
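The sign-change permutation procedure for the difference-based statistic can be sketched as below. Two choices in this Python sketch are assumptions rather than details given in the text: signs are flipped per trio (the whole row of differences at once), and the comparison uses the absolute value of the statistic so that the test is two-sided. The sketch also assumes that the per-trio collapsed scores are not all identical, since the statistic is undefined otherwise. Function names are hypothetical.

```python
import math
import random
import statistics


def z_stat(x):
    """Collapsed Z statistic: mean of per-trio sums divided by its standard error."""
    s = [sum(row) for row in x]
    return statistics.mean(s) / math.sqrt(statistics.variance(s) / len(s))


def sign_flip_p_value(x, n_perm=1000, seed=0):
    """Permutation p-value: share of sign-flipped data sets whose |Z| is larger
    than the observed |Z|. x[j][i] is the difference for trio j at variant i."""
    rng = random.Random(seed)
    observed = abs(z_stat(x))
    exceed = 0
    for _ in range(n_perm):
        flipped = []
        for row in x:
            sign = rng.choice((-1, 1))          # one random sign per trio
            flipped.append([sign * v for v in row])
        if abs(z_stat(flipped)) > observed:
            exceed += 1
    return exceed / n_perm


signal = [[1, 0], [1, 1], [0, 3], [2, 2], [1, 4], [3, 3], [2, 5], [4, 4]]
balanced = [[1], [-1], [2], [-2], [3], [-3]]
p_signal = sign_flip_p_value(signal)
p_balanced = sign_flip_p_value(balanced)
```

Repeating this over many simulated data sets and recording how often the p-value is at most α gives the empirical power or type I error rate described above.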

Type I error rates and power

We investigate the performance of our method in a homogeneous population and in populations with population stratification. For the homogeneous population, all samples come from the first population. For the population stratification, case-parents trios with missing parental genotypes come from the second population. We present in Table 1 the type I error rates at α = 0.05 and 0.001. As shown in Table 1, for the three situations \( {\Omega}_{0+\mathrm{I}} \), \( {\Omega}_{0+\mathrm{II}} \), and \( {\Omega}_{0+\mathrm{I}+\mathrm{II}} \), the type I error rates are well controlled around the nominal levels. This indicates the validity of the method when trios with one missing parental genotype or case-only data are included, even in the presence of population stratification.

Table 1 Type I error rate of \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \)

We present in Tables 2, 3 and 4 the power of \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \) in the homogeneous population for the three situations \( {\Omega}_{0+\mathrm{I}} \), \( {\Omega}_{0+\mathrm{II}} \), and \( {\Omega}_{0+\mathrm{I}+\mathrm{II}} \), respectively, when causal variants have effects in the same positive direction but of different sizes, or have opposite effects. We can see from Table 2 that, when causal variants have different effects in the same direction and the proportion of non-causal variants is 80% or 60%, adding case-parents trios of type \( {\Omega}_{\mathrm{I}} \) to the complete case-parents data set increases the power of \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \) for rare variant association analysis. For example, when there are 80% non-causal variants, adding \( \frac{1}{10}{N}_0 \) (50), \( \frac{1}{5}{N}_0 \) (100), and \( \frac{1}{2}{N}_0 \) (250) trios of type \( {\Omega}_{\mathrm{I}} \) to 500 complete case-parents trios improves the powers of \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \) from 0.408 and 0.602 to 0.566 and 0.712, to 0.674 and 0.748, and to 0.752 and 0.784, respectively. We observed that, although the power of \( {\tilde{Z}}_C \) is lower than that of \( {\widetilde{TDT}}_{\mathrm{BRV}} \) when only complete case-parents data are used, adding \( \frac{1}{2}{N}_0 \) trios of type \( {\Omega}_{\mathrm{I}} \) to the complete trios helps \( {\tilde{Z}}_C \) achieve power similar to that of \( {\widetilde{TDT}}_{\mathrm{BRV}} \). We also noted that, when the proportion of non-causal variants is small (40% or 20%), the two statistics already have high power with 500 complete case-parents trios alone, so adding trios of type \( {\Omega}_{\mathrm{I}} \) does not improve power further. When we decrease the sample size to 200, adding trios of type \( {\Omega}_{\mathrm{I}} \) can still improve the power of \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \) (data not shown). When causal variants have opposite effects, we also observed that adding trios of type \( {\Omega}_{\mathrm{I}} \) can improve the statistical power.

Table 2 Empirical power at the 0.05 significance level for \( {\Omega}_{0+\mathrm{I}} \) in the homogeneous population
Table 3 Empirical power at the 0.05 significance level for \( {\Omega}_{0+\mathrm{II}} \) in the homogeneous population
Table 4 Empirical power at the 0.05 significance level for \( {\Omega}_{0+\mathrm{I}+\mathrm{II}} \) in the homogeneous population

To further show the magnitude of the power improvement of \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \), we present in parentheses in Tables 2, 3 and 4 the proportion of power improved by adding case-parents trios of type \( {\Omega}_{\mathrm{I}} \), \( {\Omega}_{\mathrm{II}} \), and \( {\Omega}_{\mathrm{I}+\mathrm{II}} \) to the complete case-parents data set \( {\Omega}_0 \). Table 2 shows that the proportion of power improvement drops with a decrease in the number of non-causal variants, and that the proportion of power improvement for \( {\tilde{Z}}_C \) is higher than that for \( {\widetilde{TDT}}_{\mathrm{BRV}} \). When causal variants have opposite effects, we observed that the proportion of power improvement is larger than when causal variants have effects in the same direction. As the proportion of non-causal variants decreases from 80% to 60%, the proportion of power improvement increases. For example, as the number of \( {\Omega}_{\mathrm{I}} \) trios increases from \( \frac{1}{10}{N}_0 \) to \( \frac{1}{2}{N}_0 \), the powers of \( {\widetilde{TDT}}_{\mathrm{BRV}} \) and \( {\tilde{Z}}_C \) for 80% non-causal variants improve by 20.4% to 38.1% and by 62.9% to 117%, respectively, whereas those for 60% non-causal variants improve by 25.8% to 61.3% and by 89.0% to 239%, respectively. However, the proportion of power improvement drops with a further decrease in the number of non-causal variants. For example, as the number of \( {\Omega}_{\mathrm{I}} \) trios increases from \( \frac{1}{10}{N}_0 \) to \( \frac{1}{2}{N}_0 \), the proportion of power improvement of \( {\widetilde{TDT}}_{\mathrm{BRV}} \) changes from 7.58% to 36.4% for 40% non-causal variants and from 5.76% to 35.2% for 20% non-causal variants, and that of \( {\tilde{Z}}_C \) changes from 34.0% to 96.6% for 40% non-causal variants and from 31.3% to 98.7% for 20% non-causal variants. This result indicates that, when causal variants have opposite effects, the proportion of power improvement first increases and then decreases as the number of non-causal variants increases. Adding trios of type \( {\Omega}_{\mathrm{I}} \) is therefore most beneficial when the number of non-causal variants is intermediate. For \( {\Omega}_{0+\mathrm{II}} \) and \( {\Omega}_{0+\mathrm{I}+\mathrm{II}} \), Tables 3 and 4 show results similar to those for \( {\Omega}_{0+\mathrm{I}} \). In addition, we observed that the proportion of power improvement for \( {\Omega}_{0+\mathrm{I}} \) is the largest among the three situations \( {\Omega}_{0+\mathrm{I}} \), \( {\Omega}_{0+\mathrm{II}} \), and \( {\Omega}_{0+\mathrm{I}+\mathrm{II}} \).
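The "proportion of power improved" can be illustrated with the powers quoted earlier for 80% non-causal variants and same-direction effects. The Python sketch below assumes that the proportion is defined as the gain relative to the complete-trio baseline, (new − baseline) / baseline; the text does not state the formula explicitly, and the parenthesized values of Tables 2, 3 and 4 are not reproduced here. The function name is hypothetical.

```python
def relative_improvement(baseline, augmented):
    """Proportion of power gained relative to the complete-trio baseline."""
    return (augmented - baseline) / baseline


# Powers quoted in the text: 500 complete trios, then +50, +100, +250 trios
# with one missing parental genotype
z_base, z_aug = 0.408, [0.566, 0.674, 0.752]         # difference-based statistic
tdt_base, tdt_aug = 0.602, [0.712, 0.748, 0.784]     # TDT-based statistic

z_gain = [relative_improvement(z_base, p) for p in z_aug]
tdt_gain = [relative_improvement(tdt_base, p) for p in tdt_aug]
```

With these inputs the relative gain of the difference-based statistic exceeds that of the TDT-based statistic at every added sample size, which matches the qualitative statement in the text.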

When there is population stratification, Additional file 3: Figures S1-S4 show the power of \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \) against the sample size for various proportions of non-causal variants under the three situations \( {\Omega}_{0+\mathrm{I}} \), \( {\Omega}_{0+\mathrm{II}} \), and \( {\Omega}_{0+\mathrm{I}+\mathrm{II}} \). The results are similar to those in the homogeneous population. We also consider a more general situation for population stratification, in which samples from both populations consist of case-parents trios of all three types, \( {\Omega}_0 \), \( {\Omega}_{\mathrm{I}} \), and \( {\Omega}_{\mathrm{II}} \). The simulation results are similar to those in Additional file 3: Figures S1-S4 (data not shown). These results indicate that, when case-parents trios with missing parental genotypes, or even case-only data, are added to a complete case-parents data set, population stratification does not affect the power of these two statistics for rare variant association analysis.

Discussion

In this report, we considered case-parents data with missing parental genotypes for rare variant association analysis in case-parents studies. Based on the collapsing method with the difference vector and on the TDT framework, we presented two statistics, \( {\tilde{Z}}_C \) and \( {\widetilde{TDT}}_{\mathrm{BRV}} \), that allow for missing parental genotypes. The key to the proposed approach is estimating the MAF. In clinical research, the experimental design usually targets a homogeneous population or several specific populations, so one can use the known parental genotypes or the background population of the samples to estimate the MAF. We investigated the performance of these two statistics in three different situations: complete case-parents data mixed with trios missing one parental genotype, complete case-parents data mixed with trios missing both parental genotypes, and complete case-parents data mixed with trios missing one or both parental genotypes. Through simulation studies, we found that adding case-parents data with missing parental genotypes to a complete case-parents data set can greatly improve the power of these two statistics, though the proportion of power improvement varied. Additionally, our strategy is not affected by population stratification.

In most studies of disease associations with rare variants, family- and population-based samples were used separately [6, 7, 11, 19, 20]. Although family-based studies have several advantages over population-based studies in rare variant association analysis, it is often difficult to recruit sufficiently large family-based samples, especially for rare diseases. More often, information about parents is incomplete, and this poses some challenges in analysis. Discarding families with missing parental genotypes further reduces the sample size and results in a loss of statistical power. In our strategy, case-parents trios missing one or both parental genotypes are kept in the analysis and thus can help to greatly improve statistical power. Furthermore, missing both parental genotypes corresponds to case only. This means we can use unrelated affected individuals in case-parents studies, which is useful for case-parents studies with small sample sizes. Although population stratification might exist in unrelated affected individuals recruited from population-based samples, our strategy is not affected by population stratification. Our simulation results showed that combining unrelated affected individuals with complete case-parents data could increase power by 5 to 60% for \( {\widetilde{TDT}}_{\mathrm{BRV}} \) and by 20 to 200% for \( {\tilde{Z}}_C \) in both homogeneous populations and populations with population stratification.

In addition to allowing for missing parental genotypes, our method can be used to address another problem when there are missing genotypes for individual variants in parental data. In fact, when individual variants are analyzed and there are missing genotypes for some variants, removing those samples for variants with missing genotypes will result in inconsistency of the sample size. With the strategy described above, our method can overcome this problem. However, our strategy is not suitable for case-parents trios with missing offspring genotypes, so further study is needed to address such scenarios.

Conclusions

The proposed strategy allowing for missing parental genotypes, or even adding unrelated affected individuals, can greatly improve the statistical power for rare variant-disease association and meanwhile is not affected by population stratification.

Abbreviations

BRV: Burden of rare variants

CMC: Combined multivariate and collapsing

LD: Linkage disequilibrium

MAF: Minor allele frequency

OR: Odds ratio

TDT: Transmission/disequilibrium test

VT: Variable threshold

WSS: Weighted sum statistic

References

  1. Ott J, Kamatani Y, Lathrop M. Family-based designs for genome-wide association studies. Nat Rev Genet. 2011;12(7):465–74.

  2. Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nat Genet. 2012;44(3):243–6.

  3. Liu J, Lewinger JP, Gilliland FD, Gauderman WJ, Conti DV. Confounding and heterogeneity in genetic association studies with admixed populations. Am J Epidemiol. 2013;177(4):351–60.

  4. He Z, Zhang D, Renton AE, Li B, Zhao L, Wang GT, Goate AM, Mayeux R, Leal SM. The rare-variant generalized disequilibrium test for association analysis of nuclear and extended pedigrees with application to alzheimer disease WGS data. Am J Hum Genet. 2017;100(2):193–204.

  5. Shi M, Umbach DM, Weinberg CR. Identification of risk-related haplotypes with the use of multiple SNPs from nuclear families. Am J Hum Genet. 2007;81(1):53–66.

  6. Li B, Leal SM. Methods for detecting association with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.

  7. Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet. 2011;89(3):354–67.

  8. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384.

  9. Price AL, Kryukov GV, Bakker PIW, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–8.

  10. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52(3):506–16.

  11. He Z, O’Roak BJ, Smith JD, Wang G, Hooker S, Santos-Cortez RLP, Li B, Kan M, Krumm N, Nickerson DA, Shendure J, Eichler EE, Leal SM. Rare-variant extensions of the transmission disequilibrium test: application to autism exome sequence data. Am J Hum Genet. 2014;94(1):33–46.

  12. McIntyre LM, Martin ER, Simonsen KL, Kaplan NL. Circumventing multiple testing: a multilocus Monte Carlo approach to testing for association. Genet Epidemiol. 2000;19(1):18–29.

  13. Li YM, Xiang Y. Detecting disease association with rare variants in case-parents studies. J Hum Genet. 2017;62(5):549–52.

  14. Allen AS, Rathouz PJ, Satten GA. Informative missingness in genetic association studies: case-parent designs. Am J Hum Genet. 2003;72(3):671–80.

  15. Sebastiani P, Abad MM, Alpargu G, Ramoni MF. Robust transmission/disequilibrium test for incomplete family genotypes. Genetics. 2004;168(4):2329–37.

  16. Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol. 2011;35(7):606–19.

  17. Sun L, Wang C, Hu YQ. Utilizing mutual information for detecting rare and common variants associated with a categorical trait. PeerJ. 2016;4:e2139.

  18. Preston MD, Dudbridge F. Utilising family-based designs for detecting rare variant disease associations. Ann Hum Genet. 2014;78(2):129–40.

  19. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.

  20. Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Family-based association tests for sequence data, and comparisons with population-based association tests. Eur J Hum Genet. 2013;21(10):1158–62.

Acknowledgments

LYM was partially supported by National Natural Science Foundation of China (11301206), Scientific Research Fund of Hunan Provincial Education Department (16A166), Hunan Provincial Natural Science Foundation of China (2017JJ2212), and China Scholarship Council (National cooperation fund of Hunan Province). HWD was partially supported by grants from the National Institutes of Health (R01AR057049, R01AR059781, D43TW009107, P20 GM109036, R01MH107354, R01MH104680, R01GM109068, R01AR069055) and the Edward G. Schlieder Endowment fund to Tulane University. The authors appreciate the assistance of Loula Burton, Office of Research at Tulane University, in editing the manuscript.

Funding

This work was financially supported by the National Natural Science Foundation of China (11301206), the Scientific Research Fund of Hunan Provincial Education Department (16A166), and the Hunan Provincial Natural Science Foundation of China (2017JJ2212).

Availability of data and materials

All data generated or analysed during this study are included in this published article.

Author information

Contributions

LYM conceived the idea, designed the study, and wrote the manuscript. XY developed the statistical method. XC, ShH, and DHW revised the manuscript. All authors have read and approved the final version of the manuscript.

Corresponding authors

Correspondence to Yumei Li or Hongwen Deng.

Ethics declarations

Ethics approval and consent to participate

This study did not directly involve humans, animals, or plants, so no consent to participate was required.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1: Table S1.

All the expectations of E(x), E(b), and E(c) when (GF, GM, GO) ∈ ΩI. (PDF 109 kb)

Additional file 2: Table S2.

All the expectations of E(x), E(b), and E(c) when (GF, GM, GO) ∈ ΩII. (PDF 88 kb)

Additional file 3: Figure S1.

Empirical power against the sample size at the 0.05 significance level in population stratification when there are 20% non-causal variants. Note: A and B are for \( {\tilde{Z}}_C \), and C and D are for \( {TDT}_{\mathrm{BRV}} \), when causal variants have different effects with the same direction and when causal variants have opposite effects, respectively. The sample size N = N0, N0 + 1/10 N0, N0 + 1/5 N0, N0 + 1/2 N0 with N0 = 500 is denoted by 0, 1/10, 1/5, and 1/2, respectively. Ω0 + I (), Ω0 + II (*), Ω0 + I + II (+).

Figure S2. Empirical power against the sample size at the 0.05 significance level in population stratification when there are 40% non-causal variants. Note: A and B are for \( {\tilde{Z}}_C \), and C and D are for \( {TDT}_{\mathrm{BRV}} \), when causal variants have different effects with the same direction and when causal variants have opposite effects, respectively. The sample size N = N0, N0 + 1/10 N0, N0 + 1/5 N0, N0 + 1/2 N0 with N0 = 500 is denoted by 0, 1/10, 1/5, and 1/2, respectively. Ω0 + I (), Ω0 + II (*), Ω0 + I + II (+).

Figure S3. Empirical power against the sample size at the 0.05 significance level in population stratification when there are 60% non-causal variants. Note: A and B are for \( {\tilde{Z}}_C \), and C and D are for \( {TDT}_{\mathrm{BRV}} \), when causal variants have different effects with the same direction and when causal variants have opposite effects, respectively. The sample size N = N0, N0 + 1/10 N0, N0 + 1/5 N0, N0 + 1/2 N0 with N0 = 500 is denoted by 0, 1/10, 1/5, and 1/2, respectively. Ω0 + I (), Ω0 + II (*), Ω0 + I + II (+).

Figure S4. Empirical power against the sample size at the 0.05 significance level in population stratification when there are 80% non-causal variants. Note: A and B are for \( {\tilde{Z}}_C \), and C and D are for \( {TDT}_{\mathrm{BRV}} \), when causal variants have different effects with the same direction and when causal variants have opposite effects, respectively. The sample size N = N0, N0 + 1/10 N0, N0 + 1/5 N0, N0 + 1/2 N0 with N0 = 500 is denoted by 0, 1/10, 1/5, and 1/2, respectively. Ω0 + I (), Ω0 + II (*), Ω0 + I + II (+). (PDF 89 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Li, Y., Xiang, Y., Xu, C. et al. Rare variant association analysis in case-parents studies by allowing for missing parental genotypes. BMC Genet 19, 7 (2018). https://doi.org/10.1186/s12863-018-0597-8

  • DOI: https://doi.org/10.1186/s12863-018-0597-8

Keywords