Rare variant association analysis in case-parents studies by allowing for missing parental genotypes

Li, Yumei; Xiang, Yang; Xu, Chao; Shen, Hui; Deng, Hongwen

doi:10.1186/s12863-018-0597-8

Methodology article
Open access
Published: 15 January 2018

Yumei Li^1,2 (ORCID: orcid.org/0000-0003-1755-4504), Yang Xiang^1, Chao Xu^2, Hui Shen^2 & Hongwen Deng^2,3

BMC Genetics volume 19, Article number: 7 (2018)

Abstract

Background

The development of next-generation sequencing technologies has facilitated the identification of rare variants. Family-based design is commonly used to effectively control for population admixture and substructure, which is more prominent for rare variants. Case-parents studies, as typical strategies in family-based design, are widely used in rare variant-disease association analysis. Current methods in case-parents studies are based on complete case-parents data; however, parental genotypes may be missing in case-parents trios, and removing these data may lead to a loss in statistical power. The present study focuses on testing for rare variant-disease association in case-parents study by allowing for missing parental genotypes.

Results

In this report, we extended the collapsing method for rare variant association analysis in case-parents studies to allow for missing parental genotypes, and investigated the performance of two methods: one based on the difference of genotypes between affected offspring and their corresponding “complements” in case-parents trios, and one based on the TDT framework. Using simulations, we showed that, compared with methods that use only complete case-parents data, the proposed strategy of allowing for missing parental genotypes, or even adding unrelated affected individuals, can greatly improve statistical power while remaining unaffected by population stratification.

Conclusions

We conclude that adding case-parents data with missing parental genotypes to complete case-parents data set can greatly improve the power of our strategy for rare variant-disease association.

Background

The development of next-generation sequencing technologies has facilitated association studies of rare variants (minor allele frequency (MAF) < 1%). Family-based design, as an important strategy in genetic studies (especially for rare variants) for human complex diseases, has some advantages over population-based design [1, 2]. The most prominent advantage is that many family-based association methods can effectively control for population admixture and substructure which is more prominent for rare variants and thus avoid spurious associations due to population admixture or substructure [3, 4]. Moreover, family-based design can be used to study complex mechanisms, such as parent-of-origin effects and maternally mediated genetic effects, which are difficult to detect with unrelated individuals in population-based design [5]. Case-parents study, as a typical strategy in family-based design, is widely used in rare variant-disease association analysis. For example, combined multivariate and collapsing (CMC) [6], weighted sum statistic (WSS) [7, 8], variable threshold (VT) [9], and the burden of rare variants (BRV) have all been extended into the transmission/disequilibrium test (TDT) [10] framework [11]. Another commonly used method in case-parents study is to treat nontransmitted genotypes of parents to affected offspring as control (also called pseudocontrols or complements) of affected offspring [5, 12, 13]. For example, investigators can construct a difference vector by comparing the genotypes of affected offspring with their corresponding “complements” and use the collapsing method [6, 7] to detect rare causal variants.

A problem with case-parents studies is that, in practice, the genotypes of one or both parents are not always available. For example, parents may have died, especially for older patients with late-onset diseases such as Alzheimer disease and hypertension, or parents may decline to participate in clinical research. It is often difficult to recruit sufficiently large samples for a case-parents study, especially for rare diseases, and thus the sample size is generally small. Discarding families in which one or both parental genotypes are missing can lead to a loss of statistical power. Statistical methods for case-parents studies that allow for missing parental genotypes have been widely developed for common variant-disease association analysis [14, 15]. However, few works discuss rare variant-disease association in case-parents studies when parental genotypes are missing. Because a trio missing both parental genotypes reduces to a case only (an unrelated affected individual), allowing for trios with one missing parental genotype or for cases only increases the sample size of a case-parents study and thus may enhance statistical power for rare variant association analysis. Therefore, it is useful to develop statistical approaches for case-parents studies that allow for missing parental genotypes when testing rare variant-disease association.

In this report, we extend the collapsing method for rare variant association analysis to case-parents studies by using the genotype difference between affected offspring and their corresponding “complements” in case-parents trios and by using the TDT framework. Our strategy allows one or both parental genotypes to be missing (the latter corresponding to case only). We develop our strategy in homogeneous populations. Through simulation studies, we investigate the performance of the proposed method in a homogeneous population as well as in populations with population stratification under three scenarios: complete case-parents data mixed with trios missing one parental genotype, complete case-parents data mixed with trios missing both parental genotypes, and complete case-parents data mixed with trios missing one or both parental genotypes.

Methods

In this study, all datasets were publicly available and no research requiring ethics approval was conducted.

Notation

Consider a data set Ω = {Ω₀, Ω_Ι, Ω_ΙΙ} from a homogeneous population, consisting of three types of case-parents trios with the genotype of the affected offspring known in each family. Ω₀, Ω_Ι, and Ω_ΙΙ denote the sets of case-parents trios with 0, 1, and 2 missing parental genotypes, respectively. We consider three combinations of Ω₀, Ω_Ι, and Ω_ΙΙ: Ω_0+Ι = {Ω₀, Ω_Ι}, Ω_0+ΙΙ = {Ω₀, Ω_ΙΙ}, and Ω_0+Ι+ΙΙ = {Ω₀, Ω_Ι, Ω_ΙΙ}. Ω_0+Ι is the sample data set consisting of complete case-parents trios with known genotypes for every member of the trio (type Ω₀) and case-parents trios with one missing parental genotype (type Ω_Ι). Ω_0+ΙΙ is the sample data set consisting of type Ω₀ trios and type Ω_ΙΙ trios, in which the genotypes of both parents are missing. Ω_0+Ι+ΙΙ includes sample data of all three types, Ω₀, Ω_Ι, and Ω_ΙΙ. We assume that N case-parents trios are sampled, with N₀, N_I, and N_II trios for Ω₀, Ω_Ι, and Ω_ΙΙ, respectively (N = N₀ + N_I + N_II). Let G_O be the minor allele count carried by the affected offspring. Let {G_F, G_M} be the minor allele counts carried by the parents in a case-parents trio. The curly braces indicate set notation rather than ordered pairs. For example, {G_F, G_M} = {1, 2} means G_F = 1, G_M = 2 or G_F = 2, G_M = 1. Let the triplet ({G_F, G_M}, G_O) denote a case-parents trio.
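The notation above can be mirrored in a small data structure. The following is a minimal sketch, not the authors' code: the names `Trio`, `trio_type`, and `count_types` are hypothetical, and a missing parental genotype is assumed to be encoded as `None`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trio:
    """One case-parents trio at a single variant (minor allele counts 0, 1, 2)."""
    offspring: int                 # G_O, always observed
    father: Optional[int] = None   # G_F, None if missing (assumed encoding)
    mother: Optional[int] = None   # G_M, None if missing (assumed encoding)


def trio_type(trio):
    """Return 'Omega_0', 'Omega_I', or 'Omega_II' by number of missing parents."""
    missing = (trio.father is None) + (trio.mother is None)
    return ("Omega_0", "Omega_I", "Omega_II")[missing]


def count_types(trios):
    """Return the counts N_0, N_I, N_II for a list of trios."""
    counts = Counter(trio_type(t) for t in trios)
    return counts["Omega_0"], counts["Omega_I"], counts["Omega_II"]
```

With this encoding, a case-only individual is simply a `Trio` whose two parental fields are both `None`.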

Rare variants association analysis

Let x = 2G_O − G_F − G_M be the paired difference in genotypes between the affected offspring and the complement (pseudo-control), whose genotype is G_F + G_M − G_O. We consider k variants, q of which are causal, in a region of interest, e.g., a gene region. The variants and case-parents trios are indexed by i and j (i = 1, ⋯, k; j = 1, 2, ⋯, N), respectively. We redefine the paired difference $ {\tilde{x}}_{ij} $ for the jth trio at the ith variant as follows,

$$ {\tilde{x}}_{ij}=\begin{cases}{x}_{ij}, & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_0\\ E\left({x}_{ij}|{G}_{Oij},{G}_{Fij}\right)\ \mathrm{or}\ E\left({x}_{ij}|{G}_{Oij},{G}_{Mij}\right), & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathrm{I}}\\ E\left({x}_{ij}|{G}_{Oij}\right), & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathrm{II}}\end{cases} $$

(1)

We can calculate E(x|⋅) under the assumption of random mating (thus Hardy-Weinberg equilibrium) and with the rule of genetic inheritance if ({G_F , G_M}, G_O) ∈ Ω_Ι or Ω_ΙΙ. For example, when ({G_F , G_M}, G_O) ∈ Ω_ΙΙ, P{{G_F , G_M} = {1, 1}| G_O = 2} = (1 − MAF)², P{{G_F , G_M} = {1, 2}| G_O = 2} = 2MAF ⋅ (1 − MAF), and P{{G_F , G_M} = {2, 2}| G_O = 2} = MAF², then P{x = 0| G_O = 2} = P{{G_F , G_M} = {2, 2}| G_O = 2} = MAF², P{x = 1| G_O = 2} = P{{G_F , G_M} = {1, 2}| G_O = 2} = 2MAF ⋅ (1 − MAF), and P{x = 2| G_O = 2} = P{{G_F , G_M} = {1, 1}| G_O = 2} = (1 − MAF)². Thus

$$ E\left(x|{G}_O=2\right)=\sum_{i=0}^{2}P\left\{x=i|{G}_O=2\right\}\cdot i=2\,\mathrm{MAF}\left(1-\mathrm{MAF}\right)+2{\left(1-\mathrm{MAF}\right)}^2=2\left(1-\mathrm{MAF}\right) $$

(2)

We use the known parental genotypes or samples from the background population to estimate the MAF. The other values of E(x| ⋅) can be calculated similarly to Eq. (2).
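The calculation behind Eq. (2) can be checked numerically by enumerating the unobserved parental genotypes. The sketch below is an illustration under the stated assumptions (Hardy-Weinberg proportions for the parents and Mendelian transmission), not the authors' implementation; the function names are hypothetical.

```python
from itertools import product


def offspring_prob(g_o, g_f, g_m):
    """P(offspring minor allele count g_o | parental counts g_f, g_m)."""
    total = 0.0
    for a_f, a_m in product((0, 1), repeat=2):  # allele passed by each parent
        if a_f + a_m == g_o:
            p_f = g_f / 2 if a_f else 1 - g_f / 2
            p_m = g_m / 2 if a_m else 1 - g_m / 2
            total += p_f * p_m
    return total


def expected_x(g_o, maf, g_f=None):
    """E(x | G_O) if g_f is None (both parents missing),
    otherwise E(x | G_O, G_F) (one parent missing), with x = 2*G_O - G_F - G_M."""
    hwe = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    father_values = range(3) if g_f is None else (g_f,)
    num = den = 0.0
    for f, m in product(father_values, range(3)):
        # Weight each parental configuration by its prior and the likelihood of g_o.
        weight = hwe[m] * offspring_prob(g_o, f, m)
        if g_f is None:
            weight *= hwe[f]
        num += weight * (2 * g_o - f - m)
        den += weight
    if den == 0:
        raise ValueError("observed genotypes are not Mendelian-consistent")
    return num / den
```

For example, `expected_x(2, 0.01)` reproduces the value 2(1 − MAF) of Eq. (2). The one-parent case depends on the same MAF estimate, which is why that estimate is central to the approach.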

The collapsing method for rare variants can be directly extended to family-based studies by using the difference vectors in case-parents data. We denote this statistic by Z_C, which is defined as

$$ {Z}_C=\frac{U}{\sqrt{Var(U)}} $$

(3)

where $ U={\mathbf{1}}^T\overline{X} $, 1 is the k-dimensional vector 1 = (1, ⋯, 1)^T, $ \overline{X}={\left({\overline{x}}_1,{\overline{x}}_2,\cdots, {\overline{x}}_k\right)}^T $ with $ {\overline{x}}_i=\frac{1}{N}\sum_{j=1}^{N}{x}_{ij} $, $ {\sigma}_{ij}=\frac{1}{N-1}\sum_{r=1}^{N}\left({x}_{ir}-{\overline{x}}_i\right)\left({x}_{jr}-{\overline{x}}_j\right) $ is the sample covariance between variants i and j, and $ Var(U)=\frac{1}{N}\sum_{i,j=1}^{k}{\sigma}_{ij} $. When parental genotypes are missing, we substitute $ {\tilde{x}}_{ij} $ for x_ij and denote the test statistic by $ {\tilde{Z}}_C $,

$$ {\tilde{Z}}_C=\frac{\tilde{U}}{\sqrt{Var\left(\tilde{U}\right)}} $$

(4)
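As an illustration of Eqs. (3) and (4), the sketch below computes the collapsing statistic from a k × N table of paired differences. It assumes the σ_ij are the usual sample covariances between variants, and the function name `collapsing_z` is hypothetical. Passing the values $ {\tilde{x}}_{ij} $ instead of x_ij gives $ {\tilde{Z}}_C $.

```python
from math import sqrt


def collapsing_z(x):
    """Collapsing statistic U / sqrt(Var(U)).

    x[i][j] is the paired difference for variant i in trio j
    (k rows of length N, with N >= 2).
    """
    k, n = len(x), len(x[0])
    means = [sum(row) / n for row in x]
    u = sum(means)  # U = 1^T X-bar
    # Sum of all entries of the k x k sample covariance matrix.
    sigma_sum = 0.0
    for i in range(k):
        for j in range(k):
            cross = sum((x[i][r] - means[i]) * (x[j][r] - means[j])
                        for r in range(n))
            sigma_sum += cross / (n - 1)
    var_u = sigma_sum / n
    return u / sqrt(var_u)
```

Because U sums over variants, the same value is obtained by first collapsing each trio to the sum of its differences and then forming the mean divided by its standard error. This makes explicit that the statistic is a burden-type test.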

In the TDT framework, we let b_ij be the number of minor alleles transmitted from heterozygous parents to the affected offspring at variant i in the jth trio, and c_ij be the number of major alleles transmitted from heterozygous parents to the affected offspring at variant i in the jth trio. Let $ {b}_i=\sum_j{b}_{ij} $ be the total number of minor alleles transmitted from heterozygous parents to the affected offspring at the ith variant, and let $ {c}_i=\sum_j{c}_{ij} $ be the total number of major alleles transmitted from heterozygous parents to the affected offspring at the ith variant. The collapsing method for rare variants in the TDT framework (corresponding to TDT_BRV in He et al. 2014 [11]) is

$$ {TDT}_{\mathrm{BRV}}=\frac{{\left(\sum_{i=1}^{k}{b}_i-\sum_{i=1}^{k}{c}_i\right)}^2}{\sum_{i=1}^{k}{b}_i+\sum_{i=1}^{k}{c}_i} $$

(5)

When parental genotypes are missing, we define $ {\tilde{b}}_{ij} $ and $ {\tilde{c}}_{ij} $ as follows,

$$ {\tilde{b}}_{ij}=\begin{cases}{b}_{ij}, & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_0\\ E\left({b}_{ij}|{G}_{Oij},{G}_{Fij}\right), & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathrm{I}}\\ E\left({b}_{ij}|{G}_{Oij}\right), & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathrm{II}}\end{cases} $$

$$ {\tilde{c}}_{ij}=\begin{cases}{c}_{ij}, & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_0\\ E\left({c}_{ij}|{G}_{Oij},{G}_{Fij}\right), & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathrm{I}}\\ E\left({c}_{ij}|{G}_{Oij}\right), & \left(\left\{{G}_{Fij},{G}_{Mij}\right\},{G}_{Oij}\right)\in {\Omega}_{\mathrm{II}}\end{cases} $$

Additional file 1: Table S1 and Additional file 2: Table S2 present all the expectations E(x| ⋅), E(b| ⋅), and E(c| ⋅) when ({G_F, G_M}, G_O) ∈ Ω_Ι and ({G_F, G_M}, G_O) ∈ Ω_ΙΙ, respectively. We substitute $ {\tilde{b}}_{ij} $ and $ {\tilde{c}}_{ij} $ for b_ij and c_ij, and denote the resulting test statistic of the TDT_BRV method by $ {\widetilde{TDT}}_{\mathrm{BRV}} $:

$$ {\widetilde{TDT}}_{\mathrm{BRV}}=\frac{{\left(\sum_{i=1}^{k}{\tilde{b}}_i-\sum_{i=1}^{k}{\tilde{c}}_i\right)}^2}{\sum_{i=1}^{k}{\tilde{b}}_i+\sum_{i=1}^{k}{\tilde{c}}_i} $$

(6)
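The transmission counts and the statistic in Eqs. (5) and (6) can be sketched as follows. This is an illustrative implementation under the same assumptions as above (biallelic variants, Hardy-Weinberg proportions, Mendelian transmission); the function names are hypothetical, and the expected counts are obtained by enumeration rather than taken from the supplementary tables.

```python
from itertools import product


def transmissions(g_f, g_m, g_o):
    """Return (b, c) for one complete trio at one variant: minor and major
    alleles transmitted by heterozygous parents to the offspring."""
    n_het = sum(1 for g in (g_f, g_m) if g == 1)
    from_homozygotes = sum(1 for g in (g_f, g_m) if g == 2)
    b = g_o - from_homozygotes  # minor alleles that came from heterozygotes
    if not 0 <= b <= n_het:
        raise ValueError("genotypes are not Mendelian-consistent")
    return b, n_het - b


def offspring_prob(g_o, g_f, g_m):
    """P(offspring minor allele count g_o | parental counts g_f, g_m)."""
    return sum((g_f / 2 if a_f else 1 - g_f / 2) * (g_m / 2 if a_m else 1 - g_m / 2)
               for a_f, a_m in product((0, 1), repeat=2) if a_f + a_m == g_o)


def expected_counts_case_only(g_o, maf):
    """Return (E(b | G_O), E(c | G_O)) when both parents are missing."""
    hwe = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    e_b = e_c = total = 0.0
    for g_f, g_m in product(range(3), repeat=2):
        weight = hwe[g_f] * hwe[g_m] * offspring_prob(g_o, g_f, g_m)
        if weight == 0:
            continue  # configuration impossible for this offspring genotype
        b, c = transmissions(g_f, g_m, g_o)
        e_b += weight * b
        e_c += weight * c
        total += weight
    return e_b / total, e_c / total


def tdt_brv(b_counts, c_counts):
    """Collapsed TDT statistic from per-trio, per-variant counts.
    Fractional (expected) counts are accepted."""
    b_total, c_total = sum(b_counts), sum(c_counts)
    return (b_total - c_total) ** 2 / (b_total + c_total)
```

For a complete trio, b − c equals the paired difference x = 2G_O − G_F − G_M, so the two statistics share the same numerator information and differ in how it is scaled.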

Results

Simulation setting

To assess the performance of our method, we perform a series of simulation studies under a wide range of parameter values. The simulation parameters include the total number of variants (k), the MAF of each variant, the number (q) and effect sizes (measured by the odds ratio (OR)) of causal variants, and the sample size (N) of case-parents trios, with N₀, N_I, and N_II trios for Ω₀, Ω_Ι, and Ω_ΙΙ, respectively. We simulate two populations.

In the first population, 1000 case-parents families are generated and the parameters are chosen as follows: k = 20; q = 0.2k, 0.4k, 0.6k, 0.8k; and the MAF of each variant is drawn from a uniform distribution on (0.001, 0.01). Under the null hypothesis of no association, we set OR = 1 for all the variants. Under the alternative hypothesis of association, we set OR = 1 for non-causal variants and consider two scenarios for the effects of causal variants. First, the causal variants have the same positive direction but different effects. Here we set OR ∈ [1.2, 3] in arithmetic progression. Second, the causal variants have opposite effects. Here we set OR ∈ [0.2, 0.9] ∪ [1.2, 3], with half of the causal variants belonging to [0.2, 0.9] and the other half belonging to [1.2, 3], in arithmetic progression.

In the second population, 500 case-parents families and 500 unaffected individuals are generated; the unaffected individuals are used to estimate the MAF when samples come from the second population. The parameter settings are the same as those in the first population except for the OR of causal variants under the alternative hypothesis: we let the OR of each causal variant in the second population be 0.1 less than that in the first population.

Once the parameter values are chosen, we first generate parental haplotypes based on a latent variable Z = (Z₁, ⋯, Z_k) from a multivariate normal distribution with standard normal marginals and the covariance structure described below [16, 17]: if variants i and j are both causal or both non-causal, then the correlation is set to Corr(Z_i, Z_j) = 0.4^{∣i − j∣}; otherwise the correlation is zero. We transform Z_i to 0 (major allele) or 1 (minor allele) according to the MAF of the ith variant and combine two haplotypes to obtain the parental haplotypes [16, 17]. Offspring haplotypes are generated from the parental haplotypes assuming no recombination between the variants. The disease status of an offspring is determined by the following logistic model [18]:

$$ P\left( Affected|{G}_{Oij},i=1,\cdots, k\right)=\frac{1}{1+\exp \left(-\gamma \right)}, $$

$$ \gamma =\ln \left(\frac{c}{1-c}\right)+\sum_{i=1}^{k}\ln \left({\mathrm{OR}}_i\right)\cdot {G}_{Oij} $$

where OR_i is the odds ratio of ith variant, G_Oij is the minor allele count carried by the affected offspring in the jth trio at the ith variant, and c is the background prevalence of being affected for a subject with no minor alleles. Here, we let c = 0.01 in the first population and c = 0.008 in the second population.
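The logistic disease model above can be written in a few lines. This is a minimal sketch; the function name `affected_probability` and its argument names are hypothetical.

```python
from math import exp, log


def affected_probability(genotypes, odds_ratios, baseline):
    """P(affected | genotypes) under the logistic model.

    genotypes[i]   minor allele count (0, 1, 2) at variant i
    odds_ratios[i] odds ratio OR_i of variant i (1 for non-causal variants)
    baseline       prevalence c for a subject carrying no minor alleles
    """
    gamma = log(baseline / (1 - baseline))
    gamma += sum(log(or_i) * g for or_i, g in zip(odds_ratios, genotypes))
    return 1 / (1 + exp(-gamma))
```

With no minor alleles the function returns the baseline prevalence c. Each copy of a minor allele multiplies the odds of being affected by that variant's OR, so an OR below 1 is protective.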

The 1000 case-parents trios in the first population are composed of three types of trios: 500 (N₀) for Ω₀, 250 for Ω_Ι obtained by randomly discarding the genotypes of one parent, and 250 for Ω_ΙΙ obtained by discarding the genotypes of both parents. The 500 case-parents trios in the second population are of two types: 250 for Ω_Ι and 250 for Ω_ΙΙ. In our analysis, we fix N₀ (= 500) and vary N_I and N_II, letting each take the values $ \frac{1}{10} $N₀, $ \frac{1}{5} $N₀, and $ \frac{1}{2} $N₀. We calculate Z_C and TDT_BRV in Ω₀, and $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $ in Ω_0+Ι, Ω_0+ΙΙ, and Ω_0+Ι+ΙΙ. The p-value of each test is estimated by a permutation procedure: we first calculate the data-based statistic, and then recalculate a permutation-based statistic by randomly changing the signs (positive or negative) of x_ij for $ {\tilde{Z}}_C $ and by randomly permuting the “transmitted” and “not transmitted” labels for $ {\widetilde{TDT}}_{\mathrm{BRV}} $, each with equal probability. We repeat this process 1000 times, and the p-value is estimated as the proportion of permutation-based statistics that are larger than the data-based statistic. For a given significance level α, the power or type I error rate is estimated as the proportion of 1000 replicates in which the null hypothesis is rejected (p-value ≤ α).
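The sign-flipping permutation described above might be sketched as follows. Several details are assumptions rather than statements from the text: the sign is flipped once per trio and applied to all of that trio's variants, the test is treated as two-sided by comparing absolute values, and ties count toward the p-value. The names `sign_flip_pvalue` and `collapsed_sum` are hypothetical, and the simple summed difference stands in for the full statistic.

```python
import random


def collapsed_sum(x):
    """Stand-in statistic: sum of all paired differences (assumed for illustration)."""
    return sum(sum(row) for row in x)


def sign_flip_pvalue(x, statistic=collapsed_sum, n_perm=1000, seed=0):
    """Permutation p-value obtained by randomly flipping the sign of each trio.

    x[i][j] is the paired difference for variant i in trio j.
    """
    rng = random.Random(seed)
    n_trios = len(x[0])
    observed = abs(statistic(x))
    hits = 0
    for _ in range(n_perm):
        # One sign per trio, applied to every variant of that trio (assumption).
        signs = [rng.choice((-1, 1)) for _ in range(n_trios)]
        permuted = [[s * v for s, v in zip(signs, row)] for row in x]
        if abs(statistic(permuted)) >= observed:
            hits += 1
    return hits / n_perm
```

Flipping the sign of a trio swaps the roles of the affected offspring and the complement, which is an equally likely configuration under the null hypothesis of no association.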

Type I error rates and power

We investigate the performance of our method in a homogeneous population and in populations with population stratification. For the homogeneous population, all samples come from the first population. For population stratification, the case-parents trios with missing parental genotypes come from the second population. We present in Table 1 the type I error rates when α = 0.05 and 0.001. As shown in Table 1, for the three situations Ω_0+Ι, Ω_0+ΙΙ, and Ω_0+Ι+ΙΙ, the type I error rates are well controlled around the nominal levels. This indicates that the method remains valid when trios with one missing parental genotype or cases only are included, even under population stratification.

Table 1 Type I error rate of $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $

We present in Tables 2, 3 and 4 the power of $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $ in the homogeneous population for the three situations Ω_0+Ι, Ω_0+ΙΙ, and Ω_0+Ι+ΙΙ, respectively, when causal variants have the same positive direction but different effects or when causal variants have opposite effects. We can see from Table 2 that, when causal variants have different effects with the same direction and the proportion of non-causal variants is 80% or 60%, adding case-parents trios of Ω_Ι to the complete case-parents data set increases the power of $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $ for rare variant association analysis. For example, when there are 80% non-causal variants, adding $ \frac{1}{10} $N₀ (50), $ \frac{1}{5} $N₀ (100), and $ \frac{1}{2} $N₀ (250) case-parents trios of Ω_Ι to 500 complete case-parents trios improves the powers of $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $ from 0.408 and 0.602 to 0.566 and 0.712, to 0.674 and 0.748, and to 0.752 and 0.784, respectively. We observed that, although the power of $ {\tilde{Z}}_C $ is lower than that of $ {\widetilde{TDT}}_{\mathrm{BRV}} $ when only complete case-parents data are used, adding $ \frac{1}{2} $N₀ case-parents trios of Ω_Ι to the complete case-parents trios helps $ {\tilde{Z}}_C $ achieve power similar to that of $ {\widetilde{TDT}}_{\mathrm{BRV}} $. We also noted that, when the proportion of non-causal variants is small (40% or 20%), the two statistics already have high power with 500 complete case-parents trios alone, so adding case-parents trios of Ω_Ι does not further improve power. When we decrease the sample size to 200, adding case-parents trios of Ω_Ι still improves the power of $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $ (data not shown). When causal variants have opposite effects, we also observed that adding case-parents trios of Ω_Ι improves the statistical power.

Table 2 Empirical power at the 0.05 significance level for Ω_0+Ι in the homogeneous population

Table 3 Empirical power at the 0.05 significance level for Ω_0+ΙΙ in the homogeneous population

Table 4 Empirical power at the 0.05 significance level for Ω_0+Ι+ΙΙ in the homogeneous population

To further show the magnitude of the power improvement of $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $, we present in parentheses in Tables 2, 3 and 4 the proportion of power improvement obtained by adding case-parents trios of Ω_Ι, Ω_ΙΙ, and Ω_Ι+ΙΙ to the complete case-parents data set Ω₀. Table 2 shows that the proportion of power improvement drops with a decrease in the number of non-causal variants, and that the proportion of power improvement for $ {\tilde{Z}}_C $ is higher than that for $ {\widetilde{TDT}}_{\mathrm{BRV}} $. When causal variants have opposite effects, we observed that the proportion of power improvement is larger than when causal variants have the same direction. As the proportion of non-causal variants decreases from 80% to 60%, the proportion of power improvement increases. For example, as the number of case-parents trios of Ω_Ι increases from $ \frac{1}{10} $N₀ to $ \frac{1}{2} $N₀, the powers of $ {\widetilde{TDT}}_{\mathrm{BRV}} $ and $ {\tilde{Z}}_C $ for 80% non-causal variants improve by 20.4 to 38.1% and by 62.9 to 117%, respectively, whereas the powers of $ {\widetilde{TDT}}_{\mathrm{BRV}} $ and $ {\tilde{Z}}_C $ for 60% non-causal variants improve by 25.8 to 61.3% and by 89.0 to 239%, respectively. However, the proportion of power improvement drops with a further decrease in the number of non-causal variants. For example, as the number of case-parents trios of Ω_Ι increases from $ \frac{1}{10} $N₀ to $ \frac{1}{2} $N₀, the proportion of power improvement of $ {\widetilde{TDT}}_{\mathrm{BRV}} $ changes from 7.58 to 36.4% for 40% non-causal variants and from 5.76 to 35.2% for 20% non-causal variants, and the proportion of power improvement of $ {\tilde{Z}}_C $ changes from 34.0 to 96.6% for 40% non-causal variants and from 31.3 to 98.7% for 20% non-causal variants. This result indicates that, when causal variants have opposite effects, the proportion of power improvement first increases and then decreases as the number of non-causal variants increases. Adding case-parents trios of Ω_Ι is therefore most beneficial when the number of non-causal variants is intermediate. For Ω_0+ΙΙ and Ω_0+Ι+ΙΙ, Tables 3 and 4 show results similar to those for Ω_0+Ι. In addition, we observed that the proportion of power improvement for Ω_0+Ι is the largest among the three situations Ω_0+Ι, Ω_0+ΙΙ, and Ω_0+Ι+ΙΙ.
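The text does not spell out how the proportion of power improvement is computed. A natural reading, consistent with the reported values above 100%, is the relative gain over the power obtained from complete trios alone. The sketch below applies that assumed definition to the same-direction powers quoted earlier for 80% non-causal variants; the function name `relative_gain` is hypothetical.

```python
def relative_gain(baseline_power, new_power):
    """Relative improvement (new - baseline) / baseline; an assumed definition."""
    return (new_power - baseline_power) / baseline_power


# Powers quoted in the text for 80% non-causal variants, same-direction effects,
# after adding N0/10, N0/5, and N0/2 trios of type Omega_I to 500 complete trios.
z_baseline, z_powers = 0.408, [0.566, 0.674, 0.752]
tdt_baseline, tdt_powers = 0.602, [0.712, 0.748, 0.784]

z_gains = [relative_gain(z_baseline, p) for p in z_powers]
tdt_gains = [relative_gain(tdt_baseline, p) for p in tdt_powers]

for label, gains in (("Z_C", z_gains), ("TDT_BRV", tdt_gains)):
    print(label, ", ".join(f"{100 * g:.1f}%" for g in gains))
```

Under this definition the gain for the collapsing statistic exceeds that for the TDT-based statistic at every sample size, in line with the pattern described for Table 2.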

When there is population stratification, Additional file 3: Figures S1-S4 show the power of $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $ against the sample size for various proportions of non-causal variants under the three situations Ω_0+Ι, Ω_0+ΙΙ, and Ω_0+Ι+ΙΙ. The results are similar to those in the homogeneous population. We also consider a more general situation for population stratification, in which the samples from each of the two populations consist of case-parents trios of all three types, Ω₀, Ω_Ι, and Ω_ΙΙ. The simulation results are similar to those in Additional file 3: Figures S1-S4 (data not shown). These results indicate that, when case-parents trios with missing parental genotypes, or even cases only, are added to a complete case-parents data set, population stratification does not affect the power of these two statistics for rare variant association analysis.

Discussion

In this report, we considered case-parents data with missing parental genotypes for rare variant association analysis in case-parents studies. Based on the collapsing method with the difference vector and on the TDT framework, we presented two statistics, $ {\tilde{Z}}_C $ and $ {\widetilde{TDT}}_{\mathrm{BRV}} $, that allow for missing parental genotypes. The key to the proposed approach is estimating the MAF. In clinical research, the experimental design is usually developed for a homogeneous population or for several specific populations, so one can use the known parental genotypes or samples from the background population to estimate the MAF. We investigated the performance of these two statistics in three different situations: complete case-parents data mixed with trios missing one parental genotype, complete case-parents data mixed with trios missing both parental genotypes, and complete case-parents data mixed with trios missing one or both parental genotypes. Through simulation studies, we found that adding case-parents data with missing parental genotypes to a complete case-parents data set can greatly improve the power of these two statistics, although the proportion of power improvement varied. Additionally, our strategy is not affected by population stratification.

In most studies of disease associations with rare variants, family- and population-based samples were used separately [6, 7, 11, 19, 20]. Although family-based studies have several advantages over population-based studies in rare variant association analysis, it is often difficult to recruit sufficiently large family-based samples, especially for rare diseases. Moreover, information about parents is often incomplete, which poses challenges for analysis. Discarding families with missing parental genotypes further reduces the sample size and results in a loss of statistical power. In our strategy, case-parents trios missing one or both parental genotypes are kept in the analysis and thus can help to greatly improve statistical power. Furthermore, missing both parental genotypes corresponds to case only. This means that we can use unrelated affected individuals in case-parents studies, which is useful for case-parents studies with small sample sizes. Although population stratification might exist among unrelated affected individuals recruited from population-based samples, our strategy is not affected by population stratification. Our simulation results showed that combining unrelated affected individuals with complete case-parents data could increase power by 5 to 60% for $ {\widetilde{TDT}}_{\mathrm{BRV}} $ and by 20 to 200% for $ {\tilde{Z}}_C $, both in homogeneous populations and in populations with population stratification.

In addition to allowing for missing parental genotypes, our method can be used to address another problem when there are missing genotypes for individual variants in parental data. In fact, when individual variants are analyzed and there are missing genotypes for some variants, removing those samples for variants with missing genotypes will result in inconsistency of the sample size. With the strategy described above, our method can overcome this problem. However, our strategy is not suitable for case-parents trios with missing offspring genotypes, so further study is needed to address such scenarios.

Conclusions

The proposed strategy allowing for missing parental genotypes, or even adding unrelated affected individuals, can greatly improve the statistical power for rare variant-disease association and meanwhile is not affected by population stratification.

Abbreviations

BRV: Burden of rare variants
CMC: Combined multivariate and collapsing
LD: Linkage disequilibrium
MAF: Minor allele frequency
OR: Odds ratio
TDT: Transmission/disequilibrium test
VT: Variable threshold
WSS: Weighted sum statistic

References

1. Ott J, Kamatani Y, Lathrop M. Family-based designs for genome-wide association studies. Nat Rev Genet. 2011;12(7):465–74.
2. Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nat Genet. 2012;44(3):243–6.
3. Liu J, Lewinger JP, Gilliland FD, Gauderman WJ, Conti DV. Confounding and heterogeneity in genetic association studies with admixed populations. Am J Epidemiol. 2013;177(4):351–60.
4. He Z, Zhang D, Renton AE, Li B, Zhao L, Wang GT, Goate AM, Mayeux R, Leal SM. The rare-variant generalized disequilibrium test for association analysis of nuclear and extended pedigrees with application to Alzheimer disease WGS data. Am J Hum Genet. 2017;100(2):193–204.
5. Shi M, Umbach DM, Weinberg CR. Identification of risk-related haplotypes with the use of multiple SNPs from nuclear families. Am J Hum Genet. 2007;81(1):53–66.
6. Li B, Leal SM. Methods for detecting association with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.
7. Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet. 2011;89(3):354–67.
8. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384.
9. Price AL, Kryukov GV, de Bakker PIW, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–8.
10. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52(3):506–16.
11. He Z, O’Roak BJ, Smith JD, Wang G, Hooker S, Santos-Cortez RLP, Li B, Kan M, Krumm N, Nickerson DA, Shendure J, Eichler EE, Leal SM. Rare-variant extensions of the transmission disequilibrium test: application to autism exome sequence data. Am J Hum Genet. 2014;94(1):33–46.
12. McIntyre LM, Martin ER, Simonsen KL, Kaplan NL. Circumventing multiple testing: a multilocus Monte Carlo approach to testing for association. Genet Epidemiol. 2000;19(1):18–29.
13. Li YM, Xiang Y. Detecting disease association with rare variants in case-parents studies. J Hum Genet. 2017;62(5):549–52.
14. Allen AS, Rathouz PJ, Satten GA. Informative missingness in genetic association studies: case-parent designs. Am J Hum Genet. 2003;72(3):671–80.
15. Sebastiani P, Abad MM, Alpargu G, Ramoni MF. Robust transmission/disequilibrium test for incomplete family genotypes. Genetics. 2004;168(4):2329–37.
16. Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol. 2011;35(7):606–19.
17. Sun L, Wang C, Hu YQ. Utilizing mutual information for detecting rare and common variants associated with a categorical trait. PeerJ. 2016;4:e2139.
18. Preston MD, Dudbridge F. Utilising family-based designs for detecting rare variant disease associations. Ann Hum Genet. 2014;78(2):129–40.
19. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
20. Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Family-based association tests for sequence data, and comparisons with population-based association tests. Eur J Hum Genet. 2013;21(10):1158–62.

Acknowledgments

LYM was partially supported by the National Natural Science Foundation of China (11301206), the Scientific Research Fund of Hunan Provincial Education Department (16A166), the Hunan Provincial Natural Science Foundation of China (2017JJ2212), and the China Scholarship Council (National cooperation fund of Hunan Province). HWD was partially supported by grants from the National Institutes of Health (R01AR057049, R01AR059781, D43TW009107, P20 GM109036, R01MH107354, R01MH104680, R01GM109068, R01AR069055) and by the Edward G. Schlieder Endowment fund to Tulane University. The authors thank Loula Burton, Office of Research at Tulane University, for assistance in editing the manuscript.

Funding

This work was financially supported by the National Natural Science Foundation of China (11301206), the Scientific Research Fund of Hunan Provincial Education Department (16A166), and the Hunan Provincial Natural Science Foundation of China (2017JJ2212).

Availability of data and materials

All data generated or analysed during this study are included in this published article.

Author information

Authors and Affiliations

School of Mathematics and Computational Science, Huaihua University, Huaihua, Hunan, 418008, People’s Republic of China
Yumei Li & Yang Xiang
Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, 70112, USA
Yumei Li, Chao Xu, Hui Shen & Hongwen Deng
Center for Bioinformatics and Genomics, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, 70112, USA
Hongwen Deng

Authors

Yumei Li
Yang Xiang
Chao Xu
Hui Shen
Hongwen Deng

Contributions

LYM conceived the idea, designed the study, and wrote the manuscript. XY developed the statistical method. XC, ShH, and DHW revised the manuscript. All authors have read and approved the final version of the manuscript.

Corresponding authors

Correspondence to Yumei Li or Hongwen Deng.

Ethics declarations

Ethics approval and consent to participate

This study did not directly involve humans, animals, or plants, so no consent to participate was required.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1: Table S1.

All the expectations E(x), E(b), and E(c) when (G_F, G_M, G_O) ∈ Ω_I. (PDF 109 kb)

Additional file 2: Table S2.

All the expectations E(x), E(b), and E(c) when (G_F, G_M, G_O) ∈ Ω_II. (PDF 88 kb)

Additional file 3: Figure S1.

Empirical power against the sample size at the 0.05 significance level in population stratification when there are 20% non-causal variants.

Figure S2. Empirical power against the sample size at the 0.05 significance level in population stratification when there are 40% non-causal variants.

Figure S3. Empirical power against the sample size at the 0.05 significance level in population stratification when there are 60% non-causal variants.

Figure S4. Empirical power against the sample size at the 0.05 significance level in population stratification when there are 80% non-causal variants.

Note (Figures S1 to S4): A and B are for $ {\tilde{Z}}_C $, and C and D are for $ {TDT}_{\mathrm{BRV}} $, when causal variants have different effects with the same direction and causal variants have opposite effects, respectively. The sample size N = N₀, N₀ + 1/10 N₀, N₀ + 1/5 N₀, N₀ + 1/2 N₀ with N₀ = 500 is denoted by 0, 1/10, 1/5, and 1/2, respectively. Ω_{0 + I} (○), Ω_{0 + II} (*), Ω_{0 + I + II} (+). (PDF 89 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Li, Y., Xiang, Y., Xu, C. et al. Rare variant association analysis in case-parents studies by allowing for missing parental genotypes. BMC Genet 19, 7 (2018). https://doi.org/10.1186/s12863-018-0597-8

Received: 07 July 2017
Accepted: 04 January 2018
Published: 15 January 2018
DOI: https://doi.org/10.1186/s12863-018-0597-8
