 Methodology article
 Open Access
Generalized disequilibrium test for association in qualitative traits incorporating imprinting effects based on extended pedigrees
BMC Genetics volume 18, Article number: 90 (2017)
Abstract
Background
For dichotomous traits, the generalized disequilibrium test with the moment estimate of the variance (GDTME) is a powerful family-based association method. Genomic imprinting is an important epigenetic phenomenon, and there has been increasing interest in incorporating imprinting to improve the power of association analysis. However, GDTME does not take imprinting effects into account, and it has not been investigated whether it can be used for association analysis when such effects exist.
Results
In this article, based on a novel decomposition of the genotype score according to the paternal or maternal source of the allele, we propose the generalized disequilibrium test with imprinting (GDTI) for complete pedigrees without any missing genotypes. Then, we extend GDTI and GDTME to accommodate incomplete pedigrees in which some genotypes are missing, by using a Monte Carlo (MC) sampling and estimation scheme to infer missing genotypes given the available genotypes in each pedigree; the resulting tests are denoted by MCGDTI and MCGDTME, respectively. The proposed GDTI and MCGDTI methods evaluate the differences in the paternal as well as the maternal allele scores for all discordant relative pairs in a pedigree, including pairs beyond first-degree relatives. Advantages of the proposed GDTI and MCGDTI test statistics over existing methods are demonstrated by simulation studies under various simulation settings and by application to the rheumatoid arthritis dataset. Simulation results show that the proposed tests control the size well under the null hypothesis of no association, and outperform the existing methods under various imprinting effect models. The existing GDTME and the proposed MCGDTME can be used to test for association even when imprinting effects exist. In the application to the rheumatoid arthritis data, MCGDTI identifies more loci statistically significantly associated with the disease than the existing methods do.
Conclusions
Under complete and incomplete imprinting effect models, our proposed GDTI and MCGDTI methods, by considering the information on imprinting effects and all discordant relative pairs within each pedigree, outperform all the existing test statistics and MCGDTI can recapture much of the missing information. Therefore, MCGDTI is recommended in practice.
Background
Genomic imprinting is an important epigenetic phenomenon in studying complex traits, where the expression levels of certain genes depend on their parental origin [1,2,3]. Morison et al. [4, 5] constructed an imprinted gene and parent-of-origin effect database to collect genes that show imprinting effects, which has been updated by Glaser et al. [6] to include the parental origin of de novo mutations. Furthermore, several studies have demonstrated that genomic imprinting plays an important role in human genetic diseases such as Beckwith-Wiedemann syndrome, Silver-Russell syndrome, pseudohypoparathyroidism and transient neonatal diabetes mellitus [7,8,9,10].
For a diallelic marker locus, there have been many family-based methods to test for the association between genotype scores and dichotomous traits [11,12,13,14,15]. Among them, the generalized disequilibrium test with the moment estimate of the variance (GDTME) [15] is a powerful method, which generalizes the traditional transmission disequilibrium test [11] by using the genotype differences between all discordant relative pairs (including those beyond first-degree relatives) within a family. Recently, there has been increasing interest in incorporating imprinting to improve the power of association analysis. However, GDTME does not take imprinting effects into account, and it has not been investigated whether it can be used for association analysis when such effects exist. On the other hand, Xia et al. [16] developed the transmission disequilibrium test with imprinting for qualitative traits based on two-generation nuclear families, but it is not suitable for extended pedigrees. As such, the pedigree disequilibrium test with imprinting (PDTI) and its extension, Monte Carlo (MC) PDTI (MCPDTI), which accommodates pedigrees with missing genotypes, were proposed to test for association while accounting for the influence of imprinting [17]. However, they only utilize the genotype differences between first-degree relative pairs in a family, and ignoring the genotype differences between relatives beyond the first degree may reduce their power.
To incorporate imprinting effects into association analysis, in this article, we develop a novel decomposition of the genotype score of each individual according to the paternal or maternal source of the allele. Based on these paternal and maternal allele scores, we propose the generalized disequilibrium test with imprinting (GDTI) for association for complete pedigrees without any missing genotypes. Then, borrowing the idea of Zhou et al. [18] and Ding et al. [19], we further extend GDTI and GDTME to accommodate incomplete pedigrees where the genotypes of some individuals are missing, based on an MC sampling and estimation scheme to infer the missing genotypes given the observed genotypes in each pedigree; the resulting tests are denoted by MCGDTI and MCGDTME, respectively. Advantages of the proposed GDTI and MCGDTI test statistics over existing methods are demonstrated by simulation studies under various simulation settings and by application to the rheumatoid arthritis (RA) dataset [20]. Simulation results show that the proposed GDTI, MCGDTI and MCGDTME control the type I error rates well under the null hypothesis of no association and no imprinting. The existing GDTME and the proposed MCGDTME can be used to test for association even when imprinting effects exist. MCGDTI can recapture much of the missing information. Further, the proposed tests outperform the existing methods under complete, incomplete and no imprinting effect models. In the real data application, MCGDTI identifies more loci statistically significantly associated with RA after Bonferroni correction than the existing methods do.
Methods
Notations
Consider a diallelic marker locus with alleles M _{1} and M _{2}, whose three possible genotypes are M _{2} M _{2}, M _{1} M _{2} and M _{1} M _{1}. We consider a disease susceptibility locus with the disease allele D and the normal allele d, and the corresponding ordered genotypes are D/D, D/d, d/D and d/d with penetrances f _{2}, f _{10}, f _{01} and f _{0}, respectively. f _{10} = f _{01} indicates no imprinting effects at the disease susceptibility locus. Further, the coefficient of linkage disequilibrium (LD) between alleles M _{1} and D is taken as \( \mathrm{LD}=P\left(D{M}_1\right)-{P}_D{P}_{M_1} \), where P(DM _{1}) is the frequency of haplotype DM _{1}, and P _{ D } and \( {P}_{M_1} \) are the allele frequencies of D and M _{1}, respectively. Suppose that we collect n independent pedigrees. Within the i ^{th} pedigree, which contains N _{ i } family members (i = 1, 2, …, n), we assume without loss of generality that the first A _{ i } individuals are affected and the other U _{ i } = N _{ i } − A _{ i } members are unaffected. Let Y _{ ij } be the disease status of the j ^{th} individual in the i ^{th} pedigree (i = 1, 2, …, n; j = 1, 2, …, N _{ i }), i.e. Y _{ ij } = 1 (0) denotes that the individual is affected (unaffected).
Existing generalized disequilibrium test with moment estimate of variance
We begin by describing the existing GDTME test [15]. For convenience, we define the genotype score X _{ ij } as the number of copies of allele M _{1} in the genotype of the j ^{th} individual in the i ^{th} pedigree, i.e. X _{ ij } = 0, 1 and 2 for the genotypes M _{2} M _{2}, M _{1} M _{2} and M _{1} M _{1}, respectively. As such, the logistic regression model is
\[ \log \frac{P\left({Y}_{ij}=1|{X}_{ij}\right)}{1-P\left({Y}_{ij}=1|{X}_{ij}\right)}={\beta}_0+{\beta}_1{X}_{ij},\qquad (1) \]
where β _{0} is the intercept, β _{1} is the regression coefficient, and Y _{ ij } is the disease status of the j ^{th} individual in the i ^{th} pedigree. The GDTME statistic for testing the association between the disease status and X _{ ij } can then be expressed as
\[ \mathrm{GDTME}=\frac{\sum_{i=1}^n{S}_i}{\sqrt{\sum_{i=1}^n{S}_i^2}},\qquad (2) \]
where \( {S}_i=\frac{1}{N_i}\sum_{j=1}^{A_i}\sum_{k={A}_i+1}^{N_i}\left({X}_{ij}-{X}_{ik}\right) \) is the score of the i ^{th} pedigree and \( {\sum}_{i=1}^n{S}_i^2 \) is an unbiased moment estimate of the variance of \( \sum_{i=1}^n{S}_i \). The variance of \( \sum_{i=1}^n{S}_i \) can also be estimated based on the information on kinship coefficients when identity by descent (IBD) is unknown [15]. For convenience, we denote the corresponding test statistic by GDT in this article.
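As a minimal sketch of how the statistic described above could be computed, the Python below sums the genotype-score differences over all affected/unaffected pairs in each pedigree, divides by the pedigree size, and forms the ratio of the summed pedigree scores to the square root of their summed squares. The function names and the toy pedigrees are hypothetical and serve only as an illustration; this is not the authors' implementation.

```python
from math import sqrt


def pedigree_score(genotype_scores, affected):
    """S_i: sum of (X_ij - X_ik) over affected j and unaffected k, divided by N_i."""
    n = len(genotype_scores)
    aff = [x for x, a in zip(genotype_scores, affected) if a]
    unaff = [x for x, a in zip(genotype_scores, affected) if not a]
    return sum(xa - xu for xa in aff for xu in unaff) / n


def gdtme(pedigrees):
    """Sum of pedigree scores divided by the square root of their summed squares."""
    scores = [pedigree_score(x, y) for x, y in pedigrees]
    denom = sqrt(sum(s * s for s in scores))
    if denom == 0.0:
        raise ValueError("all pedigree scores are zero; the statistic is undefined")
    return sum(scores) / denom


# Hypothetical toy data: (genotype scores, affection statuses) per pedigree.
toy_pedigrees = [
    ([2, 1, 0], [1, 0, 0]),
    ([1, 1, 0, 0], [1, 1, 0, 0]),
    ([0, 1, 2], [1, 0, 0]),
]
z = gdtme(toy_pedigrees)
```

A pedigree with no discordant pair contributes a score of zero, which is consistent with the later remark that a pedigree needs at least one affected and one unaffected member to be informative.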
GDTI for complete pedigree data
Although GDTME is a powerful association test and is robust to population stratification (PS) [15], it does not take imprinting effects into consideration. In this article, we investigate whether GDTME can be used to test for association when there are imprinting effects. Moreover, we propose the following generalized disequilibrium test incorporating imprinting effects (GDTI). Note that in GDTME, the genotype score X _{ ij } is coded as the count of allele M _{1} for the j ^{th} individual in the i ^{th} pedigree, i.e.
\[ {X}_{ij}=\begin{cases}2, & \text{if the genotype is } {M}_1{M}_1,\\ 1, & \text{if the genotype is } {M}_1{M}_2,\\ 0, & \text{if the genotype is } {M}_2{M}_2.\end{cases} \]
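To incorporate the information on imprinting effects into the analysis, we divide X _{ ij } into two parts, \( {X}_{ij}^{(p)} \) and \( {X}_{ij}^{(m)} \), according to the paternal or maternal source of the allele, where \( {X}_{ij}={X}_{ij}^{(p)}+{X}_{ij}^{(m)} \), and \( {X}_{ij}^{(p)} \) and \( {X}_{ij}^{(m)} \) are respectively coded as follows:
\[ {X}_{ij}^{(p)}=\begin{cases}1, & \text{if the genotype is } {M}_1{M}_1, \text{ or } {M}_1{M}_2 \text{ with } {M}_1 \text{ inherited from the father},\\ 0.5, & \text{if the genotype is } {M}_1{M}_2 \text{ and the parental origin cannot be determined},\\ 0, & \text{if the genotype is } {M}_2{M}_2, \text{ or } {M}_1{M}_2 \text{ with } {M}_1 \text{ inherited from the mother},\end{cases} \]
and
\[ {X}_{ij}^{(m)}=\begin{cases}1, & \text{if the genotype is } {M}_1{M}_1, \text{ or } {M}_1{M}_2 \text{ with } {M}_1 \text{ inherited from the mother},\\ 0.5, & \text{if the genotype is } {M}_1{M}_2 \text{ and the parental origin cannot be determined},\\ 0, & \text{if the genotype is } {M}_2{M}_2, \text{ or } {M}_1{M}_2 \text{ with } {M}_1 \text{ inherited from the father}.\end{cases} \]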
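We call \( {X}_{ij}^{(p)} \) and \( {X}_{ij}^{(m)} \) the paternal allele score and the maternal allele score, respectively. We then use the following logistic regression to model the association between the disease status Y _{ ij } and the allele scores \( {X}_{ij}^{(p)} \) and \( {X}_{ij}^{(m)} \):
\[ \log \frac{P\left({Y}_{ij}=1|{X}_{ij}^{(p)},{X}_{ij}^{(m)}\right)}{1-P\left({Y}_{ij}=1|{X}_{ij}^{(p)},{X}_{ij}^{(m)}\right)}={\beta}_0+{\beta}_p{X}_{ij}^{(p)}+{\beta}_m{X}_{ij}^{(m)}, \]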
where β _{0} is the intercept, and β _{ p } and β _{ m } are the regression coefficients; β _{ p } describes the effect of an allele M _{1} inherited from the father, and β _{ m } measures the effect of an allele M _{1} inherited from the mother. The null hypothesis H _{0} : β _{ p } = β _{ m } = 0 denotes no association and no imprinting; β _{ p } = β _{ m } ≠ 0 indicates that association exists while there are no imprinting effects, in which case the logistic regression model reduces to the model of GDTME (Equation (1)); β _{ p } ≠ β _{ m } represents that both association and imprinting effects exist. As such,
\[ P\left({Y}_{ij}=1|{X}_{ij}^{(p)},{X}_{ij}^{(m)}\right)=\frac{\exp \left({\beta}_0+{\beta}_p{X}_{ij}^{(p)}+{\beta}_m{X}_{ij}^{(m)}\right)}{1+\exp \left({\beta}_0+{\beta}_p{X}_{ij}^{(p)}+{\beta}_m{X}_{ij}^{(m)}\right)}. \]
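Note that the disease statuses of all the family members in each pedigree are uncorrelated, conditional on their own genotypes at the marker locus. Then, the likelihood that the first A _{ i } individuals are affected, conditional on the fact that there are A _{ i } affected individuals in total in the i ^{th} pedigree, is (for the detailed derivation, see Additional file 1: Appendix):
\[ {L}_i\left({\beta}_p,{\beta}_m\right)=\frac{\exp \left\{\sum_{j=1}^{A_i}\left({\beta}_p{X}_{ij}^{(p)}+{\beta}_m{X}_{ij}^{(m)}\right)\right\}}{\sum_{l=1}^{\binom{A_i+{U}_i}{A_i}}\exp \left\{\sum_{j\in {s}_l}\left({\beta}_p{X}_{ij}^{(p)}+{\beta}_m{X}_{ij}^{(m)}\right)\right\}}, \]
where the s _{ l }’s are all the possible combinations in which A _{ i } out of N _{ i } individuals are affected, obtained by shuffling the affection statuses of all the N _{ i } individuals in the i ^{th} pedigree; s _{ l } is the l ^{th} possible combination, and j ∈ s _{ l } ranges over the individuals labelled as affected in that combination; U _{ i } = N _{ i } − A _{ i } is the number of unaffected individuals in the i ^{th} pedigree. As such, the log-likelihood function for the i ^{th} pedigree is
\[ {\ell}_i\left({\beta}_p,{\beta}_m\right)=\sum_{j=1}^{A_i}\left({\beta}_p{X}_{ij}^{(p)}+{\beta}_m{X}_{ij}^{(m)}\right)-\log \sum_{l=1}^{\binom{A_i+{U}_i}{A_i}}\exp \left\{\sum_{j\in {s}_l}\left({\beta}_p{X}_{ij}^{(p)}+{\beta}_m{X}_{ij}^{(m)}\right)\right\}. \]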
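Under the null hypothesis of no association (H _{0} : β _{ p } = β _{ m } = 0), the score test statistic for testing for association incorporating imprinting effects is formulated as follows (for details, see Additional file 1: Appendix),
\[ \mathrm{GDTI}=\left(\sum_{i=1}^n{D}_{i1},\ \sum_{i=1}^n{D}_{i2}\right){\left(\begin{array}{cc}\sum \limits_{i=1}^n{I}_{i11}& \sum \limits_{i=1}^n{I}_{i12}\\ {}\sum \limits_{i=1}^n{I}_{i21}& \sum \limits_{i=1}^n{I}_{i22}\end{array}\right)}^{-1}\left(\begin{array}{c}\sum \limits_{i=1}^n{D}_{i1}\\ {}\sum \limits_{i=1}^n{D}_{i2}\end{array}\right),\qquad (3) \]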
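where \( \sum_{i=1}^n{D}_{i1} \) and \( \sum_{i=1}^n{D}_{i2} \) are the scores of β _{ p } and β _{ m }, respectively;
\( \left(\begin{array}{cc}\sum \limits_{i=1}^n{I}_{i11}& \sum \limits_{i=1}^n{I}_{i12}\\ {}\sum \limits_{i=1}^n{I}_{i21}& \sum \limits_{i=1}^n{I}_{i22}\end{array}\right) \) is the observed Fisher’s information matrix of β _{ p } and β _{ m };
\[ {D}_{i1}=\frac{1}{N_i}\sum_{j=1}^{A_i}\sum_{k={A}_i+1}^{N_i}\left({X}_{ij}^{(p)}-{X}_{ik}^{(p)}\right),\qquad {D}_{i2}=\frac{1}{N_i}\sum_{j=1}^{A_i}\sum_{k={A}_i+1}^{N_i}\left({X}_{ij}^{(m)}-{X}_{ik}^{(m)}\right), \]
and
\[ {I}_{i11}={D}_{i1}^2,\qquad {I}_{i12}={I}_{i21}={D}_{i1}{D}_{i2},\qquad {I}_{i22}={D}_{i2}^2. \]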
Under the null hypothesis of no association and no imprinting, GDTI asymptotically follows a chi-square distribution with 2 degrees of freedom. Note that the scores D _{ i1} and D _{ i2} evaluate the differences in paternal allele scores and maternal allele scores, respectively, for all discordant relative pairs in a pedigree, thus utilizing information beyond first-degree relative pairs. This is in contrast to other association testing methods under imprinting (e.g. PDTI), where extended pedigrees are treated as multiple nuclear families, so that the available information is not fully utilized.
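The construction above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the 0.5 coding for heterozygotes whose parental origin cannot be determined is taken from the worked example in the Discussion, the information matrix is assumed to be the empirical moment estimate formed from products of the pedigree scores, and all function names and the toy trios are hypothetical. The p-value uses the fact that the upper tail of a chi-square distribution with 2 degrees of freedom is exp(−x/2).

```python
from math import exp


def allele_scores(paternal_is_m1, maternal_is_m1, origin_known=True):
    """(paternal, maternal) allele scores; 0.5 each for an unresolved heterozygote (assumed coding)."""
    if paternal_is_m1 != maternal_is_m1 and not origin_known:
        return 0.5, 0.5
    return float(paternal_is_m1), float(maternal_is_m1)


def pair_score(scores, affected):
    """Sum of score differences over affected/unaffected pairs, divided by pedigree size."""
    aff = [x for x, a in zip(scores, affected) if a]
    unaff = [x for x, a in zip(scores, affected) if not a]
    return sum(xa - xu for xa in aff for xu in unaff) / len(scores)


def gdti(pedigrees):
    """pedigrees: list of (paternal scores, maternal scores, affection statuses)."""
    d1 = d2 = i11 = i12 = i22 = 0.0
    for xp, xm, y in pedigrees:
        a, b = pair_score(xp, y), pair_score(xm, y)
        d1, d2 = d1 + a, d2 + b
        # Assumed moment estimate of the information matrix entries.
        i11, i12, i22 = i11 + a * a, i12 + a * b, i22 + b * b
    det = i11 * i22 - i12 * i12
    if det <= 0.0:
        raise ValueError("information matrix is singular")
    # Quadratic form d' I^{-1} d written out for the 2 x 2 case.
    stat = (i22 * d1 * d1 - 2.0 * i12 * d1 * d2 + i11 * d2 * d2) / det
    return stat, exp(-stat / 2.0)  # chi-square (2 df) upper-tail probability


def pedigree(members):
    """members: list of ((paternal score, maternal score), affected) tuples."""
    return ([s[0] for s, _ in members], [s[1] for s, _ in members], [a for _, a in members])


het = allele_scores(True, False, origin_known=False)  # founder heterozygote
toy = [
    pedigree([(het, 0), (allele_scores(False, False), 0), (allele_scores(True, False), 1)]),
    pedigree([(allele_scores(False, False), 0), (het, 0), (allele_scores(False, True), 1)]),
    pedigree([(het, 0), (het, 0), (allele_scores(True, True), 1)]),
]
statistic, p_value = gdti(toy)
```

Three trios are far too few for the asymptotic distribution to apply; the toy data only exercise the arithmetic.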
MCGDTI and MCGDTME for incomplete pedigree data
When the genotypes of some individuals in a pedigree are missing, GDTI cannot be used directly. Therefore, in the presence of missingness, we extend GDTI and propose MCGDTI based on an MC sampling and estimation process, which may recapture most of the information on missing genotypes based on the observed genotypes. Specifically, we replace D _{ i1}, D _{ i2}, I _{ i11}, I _{ i12}, I _{ i21} and I _{ i22} in GDTI by their conditional expectations, D _{ i1MC }, D _{ i2MC }, I _{ i11MC }, I _{ i12MC }, I _{ i21MC } and I _{ i22MC }, given the observed genotypes, G _{ o }, where T _{ MC } = E(T(G _{ m }, G _{ o }, A) | G _{ o }) for some statistic T; G _{ m } is the set of missing genotypes; A is the collection of the observed phenotypes (disease affection statuses); and T(G _{ m }, G _{ o }, A) is the expanded notation of T to explicitly show its dependence on the missing genotypes G _{ m }, the observed genotypes G _{ o } and the observed phenotype collection A. Following Zhou et al. [18] and Ding et al. [19], we estimate D _{ i1MC }, D _{ i2MC }, I _{ i11MC }, I _{ i12MC }, I _{ i21MC } and I _{ i22MC } based on an MC simulation scheme. Specifically, if we set the MC size to be K, then we draw independent samples G _{ mk }, k = 1, 2, …, K, from P(G _{ m } | G _{ o }), which can be accomplished efficiently based on the peeling algorithm using the SLINK software [21]. The statistic D _{ i1MC } can be estimated by \( {\widehat{D}}_{i1 MC}=\frac{1}{K}\sum_{k=1}^K{D}_{i1}\left({G}_{mk},{G}_o,A\right) \). D _{ i2MC }, I _{ i11MC }, I _{ i12MC }, I _{ i21MC } and I _{ i22MC } can be similarly estimated by \( {\widehat{D}}_{i2 MC} \), \( {\widehat{I}}_{i11 MC} \), \( {\widehat{I}}_{i12 MC} \), \( {\widehat{I}}_{i21 MC} \) and \( {\widehat{I}}_{i22 MC} \), respectively.
Then, the MCGDTI statistic is calculated after replacing D _{ i1}, D _{ i2}, I _{ i11}, I _{ i12}, I _{ i21} and I _{ i22} in Equation (3) by the corresponding \( {\widehat{D}}_{i1 MC} \), \( {\widehat{D}}_{i2 MC} \), \( {\widehat{I}}_{i11 MC} \), \( {\widehat{I}}_{i12 MC} \), \( {\widehat{I}}_{i21 MC} \) and \( {\widehat{I}}_{i22 MC} \) values, respectively. MCGDTI has an asymptotic chi-square distribution with 2 degrees of freedom under the null hypothesis.
Earlier studies showed that the transmission disequilibrium test can be employed for association analysis even when there are imprinting effects [16], and we find that GDTME can also be used for this purpose (see the simulation studies later). Accordingly, for incomplete pedigree data, we extend GDTME without considering imprinting effects and propose MCGDTME to test for association based on the MC sampling and estimation scheme. Similar to MCGDTI, the MCGDTME statistic is calculated by substituting each S _{ i } in Equation (2) by \( {S}_{iMC}=\frac{1}{K}{\sum}_{k=1}^K{S}_i\left({G}_{mk},{G}_o,A\right) \), i.e. MCGDTME \( =\sum_{i=1}^n{S}_{iMC}/\sqrt{\sum_{i=1}^n{S}_{iMC}^2} \). MCGDTME approximately follows a standard normal distribution under the null hypothesis of no association.
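The averaging step can be sketched in Python as below. The sketch assumes that the K completed draws of the missing genotypes have already been produced by an external sampler (for instance the peeling-based sampling mentioned above); the sampling itself is not reproduced here. The function names, the representation of a missing genotype as `None`, and the toy draws are hypothetical.

```python
def pedigree_score(genotype_scores, affected):
    """S_i for a fully genotyped pedigree: pairwise differences divided by N_i."""
    aff = [x for x, a in zip(genotype_scores, affected) if a]
    unaff = [x for x, a in zip(genotype_scores, affected) if not a]
    return sum(xa - xu for xa in aff for xu in unaff) / len(genotype_scores)


def mc_pedigree_score(observed, affected, draws):
    """Average S_i over K completed data sets.

    observed: genotype scores with None where the genotype is missing.
    draws: list of K dicts mapping the index of each missing member to a
           sampled genotype score (assumed to come from P(G_m | G_o)).
    """
    if not draws:
        raise ValueError("at least one Monte Carlo draw is required")
    total = 0.0
    for draw in draws:
        completed = [draw[j] if x is None else x for j, x in enumerate(observed)]
        total += pedigree_score(completed, affected)
    return total / len(draws)


# Hypothetical trio: the first parent is ungenotyped, the child is affected.
observed = [None, 1, 2]
affected = [0, 0, 1]
draws = [{0: 0}, {0: 1}, {0: 1}, {0: 2}]  # K = 4 illustrative draws
s_mc = mc_pedigree_score(observed, affected, draws)
```

The same averaging would apply to each of the score and information terms of MCGDTI; only the statistic being averaged changes.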
Simulation settings
In this section, to explore the performance of the proposed GDTI, MCGDTI and MCGDTME statistics and to compare their powers with those of the existing MCPDTI, GDTME and GDT, we conduct the following simulation studies. We consider a homogeneous population. The marker locus and the disease susceptibility locus are in complete linkage. Three groups of haplotype frequencies for haplotypes DM _{1}, dM _{1}, DM _{2} and dM _{2} are considered to simulate the powers: LD1: {0.13, 0.02, 0.12, 0.73}, LD2: {0.23, 0.12, 0.02, 0.63} and LD3: {0.22, 0.03, 0.03, 0.72}, where the frequency \( {P}_{M_1} \) of marker allele M _{1} for each group is 0.15, 0.35 and 0.25 with the frequency P _{ D } of the disease allele D being fixed at 0.25, and the corresponding LD values are 0.0925, 0.1425 and 0.1575, respectively. To investigate the empirical type I error rates under the null hypothesis of no association, the frequency of each of the four haplotypes is taken as the product of the frequencies of its two alleles. For example, when \( {P}_{M_1}=0.15 \), the frequency of haplotype DM _{1} is P(DM _{1}) = 0.15 × 0.25 = 0.0375.
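The allele frequencies and LD values quoted above follow directly from the haplotype frequencies, using LD = P(DM1) − P_D × P_M1. A short Python check is given below; the function and variable names are hypothetical.

```python
def ld_from_haplotypes(p_DM1, p_dM1, p_DM2, p_dM2):
    """Return (P_D, P_M1, LD) from the four haplotype frequencies."""
    total = p_DM1 + p_dM1 + p_DM2 + p_dM2
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_D = p_DM1 + p_DM2    # disease allele frequency
    p_M1 = p_DM1 + p_dM1   # marker allele frequency
    return p_D, p_M1, p_DM1 - p_D * p_M1


# Haplotype frequencies in the order DM1, dM1, DM2, dM2.
settings = {
    "LD1": (0.13, 0.02, 0.12, 0.73),
    "LD2": (0.23, 0.12, 0.02, 0.63),
    "LD3": (0.22, 0.03, 0.03, 0.72),
}
ld_table = {name: ld_from_haplotypes(*freqs) for name, freqs in settings.items()}
```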
Three sets of the two homozygote penetrances f _{2} and f _{0} for genotypes D/D and d/d, {0.390, 0.260}, {0.440, 0.240} and {0.480, 0.220}, are investigated, with the corresponding relative risk (RR = f _{2}/f _{0}) being 1.500, 1.833 and 2.182, respectively; these are similar to those in Ding et al. [19]. For each set of homozygote penetrances, three imprinting effect models are considered by setting various values of f _{10} and f _{01}: the no, incomplete and complete imprinting effect models. For the no imprinting effect model, we set f _{1} = f _{10} = f _{01} = (f _{2} + f _{0})/2. Note that no association implies no imprinting effects, so we simulate the type I error rates of the proposed test statistics only under no association and no imprinting. Tables 1 and 2 give the simulation settings for studying the empirical size and the test power, respectively.
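The relative risks and the heterozygote penetrance under the no imprinting effect model can be reproduced with a few lines of Python. The helper names are hypothetical, and only quantities stated in the text are computed; the heterozygote penetrances for the incomplete and complete imprinting models are given in Table 2 and are not reconstructed here.

```python
def relative_risk(f2, f0):
    """RR = f2 / f0 for the two homozygote penetrances."""
    return f2 / f0


def no_imprinting_heterozygote_penetrance(f2, f0):
    """f1 = f10 = f01 = (f2 + f0) / 2 under the no imprinting effect model."""
    return (f2 + f0) / 2.0


penetrance_sets = [(0.390, 0.260), (0.440, 0.240), (0.480, 0.220)]
summary = [
    (relative_risk(f2, f0), no_imprinting_heterozygote_penetrance(f2, f0))
    for f2, f0 in penetrance_sets
]
```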
In addition, three types of pedigree structure are considered in our simulation study. The pedigree structures are shown in Fig. 1: (a) two-generation family with 5 individuals, (b) three-generation pedigree with 10 individuals, and (c) four-generation pedigree with 12 individuals. In each replicate, we simulate 30 (50) pedigrees under each pedigree structure, so that the total number of pedigrees is 90 (150). The ascertainment criterion for a pedigree to be included is that it contains at least one affected non-founder. For MCGDTI, MCGDTME and MCPDTI, 50 MC samples of missing genotypes are generated for each replicate using the SLINK software [21]. In the MC sampling process, both the true marker allele frequencies and those estimated from the genotyped founders in each replicate are used.
To assess the performance of the proposed tests (GDTI, MCGDTI and MCGDTME) and to compare them with the existing GDTME and GDT, which do not consider imprinting effects [15], and with MCPDTI, which incorporates imprinting [17], we consider the following 9 tests. GDTI is based on complete data assuming no missing genotypes. The other 8 tests are for incomplete data, after the removal of the genotypes of individual 1 in two-generation families, individuals 1, 4 and 5 in three-generation pedigrees and individuals 1 and 3 in four-generation pedigrees. MCGDTI_{T}, MCGDTME_{T} and MCPDTI_{T} are based on the true marker allele frequencies, while MCGDTI_{E}, MCGDTME_{E} and MCPDTI_{E} are based on the estimated marker allele frequencies. GDTME and GDT are also considered for incomplete data. Under each simulation setting, 10,000 replicates are simulated and the significance level is set at 1%. All the simulations are implemented in the R software (version 3.4.1) [22].
Results
Size and power
Under 9 simulation settings given in Table 1, the empirical type I error rates of GDTI, MCGDTI_{T}, MCGDTI_{E}, MCGDTME_{T}, MCGDTME_{E}, GDTME, GDT, MCPDTI_{T} and MCPDTI_{E} are demonstrated in Table 3, based on 90 and 150 pedigrees at the 1% significance level, respectively. It is shown in Table 3 that the size of all the methods is generally close to the nominal level 1% under the null hypothesis of no association and no imprinting, irrespective of different sample sizes. Thus, our proposed GDTI, MCGDTI_{T}, MCGDTI_{E}, MCGDTME_{T} and MCGDTME_{E} test statistics are valid for testing association.
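One way to judge whether an empirical size is "close to" the nominal level is to compare it with the Monte Carlo sampling error of a proportion estimated from the stated number of replicates. The Python sketch below computes an approximate 95% band around the nominal 1% level for 10,000 replicates. The band relies on a normal approximation to the binomial and is an illustrative assumption, not a criterion taken from the article; the function name is hypothetical.

```python
from math import sqrt


def nominal_level_band(alpha, replicates, z=1.96):
    """Approximate band alpha +/- z * sqrt(alpha * (1 - alpha) / replicates)."""
    se = sqrt(alpha * (1.0 - alpha) / replicates)
    return alpha - z * se, alpha + z * se


lower, upper = nominal_level_band(alpha=0.01, replicates=10_000)
```

Under these assumptions, an empirical type I error rate between roughly 0.8% and 1.2% is within sampling error of the nominal 1% level.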
Figures 2, 3 and 4 give the simulated powers of GDTI, MCGDTI_{T}, MCGDTI_{E}, MCGDTME_{T}, MCGDTME_{E}, GDTME, GDT, MCPDTI_{T} and MCPDTI_{E} based on 150 pedigrees at the 1% significance level under the complete, incomplete and no imprinting effect models, respectively, for different LD and RR values. The first 5 statistics are proposed tests, while the remaining four are existing tests. Additional file 1: Figures S1-S3 show the corresponding simulated powers of all the methods based on 90 pedigrees. From the figures, we find that the powers of MCGDTI, MCGDTME and MCPDTI based on the true marker allele frequencies are very close to those based on the estimated marker allele frequencies (MCGDTI_{T} vs MCGDTI_{E}, MCGDTME_{T} vs MCGDTME_{E}, and MCPDTI_{T} vs MCPDTI_{E}). MCGDTI_{T} and MCGDTI_{E} can recapture much of the missing information and are only a little less powerful than GDTI for complete pedigree data. The existing MCPDTI test performs the worst even though it is constructed for testing association when imprinting effects are taken into consideration. On the other hand, MCGDTME, GDTME and GDT, though not accounting for imprinting, can be used for testing association even when imprinting effects exist. Moreover, they outperform MCPDTI substantially. This is probably because MCGDTME, GDTME and GDT consider genotype differences between all discordant relative pairs, thus utilizing much more information than the first-degree relative pairs used by MCPDTI. In Fig. 2, under the complete imprinting effect model, when the LD and RR values are fixed, the proposed GDTI (assuming the data are complete) and MCGDTI statistics have higher powers than all the other test statistics. GDT (based on the IBD information) performs better than GDTME, which is similar to the result in Chen et al. [15].
When the LD value changes from 0.0925 to 0.1575 and RR is unchanged, or the LD value is fixed and RR increases from 1.500 to 2.182, the powers of all the tests increase. The results in Fig. 3 under the incomplete imprinting effect model are similar to those in Fig. 2. Figure 4 shows the performance of the various tests under the no imprinting effect model. The proposed MCGDTME outperforms all the existing methods. MCGDTI is a bit less powerful than MCGDTME, as expected, and its performance is similar to that of GDTME and GDT. By comparing the results in Figs. 2, 3 and 4, we find that when the imprinting effect model changes from the complete model to the incomplete one (i.e. the degree of imprinting decreases), the powers of GDTI and MCGDTI decrease. GDTI and MCGDTI attain their lowest powers under the no imprinting effect model. Finally, the powers of all the methods based on 150 pedigrees are higher than those based on 90 pedigrees (Fig. 2 vs Additional file 1: Figure S1, Fig. 3 vs Additional file 1: Figure S2, and Fig. 4 vs Additional file 1: Figure S3).
Application to RA data
We apply our proposed methods to the RA dataset from the North American Rheumatoid Arthritis Consortium [20], which was made available through Genetic Analysis Workshop 15 [23]. Use of the data has been approved by the providers of the RA data. In this dataset, a total of 757 pedigrees and 8017 individuals were collected, and 5407 autosomal single nucleotide polymorphisms (SNPs) were used. It should be noted that the genotypes of about 80% of individuals are missing at these SNPs, and thus the proposed MCGDTI (not GDTI) and MCGDTME methods are applied. To compare the performance of the proposed tests with the existing methods, we also implement the GDTME, GDT and MCPDTI methods in this real data analysis. Note also that there are 73 pedigree members with unknown affection statuses in this dataset. In addition, we use the existing Monte Carlo pedigree parental-asymmetry test (MCPPAT) to test whether imprinting is present [18].
We use the following quality control rules to filter the data. First, a pedigree to be included must have at least one affected non-founder. Second, we delete pedigrees with stepfamilies. Finally, if the proportion of individuals with missing genotypes among all the members in a pedigree is more than 50% based on the first SNP on Chromosome 1, then we exclude this pedigree. This avoids the large variability in estimation created by pedigrees with high proportions of missingness. After filtering, we retain 246 pedigrees with 1109 individuals. Among them, there are 11 individuals whose affection statuses are unavailable, and we treat them as unaffected. We use all the available individuals (1992 individuals) in this dataset to estimate the marker allele frequencies, not just the available founders, because of the large proportion of individuals with missing genotypes in this dataset. Then, we calculate the values and the corresponding p-values of all the test statistics based on the estimated allele frequencies and the 246 selected pedigrees. The significance level is fixed at α = 5%, and with Bonferroni correction each individual hypothesis is tested at the significance level of α ^{′} = 0.05/5407 = 9.2473 × 10^{−6}, based on 5407 SNPs. The MC size for MCGDTI, MCGDTME, MCPDTI and MCPPAT is set to be 50.
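The Bonferroni threshold quoted above is simple arithmetic, shown here as a short Python check. The helper name and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests


def flag_significant(p_values, alpha, n_tests):
    """Return True for each p-value at or below the corrected threshold."""
    cutoff = bonferroni_threshold(alpha, n_tests)
    return [p <= cutoff for p in p_values]


threshold = bonferroni_threshold(0.05, 5407)
flags = flag_significant([1e-7, 5e-5, 0.03], alpha=0.05, n_tests=5407)
```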
The results of MCGDTI and MCGDTME at the significance level of α = 5% with Bonferroni correction, based on the p-values of these methods, are shown in Table 4. From the table, MCGDTI identifies 3 SNPs statistically significantly associated with RA, which cannot be found by MCGDTME. Further, the 3 SNPs identified by MCGDTI cannot be detected by GDTME, GDT and MCPDTI; the corresponding contingency tables are the same as Table 4 and are not shown for brevity. The results from this real data application demonstrate a gain in information through incorporating imprinting effects (compared to MCGDTME), through making use of partially genotyped pedigrees (compared to GDTME and GDT), and through including the genotype differences between relatives beyond the first degree (compared to MCPDTI). In addition, we list the p-values of the association tests MCGDTI, MCGDTME, GDTME, GDT, MCPDTI and the imprinting test MCPPAT at these 3 SNPs in Additional file 1: Table S1. The p-values of MCPPAT in this table indicate statistically significant imprinting effects on RA at the 3 SNPs, which may be why MCGDTI is more powerful than the other test statistics.
Discussion
In this article, based on a novel decomposition of the genotype score of an individual according to the paternal or maternal source of an allele, we develop the GDTI test to test for association incorporating imprinting for complete pedigrees without missing genotypes. Then, using an MC sampling and estimation scheme, we extend GDTI and GDTME, and respectively develop MCGDTI and MCGDTME to deal with incomplete pedigrees, in which some individuals’ genotypes are unavailable. Compared to PDTI and MCPDTI, GDTI and MCGDTI make use of the genotype differences between all discordant relative pairs, including those beyond first-degree relatives. Simulation results indicate that GDTI, MCGDTI and MCGDTME control the size well under the null hypothesis of no association and no imprinting. As for the simulated powers, under complete and incomplete imprinting effect models, our proposed GDTI and MCGDTI methods, by considering the information on imprinting effects and all discordant relative pairs, outperform all the existing test statistics, and MCGDTI can recapture much of the missing information. The application to the RA dataset also demonstrates the advantage of MCGDTI over other methods. Further, we demonstrate that the existing GDTME and the proposed MCGDTME, although not constructed under imprinting, can be used for testing association even when imprinting effects exist. Moreover, the MCGDTME test, which we propose to handle incomplete pedigree data with missing genotypes, is found to perform better than GDTME in simulation studies.
One of the major reasons for using within-family tests (e.g. GDTME and GDT) for association is their robustness to PS. On the other hand, note that MCGDTI, MCGDTME and MCPDTI need the MC sampling and estimation scheme to infer missing genotypes in pedigrees, which requires that these pedigrees come from a homogeneous population. To investigate the performance of the proposed test statistics in the presence of PS, we consider a population consisting of two subpopulations and conduct the following simulation study. The parameters are set to be the same as those in Chen et al. [15]. Specifically, suppose that a disease susceptibility locus and a marker locus are in complete linkage but in linkage equilibrium, and both allele frequencies P _{ D } and \( {P}_{M_1} \) are taken to be 0.1 (0.5) in the first (second) subpopulation. The penetrances f _{2}, f _{10}, f _{01} and f _{0} of genotypes D/D, D/d, d/D and d/d are assumed to be 0.45, 0.30, 0.30 and 0.20 in both subpopulations, respectively. In MCGDTI, MCGDTME and MCPDTI, the allele frequency \( {P}_{M_1} \) is estimated from the genotyped founders of all the collected pedigrees, by assuming that they came from a single population, which may cause bias in the estimation of \( {P}_{M_1} \). Two simulation scenarios, differing in pedigree structure or level of genotypic missingness, are considered. In the first scenario, 150 pedigrees (50 two-generation families, 50 three-generation pedigrees and 50 four-generation pedigrees with the pedigree structures listed in Fig. 1) are sampled from each subpopulation, and the only difference between the two subpopulations is the allele frequencies P _{ D } and \( {P}_{M_1} \). In the second scenario, 200 pedigrees (100 two-generation families and 100 three-generation pedigrees with the pedigree structures listed in Fig. 1) are simulated from the first subpopulation and 100 four-generation pedigrees with the pedigree structure listed in Fig. 1 are generated from the second subpopulation, so that the two subpopulations differ markedly in pedigree structure and level of genotypic missingness. The resulting total sample size is 300 pedigrees for each simulation scenario. Other simulation settings are the same as those in the Simulation settings subsection. The simulated size results of GDTI, MCGDTI, MCGDTME, GDTME, GDT and MCPDTI are shown in Table 5. From the table, we find that all the proposed test statistics control the size well under the PS models, while the size of the existing MCPDTI test is a little inflated.
Just as the genotypes of some members in the collected pedigrees may be missing, it is also common in practice that the affection statuses of some individuals in the pedigrees are unavailable. As mentioned in the real data application subsection, one way to deal with these individuals is to treat them as unaffected. To investigate whether this influences the validity of the proposed test statistics, we conduct a few simulation studies. The simulation results show that the proposed methods remain valid for testing association when the missing affection statuses are handled in this way (data not shown). However, this may affect their power under alternative hypotheses, and we will carry out simulation studies to check this in our future work.
Like other methods, our proposed GDTI and MCGDTI have their own limitations. In this article, we only consider using an empirical moment estimate based on large sample theory to estimate the variances of the numerators of GDTI and MCGDTI, and we do not propose the corresponding tests based on variance estimates from the IBD information. This is because, even if the IBD information between two alleles for the pair of allele scores (\( {X}_{ij}^{(p)} \), \( {X}_{ik}^{(p)} \)), (\( {X}_{ij}^{(p)} \), \( {X}_{ik}^{(m)} \)), (\( {X}_{ij}^{(m)} \), \( {X}_{ik}^{(p)} \)) or (\( {X}_{ij}^{(m)} \), \( {X}_{ik}^{(m)} \)) of the j ^{th} and k ^{th} individuals in the i ^{th} pedigree is obtained, the two allele scores in this pair may differ from each other for GDTI and MCGDTI, and thus we cannot estimate the corresponding variance based on the IBD information, which is different from GDT (for details, see Appendix B in Chen et al. [15]). For example, consider a two-generation family in which the genotypes of the unaffected parents and the affected child are M _{1} M _{2}, M _{1} M _{2} and M _{1} M _{1}, respectively. Then, when we compare the allele scores of the unaffected father and the affected child, the allele scores of the father and the child are respectively \( {X}_F^{(p)}={X}_F^{(m)}= \) 0.5 and \( {X}_C^{(p)}={X}_C^{(m)}= \) 1, which are different from each other. Fortunately, in our simulation study, MCGDTI for incomplete pedigrees has power similar to that of GDT under the no imprinting effect model, and is more powerful than GDT under the imprinting effect models.
We should mention that, because the methods use the genotype differences between all discordant relative pairs, a pedigree can be included only if it has at least one affected and one unaffected individual. In addition, GDTI and MCGDTI do not take covariates into account, although covariates may induce dependence between individuals within a family even under the null hypothesis of no association. This may be handled through the quasi-likelihood for a conditional logistic regression model [15, 24, 25]. Incorporating covariates into GDTI and MCGDTI is therefore part of our future work.
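The inclusion requirement can be illustrated with a short Python sketch. It is not taken from the MCGDTI software: the names `discordant_pairs` and `is_informative`, the status coding (1 affected, 0 unaffected, `None` missing) and the `missing_as_unaffected` switch are all hypothetical. As a simplifying assumption, every affected-unaffected pair within a pedigree is treated as a discordant pair, regardless of the degree of relationship.

```python
def discordant_pairs(status, missing_as_unaffected=True):
    """List (affected, unaffected) member pairs within one pedigree.

    status : dict mapping member id -> 1 (affected), 0 (unaffected)
             or None (affection status missing)
    missing_as_unaffected : if True, members with missing status are
             recoded as unaffected; if False, they are left out
    """
    coded = {}
    for member, s in status.items():
        if s is None:
            if not missing_as_unaffected:
                continue           # drop members with unknown status
            s = 0                  # treat missing status as unaffected
        coded[member] = s
    affected = sorted(m for m, s in coded.items() if s == 1)
    unaffected = sorted(m for m, s in coded.items() if s == 0)
    return [(a, u) for a in affected for u in unaffected]


def is_informative(status, missing_as_unaffected=True):
    """A pedigree contributes only if it has at least one discordant pair."""
    return len(discordant_pairs(status, missing_as_unaffected)) > 0


trio = {"father": 0, "mother": 0, "child": 1}
all_affected = {"sib1": 1, "sib2": 1}
pairs = discordant_pairs(trio)
```

For the trio, `pairs` holds the child paired with each parent, while `is_informative(all_affected)` is `False` because no unaffected member is available for comparison.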
Conclusions
Under complete and incomplete imprinting effect models, our proposed GDTI and MCGDTI methods, by considering the information on imprinting effects and all discordant relative pairs within each pedigree, outperform all the existing test statistics, and MCGDTI can recapture much of the missing information. Therefore, MCGDTI is recommended in practice.
Abbreviations
GDT: Generalized disequilibrium test with the variance estimated based on the information on kinship coefficients when identity by descent is unknown
GDTI: Generalized disequilibrium test with imprinting
GDTME: Generalized disequilibrium test based on the moment estimate of the variance
IBD: Identity by descent
LD: Linkage disequilibrium
MC: Monte Carlo
MCGDTI: Monte Carlo GDTI
MCGDTME: Monte Carlo GDTME
MCPDTI: Monte Carlo pedigree disequilibrium test with imprinting
MCPPAT: Monte Carlo pedigree parental-asymmetry test
PDTI: Pedigree disequilibrium test with imprinting
PS: Population stratification
RA: Rheumatoid arthritis
RR: Relative risk
SNP: Single nucleotide polymorphism
References
 1.
Martienssen RA, Colot V. DNA methylation and epigenetic inheritance in plants and filamentous fungi. Science. 2001;293(5532):1070–4.
 2.
Feil R, Berger F. Convergent evolution of genomic imprinting in plants and mammals. Trends Genet. 2007;23(4):192–9.
 3.
Peters J. The role of genomic imprinting in biology and disease: an expanding view. Nat Rev Genet. 2014;15(8):517–30.
 4.
Morison IM, Paton CJ, Cleverley SD. The imprinted gene and parent-of-origin effect database. Nucleic Acids Res. 2001;29(1):275–6.
 5.
Morison IM, Paton CJ, Cleverley SD. The imprinted gene and parent-of-origin effect database. 2001. http://igc.otago.ac.nz. Accessed 26 Mar 2017.
 6.
Glaser RL, Ramsay JP, Morison IM. The imprinted gene and parent-of-origin effect database now includes parental origin of de novo mutations. Nucleic Acids Res. 2006;34(Suppl 1):D29–31.
 7.
Scharfmann R, Shield JPH. Development of the pancreas and neonatal diabetes. 1st ed. Switzerland: Karger; 2007.
 8.
Falls JG, Pulford DJ, Wylie AA, Jirtle RL. Genomic imprinting: implications for human disease. Am J Pathol. 1999;154(3):635–47.
 9.
Ziegler A, König IR, Pahlke F. A statistical approach to genetic epidemiology: concepts and applications, with an E-learning platform. 2nd ed. Germany: Wiley-VCH; 2010.
 10.
Zhou JY, Mao WG, Li DL, Hu YQ, Xia F, Fung WK. A powerful parent-of-origin effects test for qualitative traits incorporating control children in nuclear families. J Hum Genet. 2012;57(8):500–7.
 11.
Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52(3):506–16.
 12.
Horvath S, Xu X, Laird NM. The family based association test method: strategies for studying general genotype-phenotype associations. Eur J Hum Genet. 2001;9(4):301–6.
 13.
Martin ER, Monks SA, Warren LL, Kaplan NL. A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet. 2000;67(1):146–54.
 14.
Laird NM, Horvath S, Xu X. Implementing a unified approach to family-based tests of association. Genet Epidemiol. 2000;19(Suppl 1):S36–42.
 15.
Chen WM, Manichaikul A, Rich SS. A generalized family-based association test for dichotomous traits. Am J Hum Genet. 2009;85(3):364–76.
 16.
Xia F, Zhou JY, Fung WK. A powerful approach for association analysis incorporating imprinting effects. Bioinformatics. 2011;27(18):2571–7.
 17.
Zhou JY, He HQ, You XP, Li SZ, Chen PY, Fung WK. A powerful association test for qualitative traits incorporating imprinting effects using general pedigree data. J Hum Genet. 2015;60(2):77–83.
 18.
Zhou JY, Ding J, Fung WK, Lin S. Detection of parent-of-origin effects using general pedigree data. Genet Epidemiol. 2010;34(2):151–8.
 19.
Ding J, Lin S, Liu Y. Monte Carlo pedigree disequilibrium test for markers on the X chromosome. Am J Hum Genet. 2006;79(3):567–73.
 20.
Amos CI, Chen WV, Remmers E, Siminovitch KA, Seldin MF, Criswell LA, et al. Data for genetic analysis workshop (GAW) 15 problem 2, genetic causes of rheumatoid arthritis and associated traits. BMC Proc. 2007;1(Suppl 1):S3.
 21.
Ott J, Lathrop GM. SLINK: a general simulation program for linkage analysis. Am J Hum Genet. 1990;47:A204.
 22.
R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2013. http://www.r-project.org. 2017.
 23.
Genetic Analysis Workshop. 1982. https://www.gaworkshop.org. Accessed 26 Mar 2017.
 24.
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
 25.
Liang KY, Pulver AE. Analysis of case-control/family sampling design. Genet Epidemiol. 1996;13(3):253–70.
Acknowledgments
The authors thank the reviewer for helpful comments that greatly improved the presentation of the article. The authors thank the Genetic Analysis Workshops for providing the RA data, which were supported by the National Institutes of Health grant R01 GM031575. The RA data were gathered with the support of the National Institutes of Health grants N01AR22263 and R01AR44422, and the National Arthritis Foundation.
Funding
This work was supported by the National Natural Science Foundation of China grants 81373098, 81773544 and 81573207, the Science and Technology Planning Project of Guangdong Province of China grant 2013B021800038 and the Hong Kong RGC GRF Research Grant 17301715.
Availability of data and materials
The dataset supporting the conclusions of this article is from the North American Rheumatoid Arthritis Consortium and is made available through Genetic Analysis Workshop 15 (http://www.gaworkshop.org/). Our software MCGDTI, which is implemented in R (http://www.r-project.org/), is freely available at http://www.echobelt.org/web/UploadFiles/MCGDTI.html.
Author information
Contributions
JLL, PW, WKF and JYZ all contributed to the study design, analytical preparation and the writing of the manuscript. JLL and PW performed the simulation studies. JLL, WKF and JYZ analyzed the data and revised the manuscript. All authors have read and approved the final manuscript.
Corresponding authors
Correspondence to Wing Kam Fung or Ji-Yuan Zhou.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional files
Additional file 1: Appendix.
Construction of the GDTI test statistic. Table S1. P-values of the test statistics applied to RA data at 3 SNPs with P _{MCGDTI} < 9.2473 × 10^{−6}. Figures S1–S3. Simulated powers of all the test statistics. The test statistics are T1: GDTI, T2: MCGDTI_{T}, T3: MCGDTI_{E}, T4: MCGDTME_{T}, T5: MCGDTME_{E}, T6: GDTME, T7: GDT, T8: MCPDTI_{T} and T9: MCPDTI_{E}. The simulations are conducted under complete, incomplete and no imprinting effect models at the 1% significance level based on 10,000 replicates for 90 pedigrees when LD = 0.0925, 0.1425 and 0.1575, and RR = 1.500, 1.833 and 2.182, respectively. The first 5 statistics are proposed tests, while the remaining 4 are existing tests. (PDF 76 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 Generalized disequilibrium test
 Genomic imprinting
 Monte Carlo sampling
 Qualitative trait